
URL: https://deepwiki.com/SciSharp/LLamaSharp/7.4-web-and-api-examples

Last indexed: 18 May 2026 (ecd184)

Web and API Examples

LLamaSharp provides two primary examples for hosting Large Language Models in a web environment: LLama.WebAPI, a REST-based implementation using ASP.NET Core Controllers, and LLama.Web, a real-time implementation using SignalR. Both demonstrate how to manage the lifecycle of models and contexts on a server that handles multiple users.

LLama.WebAPI: RESTful Integration

The LLama.WebAPI project demonstrates how to expose LLamaSharp functionality via standard HTTP endpoints. It supports both stateful (persistent session) and stateless (one-shot) chat interactions.

Architecture and Data Flow

The API is structured around two service implementations: StatefulChatService and StatelessChatService. Both are registered in the dependency injection (DI) container during application startup (LLama.WebAPI/Program.cs:10-11).

Stateful vs Stateless Flow

| Feature | Stateful Service | Stateless Service |
| --- | --- | --- |
| Lifetime | Singleton (LLama.WebAPI/Program.cs:10) | Scoped (LLama.WebAPI/Program.cs:11) |
| Persistence | Maintains a ChatSession (LLama.WebAPI/Services/StatefulChatService.cs:9) | Creates fresh session/history per request |
| Executor | InteractiveExecutor (LLama.WebAPI/Services/StatefulChatService.cs:29) | InteractiveExecutor with manually managed history (LLama.WebAPI/Services/StatelessChatService.cs:25) |

Sources: LLama.WebAPI/Program.cs:10-11, LLama.WebAPI/Services/StatefulChatService.cs:6-31, LLama.WebAPI/Services/StatelessChatService.cs:7-28
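
The difference between the two lifetimes can be illustrated with a small language-neutral sketch. It is written in Python, not C#, and it is not the project's code: `Container`, `StatefulChat`, `StatelessChat`, and `fake_model` are hypothetical stand-ins, and a plain dict plays the role of a per-request scope. It only shows why a singleton service accumulates history across requests while a scoped one starts fresh each time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


def fake_model(history: List[Message]) -> str:
    """Stand-in for inference: reports how many messages it was given."""
    return f"reply based on {len(history)} message(s)"


@dataclass
class StatefulChat:
    """Singleton-style service: one history shared by every request."""
    history: List[Message] = field(
        default_factory=lambda: [("System", "You are helpful.")]
    )

    def send(self, text: str) -> str:
        self.history.append(("User", text))
        reply = fake_model(self.history)
        self.history.append(("Assistant", reply))
        return reply


class StatelessChat:
    """Scoped-style service: the caller supplies the whole history each time."""

    def send(self, history: List[Message]) -> str:
        return fake_model(list(history))  # nothing is kept after the call


class Container:
    """Tiny service locator that distinguishes singleton and scoped lifetimes."""

    def __init__(self) -> None:
        self._singletons: Dict[type, object] = {}
        self._registrations: Dict[type, Tuple[str, Callable[[], object]]] = {}

    def add(self, lifetime: str, cls: type) -> None:
        if lifetime not in ("singleton", "scoped"):
            raise ValueError(f"unknown lifetime: {lifetime}")
        self._registrations[cls] = (lifetime, cls)

    def resolve(self, cls: type, scope: Dict[type, object]) -> object:
        lifetime, factory = self._registrations[cls]
        # Singletons live in the container; scoped services live in the
        # dict that represents the current request.
        cache = self._singletons if lifetime == "singleton" else scope
        if cls not in cache:
            cache[cls] = factory()
        return cache[cls]
```

Resolving `StatefulChat` from two different scope dicts returns the same object, so a second `send` sees the first exchange. Resolving `StatelessChat` from two scopes returns two separate objects, which is why the client has to send the full history with every call.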

Code Entity Space: LLama.WebAPI

The following diagram maps the API routes to the underlying service logic and LLamaSharp entities.

[Diagram: LLama.WebAPI Request Routing]


Sources: LLama.WebAPI/Controllers/ChatController.cs:19-50, LLama.WebAPI/Services/StatefulChatService.cs:24-31, LLama.WebAPI/Services/StatefulChatService.cs:38-47, LLama.WebAPI/Services/StatelessChatService.cs:30-35

Implementation Details

  1. Stateful Chat: Uses a single ChatSession whose SystemPrompt is initialized once (LLama.WebAPI/Services/StatefulChatService.cs:14-30). Subsequent calls to Send append to this existing history (LLama.WebAPI/Services/StatefulChatService.cs:47).
  2. Streaming: The /Send/Stream endpoint returns an IAsyncEnumerable<string> and sets the response content type to text/event-stream, so tokens are pushed to the client as they are generated (LLama.WebAPI/Controllers/ChatController.cs:25-38).
  3. Stateless Chat: Accepts a full ChatHistory object from the client (LLama.WebAPI/Controllers/ChatController.cs:41-50). It uses a custom HistoryTransform to ensure the model response starts with an "Assistant:" prefix (LLama.WebAPI/Services/StatelessChatService.cs:48-55).

Sources: LLama.WebAPI/Services/StatefulChatService.cs:69-96, LLama.WebAPI/Controllers/ChatController.cs:29-34, LLama.WebAPI/Services/StatelessChatService.cs:48-55
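
The streaming and history-transform ideas above can be sketched in a few lines. The following Python is an illustration only, not the project's C# code: `history_to_prompt`, `fake_token_stream`, and `sse_events` are hypothetical names, the model is replaced by a fixed token list, and the `data:` framing follows the general Server-Sent Events wire format. Whether the example controller emits `data:` frames or writes raw tokens is not stated above, so treat the framing as an assumption.

```python
import asyncio
from typing import AsyncIterator, Iterable, List, Tuple

Message = Tuple[str, str]  # (role, content)


def history_to_prompt(history: Iterable[Message]) -> str:
    """Flatten a chat history into text and end with an open assistant turn."""
    lines = [f"{role}: {content}" for role, content in history]
    lines.append("Assistant:")  # cue the model to answer as the assistant
    return "\n".join(lines)


async def fake_token_stream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a model: yields a fixed reply one token at a time."""
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in an SSE frame: data line(s) followed by a blank line."""
    async for token in tokens:
        # A token containing newlines needs one "data:" line per text line.
        body = "".join(f"data: {line}\n" for line in token.split("\n"))
        yield body + "\n"


async def collect(prompt: str) -> List[str]:
    """Drain the stream into a list, as a test client might."""
    return [frame async for frame in sse_events(fake_token_stream(prompt))]
```

A caller would build the prompt with `history_to_prompt(...)` and then iterate `sse_events(...)`, writing each frame to the response as soon as it arrives rather than waiting for the full reply.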


LLama.Web: SignalR and Multi-Session Management

The LLama.Web project is a more advanced demo that uses SignalR for real-time bidirectional communication. It is designed to handle multiple concurrent sessions, where each browser tab can maintain its own unique context.

Session Lifecycle Management

The core of this implementation is the ModelSessionService, which acts as a coordinator between the SignalR Hub and the underlying LLamaSharp models.

[Diagram: Model and Session Management Flow]


Sources: LLama.Web/Services/ModelSessionService.cs:64-81, LLama.Web/Models/ModelSession.cs:19-31, LLama.Web/Services/ModelSessionService.cs:108-111

Key Components

Sources: LLama.Web/Services/ModelSessionService.cs:15-30, LLama.Web/Models/LLamaModel.cs:11, LLama.Web/Models/LLamaModel.cs:76, LLama.Web/Models/ModelSession.cs:7-31, LLama.Web/Models/ModelSession.cs:121-128, LLama.Web/Services/ModelSessionService.cs:108-111
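
As a rough illustration of the coordinator role described in this section, the Python sketch below keeps one session per connection id and refuses new sessions once a per-model limit is reached. It is not the ModelSessionService implementation: `SessionRegistry`, `Session`, and the error types are hypothetical, the limit is an assumed reading of a "maximum instances" setting, and the sketch is single-threaded, whereas a real server would need a thread-safe collection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    session_id: str  # e.g. one id per browser tab or connection
    model_name: str
    history: List[str] = field(default_factory=list)


class SessionRegistry:
    """Maps connection ids to sessions and enforces a per-model instance cap."""

    def __init__(self, max_instances: int) -> None:
        self._max_instances = max_instances
        self._sessions: Dict[str, Session] = {}

    def create(self, session_id: str, model_name: str) -> Session:
        if session_id in self._sessions:
            raise ValueError(f"session {session_id!r} already exists")
        in_use = sum(
            s.model_name == model_name for s in self._sessions.values()
        )
        if in_use >= self._max_instances:
            raise RuntimeError(f"no free context for model {model_name!r}")
        session = Session(session_id, model_name)
        self._sessions[session_id] = session
        return session

    def get(self, session_id: str) -> Optional[Session]:
        return self._sessions.get(session_id)

    def close(self, session_id: str) -> bool:
        """Remove a session (e.g. on disconnect); returns False if unknown."""
        return self._sessions.pop(session_id, None) is not None
```

Closing a session on disconnect is what frees a slot for the next tab, so a hub built on this pattern would call `close` from its disconnect handler.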

Configuration (appsettings.json)

The web demo uses a structured configuration to define model parameters like GpuLayerCount, ContextSize, and BatchSize.

| Parameter | Description | Source |
| --- | --- | --- |
| ModelPath | Path to the GGUF model file | LLama.Web/appsettings.json:15 |
| MaxInstances | Maximum number of context instances allowed | LLama.Web/appsettings.json:14 |
| GpuLayerCount | Number of layers to offload to GPU | LLama.Web/appsettings.json:19 |
| ModelLoadType | Controls if models are preloaded or loaded on demand | LLama.Web/appsettings.json:10 |

Sources: LLama.Web/appsettings.json:9-38
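
To make the parameter list concrete, here is a hedged Python sketch that parses and sanity-checks a configuration of this general shape. The parameter names come from the text above, but the JSON layout (a top-level `Models` array), every value in `SAMPLE`, and the validation rules are assumptions made for illustration; consult the actual appsettings.json for the real structure.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical layout and values; only the key names appear in the text above.
SAMPLE = """
{
  "ModelLoadType": 0,
  "Models": [
    {
      "MaxInstances": 2,
      "ModelPath": "models/example.gguf",
      "ContextSize": 2048,
      "BatchSize": 512,
      "GpuLayerCount": 0
    }
  ]
}
"""


@dataclass(frozen=True)
class ModelConfig:
    model_path: str
    max_instances: int
    context_size: int
    batch_size: int
    gpu_layer_count: int


def parse_models(raw: str) -> List[ModelConfig]:
    """Read every entry under "Models" and reject obviously invalid values."""
    data = json.loads(raw)
    configs = []
    for entry in data["Models"]:
        cfg = ModelConfig(
            model_path=entry["ModelPath"],
            max_instances=int(entry["MaxInstances"]),
            context_size=int(entry["ContextSize"]),
            batch_size=int(entry["BatchSize"]),
            gpu_layer_count=int(entry["GpuLayerCount"]),
        )
        if cfg.max_instances < 1:
            raise ValueError("MaxInstances must be at least 1")
        if cfg.context_size < 1 or cfg.batch_size < 1:
            raise ValueError("ContextSize and BatchSize must be positive")
        configs.append(cfg)
    return configs
```

Validating at startup means a bad value fails immediately instead of surfacing when the first user opens a session.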

Inference Execution

When a user sends a message via SignalR, the ModelSessionService.InferAsync method is called. It yields TokenModel objects, which contain the generated text and metadata such as the token type (TokenType.Begin, TokenType.Content, TokenType.End) (LLama.Web/Services/ModelSessionService.cs:118-132).

The ModelSession chooses the appropriate executor (Interactive, Instruct, or Stateless) based on the session configuration (LLama.Web/Models/ModelSession.cs:130-139).

Sources: LLama.Web/Services/ModelSessionService.cs:106-138, LLama.Web/Models/ModelSession.cs:88-98, LLama.Web/Models/ModelSession.cs:130-139
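
The two ideas in this section, a typed token stream and executor selection by configuration, can be sketched as follows. This is Python written for illustration and does not reproduce the project's classes: `Token`, `ExecutorKind`, `EXECUTORS`, `infer`, and `render` are hypothetical, and the three "executors" are trivial stand-ins that only tag their output so the dispatch is visible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable, Iterator


class TokenType(Enum):
    BEGIN = "Begin"
    CONTENT = "Content"
    END = "End"


@dataclass(frozen=True)
class Token:
    type: TokenType
    text: str = ""


class ExecutorKind(Enum):
    INTERACTIVE = "Interactive"
    INSTRUCT = "Instruct"
    STATELESS = "Stateless"


def _interactive(prompt: str) -> Iterator[str]:
    yield from ["[chat] ", prompt]


def _instruct(prompt: str) -> Iterator[str]:
    yield from ["[instruct] ", prompt]


def _stateless(prompt: str) -> Iterator[str]:
    yield from ["[one-shot] ", prompt]


# Lookup table standing in for "choose the executor from session config".
EXECUTORS: Dict[ExecutorKind, Callable[[str], Iterator[str]]] = {
    ExecutorKind.INTERACTIVE: _interactive,
    ExecutorKind.INSTRUCT: _instruct,
    ExecutorKind.STATELESS: _stateless,
}


def infer(kind: ExecutorKind, prompt: str) -> Iterator[Token]:
    """Bracket the generated pieces with Begin and End marker tokens."""
    executor = EXECUTORS[kind]
    yield Token(TokenType.BEGIN)
    for piece in executor(prompt):
        yield Token(TokenType.CONTENT, piece)
    yield Token(TokenType.END)


def render(tokens: Iterable[Token]) -> str:
    """Client-side view: concatenate only the content tokens."""
    return "".join(t.text for t in tokens if t.type is TokenType.CONTENT)
```

The Begin and End markers let a client open and close a message bubble without guessing when generation has finished, while the content tokens carry the text itself.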