Last indexed: 18 May 2026 (ecd184)

Web and API Examples

LLamaSharp provides two primary examples for hosting Large Language Models in a web environment: LLama.WebAPI, a REST-based implementation using ASP.NET Core Controllers, and LLama.Web, a real-time implementation using SignalR. Both demonstrate how to manage the lifecycle of models and contexts on a server that handles multiple users.

LLama.WebAPI: RESTful Integration

The LLama.WebAPI project demonstrates how to expose LLamaSharp functionality via standard HTTP endpoints. It supports both stateful (persistent session) and stateless (one-shot) chat interactions.

Architecture and Data Flow

The API is structured around two service implementations: StatefulChatService and StatelessChatService. These are registered in the dependency injection (DI) container during application startup (LLama.WebAPI/Program.cs:10-11).

Stateful vs Stateless Flow

Feature	Stateful Service	Stateless Service
Lifetime	`Singleton` (LLama.WebAPI/Program.cs:10)	`Scoped` (LLama.WebAPI/Program.cs:11)
Persistence	Maintains a `ChatSession` (LLama.WebAPI/Services/StatefulChatService.cs:9)	Creates a fresh session/history per request
Executor	`InteractiveExecutor` (LLama.WebAPI/Services/StatefulChatService.cs:29)	`InteractiveExecutor` with manually managed history (LLama.WebAPI/Services/StatelessChatService.cs:25)

Sources: LLama.WebAPI/Program.cs:10-11, LLama.WebAPI/Services/StatefulChatService.cs:6-31, LLama.WebAPI/Services/StatelessChatService.cs:7-28
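The difference between the two services can be illustrated with a small language-neutral sketch, written here in TypeScript rather than the project's C#. Everything in it is hypothetical: `fakeModel`, `StatefulChat`, and `statelessSend` are stand-ins invented for illustration and do not mirror the actual LLamaSharp classes. It only shows where the conversation history lives in each style.

```typescript
// Hypothetical message shape; not the LLamaSharp ChatHistory type.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Stand-in for model inference: reports how many user turns it was shown.
function fakeModel(history: Message[]): string {
  return `reply #${history.filter((m) => m.role === "user").length}`;
}

// Stateful style: one long-lived instance owns the history, so every
// call to send() builds on the turns that came before it.
class StatefulChat {
  private readonly history: Message[];

  constructor(systemPrompt: string) {
    this.history = [{ role: "system", content: systemPrompt }];
  }

  send(text: string): string {
    this.history.push({ role: "user", content: text });
    const reply = fakeModel(this.history);
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
}

// Stateless style: the caller supplies the full history on every request
// and nothing is remembered between calls.
function statelessSend(history: Message[]): string {
  return fakeModel(history);
}

const chat = new StatefulChat("You are a helpful assistant.");
chat.send("Hello");
const second = chat.send("And again"); // the model sees both user turns
const oneShot = statelessSend([{ role: "user", content: "Hello" }]);
```

Because the stateful instance is shared, all callers of a singleton service of this kind would see the same history, which is the trade-off the lifetime row in the table points at.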

Code Entity Space: LLama.WebAPI

Figure: LLama.WebAPI Request Routing. The diagram maps the API routes to the underlying service logic and LLamaSharp entities.

Sources: LLama.WebAPI/Controllers/ChatController.cs:19-50, LLama.WebAPI/Services/StatefulChatService.cs:24-31, LLama.WebAPI/Services/StatefulChatService.cs:38-47, LLama.WebAPI/Services/StatelessChatService.cs:30-35

Implementation Details

Stateful Chat: Uses a single ChatSession in which the SystemPrompt is initialized once (LLama.WebAPI/Services/StatefulChatService.cs:14-30). Subsequent calls to Send append to this existing history (LLama.WebAPI/Services/StatefulChatService.cs:47).
Streaming: The /Send/Stream endpoint returns IAsyncEnumerable<string> and sets the response content type to text/event-stream, so tokens are pushed to the client as they are generated (LLama.WebAPI/Controllers/ChatController.cs:25-38).
Stateless Chat: Accepts a full ChatHistory object from the client (LLama.WebAPI/Controllers/ChatController.cs:41-50). It uses a custom HistoryTransform to ensure the model response starts with an "Assistant:" prefix (LLama.WebAPI/Services/StatelessChatService.cs:48-55).

Sources: LLama.WebAPI/Services/StatefulChatService.cs:69-96, LLama.WebAPI/Controllers/ChatController.cs:29-34, LLama.WebAPI/Services/StatelessChatService.cs:48-55
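The history transform mentioned for the stateless service can be pictured as a function that flattens a chat history into a prompt and leaves an "Assistant:" cue at the end. The sketch below is an illustration only, in TypeScript rather than C#: `ChatMessage` and `historyToPrompt` are hypothetical names, and the exact text layout used by the real HistoryTransform is an assumption, not taken from the source.

```typescript
// Hypothetical message shape for illustration.
type Role = "System" | "User" | "Assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Flatten the history into "Role: content" lines, then end with a bare
// "Assistant:" line so generation continues as the assistant's turn.
// The line format is an assumption made for this sketch.
function historyToPrompt(history: ChatMessage[]): string {
  const lines = history.map((m) => `${m.role}: ${m.content}`);
  lines.push("Assistant:");
  return lines.join("\n");
}

const prompt = historyToPrompt([
  { role: "System", content: "Be brief." },
  { role: "User", content: "Hi" },
]);
// prompt is "System: Be brief.\nUser: Hi\nAssistant:"
```

Keeping this step as a pure function makes the prompt layout easy to test independently of the model.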

LLama.Web: SignalR and Multi-Session Management

The LLama.Web project is a more advanced demo that uses SignalR for real-time bidirectional communication. It is designed to handle multiple concurrent sessions, where each browser tab can maintain its own unique context.

Session Lifecycle Management

The core of this implementation is the ModelSessionService, which acts as a coordinator between the SignalR Hub and the underlying LLamaSharp models.

Figure: Model and Session Management Flow.

Sources: LLama.Web/Services/ModelSessionService.cs:64-81, LLama.Web/Models/ModelSession.cs:19-31, LLama.Web/Services/ModelSessionService.cs:108-111

Key Components

ModelService: Manages the loading and unloading of LLamaWeights, wrapped in LLamaModel (LLama.Web/Models/LLamaModel.cs:11). It supports different loading strategies and creates LLamaContext instances from the loaded weights (LLama.Web/Models/LLamaModel.cs:76).
ModelSession: Encapsulates a specific user's interaction. It holds the ILLamaExecutor and the LLamaContext, and handles output transformations such as keyword filtering with KeywordTextOutputStreamTransform (LLama.Web/Models/ModelSession.cs:7-31, LLama.Web/Models/ModelSession.cs:121-128).
AsyncGuard: Used in ModelSessionService to ensure that only one inference request runs per session ID at any given time (LLama.Web/Services/ModelSessionService.cs:108-111).

Sources: LLama.Web/Services/ModelSessionService.cs:15-30, LLama.Web/Models/LLamaModel.cs:11, LLama.Web/Models/LLamaModel.cs:76, LLama.Web/Models/ModelSession.cs:7-31, LLama.Web/Models/ModelSession.cs:121-128, LLama.Web/Services/ModelSessionService.cs:108-111
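The one-request-per-session rule described for AsyncGuard can be sketched as a small re-entrancy guard keyed by session ID. This is a minimal TypeScript illustration under stated assumptions, not the LLamaSharp AsyncGuard: `SessionGuard` and `runExclusive` are hypothetical names, and rejecting a second request with an error (rather than queueing it) is a design choice made for this sketch.

```typescript
// Hypothetical guard: tracks which session IDs currently have work in flight.
class SessionGuard {
  private readonly busy = new Set<string>();

  // Returns true and marks the session busy, or false if it already is.
  tryEnter(sessionId: string): boolean {
    if (this.busy.has(sessionId)) {
      return false;
    }
    this.busy.add(sessionId);
    return true;
  }

  // Marks the session as idle again.
  release(sessionId: string): void {
    this.busy.delete(sessionId);
  }
}

// Runs work for a session only if no other work is in flight for it.
// The guard is always released, even when work throws.
function runExclusive<T>(guard: SessionGuard, sessionId: string, work: () => T): T {
  if (!guard.tryEnter(sessionId)) {
    throw new Error(`session ${sessionId} is already running an inference`);
  }
  try {
    return work();
  } finally {
    guard.release(sessionId);
  }
}

const guard = new SessionGuard();
const answer = runExclusive(guard, "tab-1", () => "done");
```

Different session IDs never block each other here, which matches the idea that each browser tab keeps its own context.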

Configuration (appsettings.json)

The web demo uses a structured configuration to define model parameters like GpuLayerCount, ContextSize, and BatchSize.

Parameter	Description	Source
`ModelPath`	Path to the GGUF model file	LLama.Web/appsettings.json:15
`MaxInstances`	Maximum number of context instances allowed	LLama.Web/appsettings.json:14
`GpuLayerCount`	Number of layers to offload to the GPU	LLama.Web/appsettings.json:19
`ModelLoadType`	Controls whether models are preloaded or loaded on demand	LLama.Web/appsettings.json:10

Sources: LLama.Web/appsettings.json:9-38
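As a rough illustration of how such settings might be sanity-checked before a model is loaded, the TypeScript sketch below validates a flat options object. The parameter names come from the section above, but the flat shape, the types, the validation rules, and the example values are all assumptions made for this sketch; the real appsettings.json layout and the checks LLama.Web performs may differ.

```typescript
// Hypothetical flat view of the model settings named above.
interface ModelOptions {
  ModelPath: string;
  MaxInstances: number;
  GpuLayerCount: number;
  ContextSize: number;
  BatchSize: number;
}

// Returns a list of human-readable problems; an empty list means the
// options passed these (assumed) checks.
function validateModelOptions(o: ModelOptions): string[] {
  const problems: string[] = [];
  if (o.ModelPath.trim() === "") {
    problems.push("ModelPath must not be empty");
  }
  if (!Number.isInteger(o.MaxInstances) || o.MaxInstances < 1) {
    problems.push("MaxInstances must be a positive integer");
  }
  if (!Number.isInteger(o.GpuLayerCount) || o.GpuLayerCount < 0) {
    problems.push("GpuLayerCount must be a non-negative integer");
  }
  if (!Number.isInteger(o.ContextSize) || o.ContextSize < 1) {
    problems.push("ContextSize must be a positive integer");
  }
  if (!Number.isInteger(o.BatchSize) || o.BatchSize < 1) {
    problems.push("BatchSize must be a positive integer");
  }
  return problems;
}

// Example values are placeholders, not the demo's defaults.
const exampleOptions: ModelOptions = {
  ModelPath: "models/example.gguf",
  MaxInstances: 2,
  GpuLayerCount: 0,
  ContextSize: 2048,
  BatchSize: 512,
};
const exampleProblems = validateModelOptions(exampleOptions); // empty list
```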

Inference Execution

When a user sends a message via SignalR, the ModelSessionService.InferAsync method is called. It yields TokenModel objects that contain the generated text and metadata such as TokenType.Begin, TokenType.Content, and TokenType.End (LLama.Web/Services/ModelSessionService.cs:118-132).

The ModelSession chooses the appropriate executor (Interactive, Instruct, or Stateless) based on the session configuration (LLama.Web/Models/ModelSession.cs:130-139).

Sources: LLama.Web/Services/ModelSessionService.cs:106-138, LLama.Web/Models/ModelSession.cs:88-98, LLama.Web/Models/ModelSession.cs:130-139
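On the receiving side, a client has to turn the Begin/Content/End token sequence back into a single response. The TypeScript sketch below shows one way that could look. Only the three token type names come from the text above; the `TokenModel` field names (`type`, `content`) and the `assembleResponse` helper are hypothetical and do not reflect the actual LLama.Web types or its client script.

```typescript
// Token kinds named in the section above; the object shape is assumed.
type TokenType = "Begin" | "Content" | "End";

interface TokenModel {
  type: TokenType;
  content: string;
}

// Concatenates Content tokens that arrive between Begin and End.
// Content seen before Begin is ignored; anything after End is not read.
function assembleResponse(tokens: TokenModel[]): string {
  let text = "";
  let started = false;
  for (const token of tokens) {
    if (token.type === "Begin") {
      started = true;
      text = "";
    } else if (token.type === "Content" && started) {
      text += token.content;
    } else if (token.type === "End") {
      break;
    }
  }
  return text;
}

const tokens: TokenModel[] = [
  { type: "Begin", content: "" },
  { type: "Content", content: "Hel" },
  { type: "Content", content: "lo" },
  { type: "End", content: "" },
];
const response = assembleResponse(tokens); // "Hello"
```

In a live client the same logic would typically run incrementally as each token message arrives, updating the page instead of building a string at the end.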

URL: https://deepwiki.com/SciSharp/LLamaSharp/7.4-web-and-api-examples