This page walks through the minimum code required to load a model, create a context, set up a ChatSession, and read streamed output token by token. It assumes packages are already installed and a GGUF model file is available. For installation steps and backend selection, see Installation and Setup. For a conceptual explanation of the layered architecture, see Core Architecture.
| Requirement | Details |
|---|---|
| NuGet packages | LLamaSharp + one backend (e.g. LLamaSharp.Backend.Cpu) README.md92-104 |
| Model file | A .gguf file; see Installation and Setup for sourcing guidance README.md108-115 |
| Target framework | net8.0 or netstandard2.0 LLama/LLamaSharp.csproj3 |
The following diagram maps the conceptual steps (configure → load → wrap → chat) to the specific C# types involved in a typical text generation workflow.
Diagram: From configuration to streaming output
Sources: LLama.Examples/Examples/LLama3ChatSession.cs17-33 LLama.Examples/Examples/LLama3ChatSession.cs64-72
Before any LLM operations, you can configure global settings like logging or preferred hardware backends using NativeLibraryConfig. This must be done before any other native calls are made, as the static constructor of NativeApi locks the configuration upon the first P/Invoke.
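A minimal sketch of this configuration step. It assumes a recent LLamaSharp release that exposes NativeLibraryConfig.All (older releases used NativeLibraryConfig.Instance instead); the log format is only illustrative.

```csharp
using System;
using LLama.Native;

// Run this before any other LLamaSharp call: once the native library
// has been loaded, the configuration can no longer be changed.
NativeLibraryConfig
    .All
    .WithLogCallback((level, message) => Console.Write($"{level}: {message}"));

// Optional: force the native library to load now, so that a missing or
// mismatched backend fails here rather than later during model loading.
NativeApi.llama_empty_call();
```

Placing this at the very top of the program entry point is the simplest way to guarantee the ordering.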
Sources: LLama.Examples/Examples/SpeechChat.cs55-65 LLama/LLamaSharp.csproj71-81
ModelParams (in the LLama.Common namespace) holds the model path and hardware tuning options.
| Property | Purpose |
|---|---|
| ContextSize | Maximum token context window (e.g. 1024, 4096) |
| GpuLayerCount | Number of transformer layers offloaded to GPU; 0 = CPU only LLama.Examples/Examples/LLama3ChatSession.cs19 |
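The two properties above can be set with an object initializer. This is a sketch only: the model path is a hypothetical placeholder and the numeric values are examples, not recommendations.

```csharp
using System;
using LLama.Common;

// Hypothetical path: point this at your own GGUF file.
var modelPath = "models/model.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096,  // maximum number of tokens held in the context window
    GpuLayerCount = 10   // layers offloaded to the GPU; use 0 for CPU-only inference
};

Console.WriteLine($"Model: {parameters.ModelPath}, context size: {parameters.ContextSize}");
```

Constructing ModelParams does not touch the file; the path is only read when the weights are loaded.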
Sources: LLama.Examples/Examples/LLama3ChatSession.cs17-20
LLamaWeights.LoadFromFile(parameters) (or its asynchronous counterpart, LoadFromFileAsync) reads the GGUF file from disk and allocates native memory. The returned object is IDisposable; declare it with using so that the native handle is released when it goes out of scope.
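A short sketch of the asynchronous variant, assuming a hypothetical model path. The synchronous call, LLamaWeights.LoadFromFile(parameters), can be substituted on the same line.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/model.gguf"); // hypothetical path

// The using declaration disposes the weights, and with them the native
// memory, when the enclosing scope ends.
using var model = await LLamaWeights.LoadFromFileAsync(parameters);

Console.WriteLine("Model loaded.");
```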
Sources: LLama.Examples/Examples/LLama3ChatSession.cs22 LLama.Examples/Examples/ChatSessionWithHistory.cs16
model.CreateContext(parameters) returns an LLamaContext and allocates the KV cache and other per-context state. A single LLamaWeights object can back multiple contexts simultaneously.
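The sketch below creates two contexts from one set of weights to illustrate the sharing described above. The path and variable names are hypothetical.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/model.gguf") // hypothetical path
{
    ContextSize = 2048
};

using var model = LLamaWeights.LoadFromFile(parameters);

// Each context owns its own KV cache; the weights are shared.
using var firstContext = model.CreateContext(parameters);
using var secondContext = model.CreateContext(parameters);

Console.WriteLine($"Context size: {firstContext.ContextSize} tokens");
```

Because using declarations are disposed in reverse order of declaration, both contexts are released before the weights they depend on.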
Sources: LLama.Examples/Examples/LLama3ChatSession.cs23
For interactive chat (stateful, conversation memory across turns) use InteractiveExecutor. For one-shot tasks where context is not preserved, use StatelessExecutor.
| Executor | Use case |
|---|---|
| InteractiveExecutor | Multi-turn chat; maintains KV cache state between calls LLama.Examples/Examples/LLama3ChatSession.cs24 |
| StatelessExecutor | Fresh context on every call; no memory between turns LLama.Examples/Examples/StatelessModeExecute.cs18-22 |
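A sketch contrasting how the two executors are constructed: InteractiveExecutor takes an existing context, while StatelessExecutor takes the weights and parameters and builds a context for each call. The path, prompt, and token limit are illustrative assumptions.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/model.gguf"); // hypothetical path
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// Stateful: bound to one context, so earlier turns stay in the KV cache.
var interactive = new InteractiveExecutor(context);

// Stateless: receives the weights and parameters rather than a context.
var stateless = new StatelessExecutor(model, parameters);

// One-shot inference; nothing from this call carries over to the next.
var oneShot = new InferenceParams { MaxTokens = 32 };
await foreach (var text in stateless.InferAsync("Question: What is the capital of France? Answer:", oneShot))
{
    Console.Write(text);
}
Console.WriteLine();
```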
ChatSession wraps an executor together with a ChatHistory. The ChatHistory seeds the conversation with system and example messages.
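A sketch of seeding a history in code and wrapping it in a session. The referenced example loads its history from a JSON asset instead; the messages and path here are hypothetical.

```csharp
using LLama;
using LLama.Common;
using LLama.Transformers;

var parameters = new ModelParams("models/model.gguf"); // hypothetical path
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Seed the conversation with a system message and one example exchange.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are a concise, helpful assistant named Bob.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

var session = new ChatSession(executor, chatHistory);

// Format the history with the chat template stored in the model file.
// Models whose template llama.cpp does not support need a custom transform.
session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));
```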
Sources: LLama.Examples/Examples/LLama3ChatSession.cs27-29
InferenceParams controls how the executor generates tokens for a single call, including sampling logic via DefaultSamplingPipeline.
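A sketch of a typical configuration. The temperature, token limit, and anti-prompt string are example values chosen for illustration, not settings mandated by the library.

```csharp
using System;
using System.Collections.Generic;
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    // Sampling behaviour lives on the pipeline, not on InferenceParams itself.
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f
    },

    // Upper bound on generated tokens for one call; -1 removes the limit.
    MaxTokens = 256,

    // Generation stops as soon as any of these strings is produced.
    AntiPrompts = new List<string> { "User:" }
};

Console.WriteLine($"MaxTokens: {inferenceParams.MaxTokens}");
```

With MaxTokens set to -1, the anti-prompts (or the model's own end-of-sequence token) become the only stopping condition, so they should match the model's turn format.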
Sources: LLama.Examples/Examples/LLama3ChatSession.cs40-49
ChatSession.ChatAsync returns IAsyncEnumerable<string>. Consume it with await foreach to get real-time token output.
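A single-turn sketch of consuming the stream. It prints each fragment as it arrives and also accumulates the full reply; the path, system message, and user message are hypothetical.

```csharp
using System;
using System.Text;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/model.gguf"); // hypothetical path
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");
var session = new ChatSession(new InteractiveExecutor(context), history);

var inferenceParams = new InferenceParams { MaxTokens = 128 };
var reply = new StringBuilder();

// Each iteration yields the next piece of text (a whole or partial word).
await foreach (var text in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Say hello in one sentence."),
    inferenceParams))
{
    Console.Write(text);  // show output in real time
    reply.Append(text);   // keep the complete response for later use
}
Console.WriteLine();
```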
Sources: LLama.Examples/Examples/LLama3ChatSession.cs64-72
Diagram: Runtime call chain from ChatAsync to native decode
Sources: LLama.Examples/Examples/LLama3ChatSession.cs64-72 LLama.Examples/Examples/StatelessModeExecute.cs50-53
Combining all steps above (adapted from LLama3ChatSession.cs):
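The following is a condensed sketch along the lines of that example, not a verbatim copy. The model path, system message, and the fixed "User:" anti-prompt are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;
using LLama.Sampling;
using LLama.Transformers;

// 1. Configure. The path is hypothetical.
var parameters = new ModelParams("models/model.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 10 // 0 = CPU only
};

// 2. Load weights, create a context, pick an executor.
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// 3. Wrap the executor in a chat session with a seeded history.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");
var session = new ChatSession(executor, chatHistory);
session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));

// 4. Generation settings (example values).
var inferenceParams = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.6f },
    MaxTokens = 256,
    AntiPrompts = new List<string> { "User:" } // assumed stop string
};

// 5. Chat loop: stream each reply until the user types "exit".
Console.Write("User> ");
var userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    Console.Write("Assistant> ");
    await foreach (var text in session.ChatAsync(
        new ChatHistory.Message(AuthorRole.User, userInput),
        inferenceParams))
    {
        Console.Write(text);
    }
    Console.WriteLine();

    Console.Write("User> ");
    userInput = Console.ReadLine() ?? "";
}
```

The session and executor are reused across turns, so earlier messages remain available to the model for the rest of the conversation.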
Sources: LLama.Examples/Examples/LLama3ChatSession.cs14-80
| Type | Namespace | Role |
|---|---|---|
| ModelParams | LLama.Common | Model path + hardware config LLama.Examples/Examples/LLama3ChatSession.cs17 |
| LLamaWeights | LLama | Loaded model weights LLama.Examples/Examples/LLama3ChatSession.cs22 |
| LLamaContext | LLama | KV cache + inference state LLama.Examples/Examples/LLama3ChatSession.cs23 |
| InteractiveExecutor | LLama | Stateful multi-turn executor LLama.Examples/Examples/LLama3ChatSession.cs24 |
| StatelessExecutor | LLama | Fresh context per turn LLama.Examples/Examples/StatelessModeExecute.cs18 |
| ChatSession | LLama | High-level chat API LLama.Examples/Examples/LLama3ChatSession.cs29 |
| InferenceParams | LLama.Common | Generation settings LLama.Examples/Examples/LLama3ChatSession.cs40 |
| DefaultSamplingPipeline | LLama.Sampling | Sampling logic (Temperature, etc.) LLama.Examples/Examples/LLama3ChatSession.cs42 |
| PromptTemplateTransformer | LLama.Transformers | Formats history into model-specific prompt templates LLama.Examples/Examples/LLama3ChatSession.cs33 |
Sources: LLama.Examples/Examples/LLama3ChatSession.cs1-81 LLama.Examples/Examples/StatelessModeExecute.cs1-58