This page documents the LLamaContext class and its role in managing inference contexts for language models. A context holds all the runtime state required to interact with a model, including the KV cache, tokenization state, and inference configuration. This page covers context creation, lifecycle management, tokenization, encoding/decoding operations, state persistence, and memory management.
For information about loading model weights, see Model Loading and LLamaWeights. For information about high-level inference patterns using executors, see Executors and Inference.
The LLamaContext class serves as the managed wrapper around a native llama.cpp context (llama_context). It provides a safe, idiomatic C# API for all context-related operations while managing native resources through the SafeLLamaContextHandle.
The following diagram associates the managed system entities with their underlying native handles and P/Invoke targets.
Sources: LLama/LLamaContext.cs18-98 LLama/Native/SafeLLamaContextHandle.cs13-122 LLama/LLamaWeights.cs152-155
| Component | Responsibilities |
|---|---|
| LLamaContext | High-level managed API, resource disposal, state serialization |
| SafeLLamaContextHandle | Native resource lifecycle, P/Invoke operations, reference counting |
| Native llama_context | KV cache storage, attention computation, token processing |
Sources: LLama/LLamaContext.cs15-98 LLama/Native/SafeLLamaContextHandle.cs13-122
Contexts are created from a LLamaWeights instance using context parameters that define the runtime configuration. The LLamaContext constructor initializes the native handle using SafeLLamaContextHandle.Create.
Sources: LLama/LLamaWeights.cs152-155 LLama/LLamaContext.cs78-98 LLama/Native/SafeLLamaContextHandle.cs109-122
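A minimal sketch of creating a context, assuming the LLamaSharp NuGet package is referenced and that ModelParams, LLamaWeights.LoadFromFile, and LLamaWeights.CreateContext are available as in recent LLamaSharp releases. The path "model.gguf" is a placeholder, and the chosen context size is illustrative only.

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: "model.gguf" is a placeholder path to a local GGUF model file.
var parameters = new ModelParams("model.gguf")
{
    ContextSize = 2048, // requested context window, in tokens (illustrative value)
};

// Weights are loaded once and can back several contexts.
using var weights = LLamaWeights.LoadFromFile(parameters);

// CreateContext builds a LLamaContext on top of the loaded weights.
using var context = weights.CreateContext(parameters);

Console.WriteLine($"Context size: {context.ContextSize}");
Console.WriteLine($"Batch size:   {context.BatchSize}");
```

Both objects are declared with using, so the context is disposed before the weights when the scope ends.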
The context exposes key configuration and state properties, mostly by delegating to the NativeHandle:
| Property | Type | Description | Source |
|---|---|---|---|
| ContextSize | uint | Total number of tokens in context window | LLama/LLamaContext.cs26 |
| EmbeddingSize | int | Dimension of embedding vectors | LLama/LLamaContext.cs31 |
| BatchSize | uint | Maximum batch size for this context | LLama/LLamaContext.cs70 |
| GenerationThreads | int | Number of threads for single-token generation | LLama/LLamaContext.cs52-56 |
| BatchThreads | int | Number of threads for batch processing | LLama/LLamaContext.cs61-65 |
| Vocab | Vocabulary | Special tokens for the model | LLama/LLamaContext.cs75 |
| NativeHandle | SafeLLamaContextHandle | The underlying native handle | LLama/LLamaContext.cs42 |
Sources: LLama/LLamaContext.cs24-75 LLama/Native/SafeLLamaContextHandle.cs16-73
The context implements IDisposable and follows the SafeHandle pattern for deterministic resource cleanup. When SafeLLamaContextHandle.ReleaseHandle is called, it invokes llama_free and decrements the reference count on the associated model.
Sources: LLama/LLamaContext.cs19-20 LLama/Native/SafeLLamaContextHandle.cs80-90
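The release-and-decrement behaviour described above can be illustrated with a self-contained sketch that does not use LLamaSharp at all. DemoModelHandle and DemoContextHandle are hypothetical stand-ins, and Marshal.AllocHGlobal/FreeHGlobal take the place of the native allocation and llama_free. The sketch only shows how a child SafeHandle can keep its parent alive through DangerousAddRef and DangerousRelease; it is not a copy of the library's code.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a model handle; owns a small unmanaged allocation.
sealed class DemoModelHandle : SafeHandle
{
    public DemoModelHandle() : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(16));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        Console.WriteLine("model freed");
        return true;
    }
}

// Hypothetical stand-in for a context handle that keeps its model alive.
sealed class DemoContextHandle : SafeHandle
{
    private readonly DemoModelHandle _model;

    public DemoContextHandle(DemoModelHandle model) : base(IntPtr.Zero, true)
    {
        _model = model;
        bool added = false;
        _model.DangerousAddRef(ref added); // the context now holds a reference to the model
        SetHandle(Marshal.AllocHGlobal(16));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);  // plays the role of llama_free
        Console.WriteLine("context freed");
        _model.DangerousRelease();    // drop the reference taken in the constructor
        return true;
    }
}

static class Program
{
    static void Main()
    {
        var model = new DemoModelHandle();
        var context = new DemoContextHandle(model);

        model.Dispose();                     // not released yet: the context still references it
        Console.WriteLine(model.IsClosed);   // False

        context.Dispose();                   // releases the context, then the model
        Console.WriteLine(model.IsClosed);   // True
    }
}
```

The design point is that disposing the model first is harmless: its native memory is only freed once the last dependent context has been released.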
The LLamaContext provides tokenization methods that convert between text and token sequences by wrapping the model's vocabulary.
The Tokenize method converts a string into an array of LLamaToken.
- text: The string to tokenize.
- addBos: Whether to add the Beginning of Sentence token. LLama/LLamaContext.cs107
- special: Whether to parse special/control tokens. LLama/LLamaContext.cs107
- Returns: An array of LLamaToken values.

Sources: LLama/LLamaContext.cs107-110 LLama/LLamaWeights.cs165-168
To convert tokens back to text, use StreamingTokenDecoder LLama/LLamaContext.cs123. The legacy DeTokenize method is marked as obsolete because it does not handle partial UTF-8 sequences across calls as efficiently as the decoder.
Sources: LLama/LLamaContext.cs117-126
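A short round-trip sketch, assuming the Tokenize parameters listed above and assuming that StreamingTokenDecoder offers a constructor taking a LLamaContext plus Add and Read methods, as in recent LLamaSharp releases. The model path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: "model.gguf" is a placeholder path to a local GGUF model file.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// Text -> tokens. addBos prepends the beginning-of-sentence token;
// special controls whether special/control tokens in the text are parsed.
var tokens = context.Tokenize("Hello, world!", addBos: true, special: false);
Console.WriteLine($"Token count: {tokens.Length}");

// Tokens -> text. The decoder buffers incomplete UTF-8 sequences between calls.
var decoder = new StreamingTokenDecoder(context);
foreach (var token in tokens)
{
    decoder.Add(token);
    Console.Write(decoder.Read()); // emits only the characters completed so far
}
Console.WriteLine();
```

Feeding tokens one at a time mirrors how generated tokens arrive during inference, which is the case where a multi-byte character can be split across tokens.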
LLamaContext supports saving and loading the complete state of the context or specific sequences.
The SaveState method writes the context state directly to a file using a MemoryMappedFile to avoid copying large buffers into managed memory.
1. Calls NativeHandle.GetStateSize() to determine the required buffer size LLama/LLamaContext.cs140
2. Creates a MemoryMappedFile and acquires a pointer to the view LLama/LLamaContext.cs144-150
3. Calls NativeHandle.GetState(ptr, stateSize) to write data directly to the mapped file LLama/LLamaContext.cs153

Sources: LLama/LLamaContext.cs129-163 LLama/Native/NativeApi.cs107-122
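The memory-mapped write pattern can be sketched without LLamaSharp. In the example below, MappedStateWriter and FillState are hypothetical: FillState stands in for a native call that writes into a caller-supplied pointer. To avoid requiring unsafe code, the sketch obtains the view address with DangerousAddRef and DangerousGetHandle rather than AcquirePointer, so it should be read as an illustration of the technique, not as the library's implementation.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

static class MappedStateWriter
{
    // Hypothetical stand-in for a native function that fills a caller-supplied buffer.
    private static void FillState(IntPtr destination, int size)
    {
        for (var i = 0; i < size; i++)
            Marshal.WriteByte(destination, i, (byte)(i % 251));
    }

    public static void Save(string path, int stateSize)
    {
        // Map a file of exactly stateSize bytes and open a view over all of it.
        using var file = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, stateSize);
        using var view = file.CreateViewAccessor(0, stateSize);

        var handle = view.SafeMemoryMappedViewHandle;
        var addedRef = false;
        try
        {
            handle.DangerousAddRef(ref addedRef);
            // Write straight into the mapped view; no managed byte[] of stateSize is allocated.
            FillState(handle.DangerousGetHandle(), stateSize);
        }
        finally
        {
            if (addedRef) handle.DangerousRelease();
        }
    }
}

static class Program
{
    static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), "demo-state.bin");
        MappedStateWriter.Save(path, 1024);
        Console.WriteLine($"Wrote {new FileInfo(path).Length} bytes");
        File.Delete(path);
    }
}
```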
LLamaSharp supports saving state for individual sequences, which is useful for managing multiple parallel conversations in a single context.
- SaveState(string filename, LLamaSeqId sequence) LLama/LLamaContext.cs170
- LoadState(string filename, LLamaSeqId sequence) LLama/LLamaContext.cs243

Native support for sequence persistence is provided by llama_state_seq_save_file and llama_state_seq_load_file.
Sources: LLama/LLamaContext.cs166-274 LLama/Native/NativeApi.cs123-133
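A usage sketch for both persistence modes. It uses the SaveState and LoadState overloads named above, and it assumes a single-argument LoadState(string) overload and a LLamaSeqId.Zero value exist, as in recent LLamaSharp releases. File names and the model path are placeholders.

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Assumption: "model.gguf" is a placeholder path to a local GGUF model file.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// ... evaluate some tokens here so that there is state worth saving ...

// Whole-context snapshot: every sequence held by the context.
context.SaveState("context.state");
context.LoadState("context.state");

// Single-sequence snapshot: only the conversation stored under sequence 0.
// Assumption: LLamaSeqId.Zero identifies sequence 0.
var sequence = LLamaSeqId.Zero;
context.SaveState("sequence0.state", sequence);
context.LoadState("sequence0.state", sequence);
```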
Low-level KV cache management is performed via native functions wrapped in SafeLLamaContextHandle or accessed through NativeApi.Memory.cs.
| Method | Native Function | Purpose |
|---|---|---|
| llama_memory_clear | NativeApi.llama_memory_clear | Clears all metadata and optionally data buffers LLama/Native/NativeApi.Memory.cs13 |
| llama_memory_seq_rm | NativeApi.llama_memory_seq_rm | Removes tokens for a sequence in a position range LLama/Native/NativeApi.Memory.cs25 |
| llama_memory_seq_cp | NativeApi.llama_memory_seq_cp | Copies tokens from one sequence ID to another LLama/Native/NativeApi.Memory.cs37 |
| llama_memory_seq_keep | NativeApi.llama_memory_seq_keep | Removes all tokens except those in the specified sequence LLama/Native/NativeApi.Memory.cs45 |
| llama_memory_seq_add | NativeApi.llama_memory_seq_add | Adds a delta to positions of tokens in a sequence LLama/Native/NativeApi.Memory.cs56 |
Sources: LLama/Native/NativeApi.Memory.cs7-94
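The semantics of these operations can be illustrated with a toy in-memory model. ToyKvCache is hypothetical and does not call any native code: each cell holds a position and the set of sequence IDs that reference it, and position ranges are treated as half-open [p0, p1). The native conventions for negative arguments are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of a KV cache: a cell is a position plus the sequences that share it.
sealed class ToyKvCache
{
    private sealed class Cell
    {
        public int Pos;
        public HashSet<int> Seqs = new HashSet<int>();
    }

    private readonly List<Cell> _cells = new List<Cell>();

    public void Add(int seq, int pos) => _cells.Add(new Cell { Pos = pos, Seqs = { seq } });

    // Like llama_memory_clear: forget everything.
    public void Clear() => _cells.Clear();

    // Like llama_memory_seq_rm: detach seq from cells with p0 <= pos < p1.
    public void SeqRm(int seq, int p0, int p1)
    {
        foreach (var c in _cells.Where(c => c.Pos >= p0 && c.Pos < p1)) c.Seqs.Remove(seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Like llama_memory_seq_cp: cells of src in the range become shared with dst.
    public void SeqCp(int src, int dst, int p0, int p1)
    {
        foreach (var c in _cells.Where(c => c.Seqs.Contains(src) && c.Pos >= p0 && c.Pos < p1))
            c.Seqs.Add(dst);
    }

    // Like llama_memory_seq_keep: drop every sequence except seq.
    public void SeqKeep(int seq)
    {
        foreach (var c in _cells) c.Seqs.RemoveWhere(s => s != seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Like llama_memory_seq_add: shift positions of seq in the range by delta.
    public void SeqAdd(int seq, int p0, int p1, int delta)
    {
        foreach (var c in _cells.Where(c => c.Seqs.Contains(seq) && c.Pos >= p0 && c.Pos < p1))
            c.Pos += delta;
    }

    public int[] Positions(int seq) =>
        _cells.Where(c => c.Seqs.Contains(seq)).Select(c => c.Pos).OrderBy(p => p).ToArray();
}

static class Program
{
    static void Main()
    {
        var cache = new ToyKvCache();
        for (var pos = 0; pos < 4; pos++) cache.Add(0, pos); // sequence 0 holds positions 0..3

        cache.SeqCp(0, 1, 0, 4);   // sequence 1 now shares positions 0..3
        cache.SeqRm(1, 2, 4);      // sequence 1 keeps positions 0 and 1
        cache.SeqKeep(1);          // sequence 0 is dropped entirely
        cache.SeqAdd(1, 0, 2, 5);  // sequence 1 positions shift by 5

        Console.WriteLine(string.Join(",", cache.Positions(1))); // prints "5,6"
    }
}
```

Copying a sequence only shares cells rather than duplicating them, which is why forking a conversation prefix into a second sequence is cheap in this model.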
The context allows runtime control over thread allocation for generation (single token) and batch (multiple tokens) processing.
- llama_n_threads reads the thread count used for single-token generation LLama/Native/SafeLLamaContextHandle.cs47
- llama_n_threads_batch reads the thread count used for batch processing LLama/Native/SafeLLamaContextHandle.cs56
- llama_set_n_threads is called to update the native context LLama/Native/SafeLLamaContextHandle.cs48-57

Sources: LLama/LLamaContext.cs50-65 LLama/Native/SafeLLamaContextHandle.cs43-58
To ensure thread safety across context operations, the library uses internal synchronization. While llama_init_from_model is called via SafeLLamaContextHandle.Create, the NativeApi ensures that the backend is initialized only once via llama_backend_init LLama/Native/NativeApi.cs87.
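A brief sketch of adjusting thread counts at runtime, assuming GenerationThreads and BatchThreads are settable int properties on LLamaContext. The split between half and all logical cores is an arbitrary illustration, not a recommendation from the library.

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: "model.gguf" is a placeholder path to a local GGUF model file.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// Illustrative policy: half of the logical cores for single-token generation,
// all of them for batch (prompt) processing.
context.GenerationThreads = Math.Max(1, Environment.ProcessorCount / 2);
context.BatchThreads = Environment.ProcessorCount;

Console.WriteLine($"Generation threads: {context.GenerationThreads}");
Console.WriteLine($"Batch threads:      {context.BatchThreads}");
```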
The context provides access to the output of the model:
- llama_get_embeddings retrieves vector representations if the context was initialized with embedding support LLama/Native/NativeApi.cs153

The LLamaBatch class is used to submit multiple tokens across multiple sequences simultaneously. It manages the pinning of memory for tokens, positions, and sequence IDs to ensure safe P/Invoke calls.
Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Native/SafeLLamaContextHandle.cs180-210
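A sketch of submitting a prompt through a batch. It assumes that LLamaBatch has a parameterless constructor and an Add(token, position, sequence, logits) method, that an int converts implicitly to a position, that LLamaSeqId.Zero exists, and that LLamaContext.Decode accepts a LLamaBatch, as in recent LLamaSharp releases. The model path and prompt are placeholders.

```csharp
using System;
using LLama;
using LLama.Common;
using LLama.Native;

// Assumption: "model.gguf" is a placeholder path to a local GGUF model file.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

var tokens = context.Tokenize("The quick brown fox");

var batch = new LLamaBatch();
for (var i = 0; i < tokens.Length; i++)
{
    // Each entry carries a token, its position, and the sequence it belongs to.
    // Logits are requested only for the final token of the prompt.
    batch.Add(tokens[i], i, LLamaSeqId.Zero, i == tokens.Length - 1);
}

// Submit the whole batch to the native context in one call.
var result = context.Decode(batch);
Console.WriteLine($"Decode result: {result}");
```

Requesting logits only for the last prompt token avoids computing output distributions that would never be sampled from.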