Last indexed: 18 May 2026 (ecd184)

Context Management and LLamaContext

Purpose and Scope

This page documents the LLamaContext class and its role in managing inference contexts for language models. A context holds all the runtime state required to interact with a model, including the KV cache, tokenization state, and inference configuration. This page covers context creation, lifecycle management, tokenization, encoding/decoding operations, state persistence, and memory management.

For information about loading model weights, see Model Loading and LLamaWeights. For information about high-level inference patterns using executors, see Executors and Inference.

Context Architecture Overview

The LLamaContext class serves as the managed wrapper around a native llama.cpp context (llama_context). It provides a safe, idiomatic C# API for all context-related operations while managing native resources through the SafeLLamaContextHandle.

Context Class Hierarchy

The following diagram associates the managed system entities with their underlying native handles and P/Invoke targets.

Sources: LLama/LLamaContext.cs18-98 LLama/Native/SafeLLamaContextHandle.cs13-122 LLama/LLamaWeights.cs152-155

Key Responsibilities

Component	Responsibilities
`LLamaContext`	High-level managed API, resource disposal, state serialization
`SafeLLamaContextHandle`	Native resource lifecycle, P/Invoke operations, reference counting
Native `llama_context`	KV cache storage, attention computation, token processing

Sources: LLama/LLamaContext.cs15-98 LLama/Native/SafeLLamaContextHandle.cs13-122

Context Creation and Lifecycle

Creating a Context

Contexts are created from a LLamaWeights instance using context parameters that define the runtime configuration. The LLamaContext constructor initializes the native handle using SafeLLamaContextHandle.Create.

Sources: LLama/LLamaWeights.cs152-155 LLama/LLamaContext.cs78-98 LLama/Native/SafeLLamaContextHandle.cs109-122

Context Properties

The context exposes key configuration and state properties, mostly by delegating to the NativeHandle:

Property	Type	Description	Source
`ContextSize`	`uint`	Maximum number of tokens the context window can hold	LLama/LLamaContext.cs26
`EmbeddingSize`	`int`	Dimension of embedding vectors	LLama/LLamaContext.cs31
`BatchSize`	`uint`	Maximum batch size for this context	LLama/LLamaContext.cs70
`GenerationThreads`	`int`	Number of threads for single-token generation	LLama/LLamaContext.cs52-56
`BatchThreads`	`int`	Number of threads for batch processing	LLama/LLamaContext.cs61-65
`Vocab`	`Vocabulary`	Special tokens for the model	LLama/LLamaContext.cs75
`NativeHandle`	`SafeLLamaContextHandle`	The underlying native handle	LLama/LLamaContext.cs42

Sources: LLama/LLamaContext.cs24-75 LLama/Native/SafeLLamaContextHandle.cs16-73

Resource Management

The context implements IDisposable and follows the SafeHandle pattern for deterministic resource cleanup. When SafeLLamaContextHandle.ReleaseHandle is called, it invokes llama_free and decrements the reference count on the associated model.

Sources: LLama/LLamaContext.cs19-20 LLama/Native/SafeLLamaContextHandle.cs80-90
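
The ownership rule described above (a context keeps its model alive until the context itself is released) can be illustrated with a minimal sketch built only on .NET's SafeHandle. DemoModelHandle and DemoContextHandle are hypothetical stand-ins that own a small block of unmanaged memory. They are not the LLamaSharp types, and the real ReleaseHandle calls llama_free rather than Marshal.FreeHGlobal.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a native model: owns a block of unmanaged memory.
public sealed class DemoModelHandle : SafeHandle
{
    public bool Freed { get; private set; }

    public DemoModelHandle() : base(IntPtr.Zero, ownsHandle: true)
    {
        SetHandle(Marshal.AllocHGlobal(16));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Runs only once the reference count reaches zero.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        Freed = true;
        return true;
    }
}

// Hypothetical stand-in for a native context that depends on a model.
public sealed class DemoContextHandle : SafeHandle
{
    private readonly DemoModelHandle _model;

    public bool Freed { get; private set; }

    public DemoContextHandle(DemoModelHandle model) : base(IntPtr.Zero, ownsHandle: true)
    {
        // Take a reference on the model so it cannot be freed while this context exists.
        bool added = false;
        model.DangerousAddRef(ref added);
        if (!added) throw new ObjectDisposedException(nameof(model));
        _model = model;
        SetHandle(Marshal.AllocHGlobal(16));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        // Free the context first, then give back the reference held on the model.
        Marshal.FreeHGlobal(handle);
        Freed = true;
        _model.DangerousRelease();
        return true;
    }
}
```

In this sketch, disposing the model while a context is still alive only marks the model handle as closed. The model's memory is freed when the last context releases its reference.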

Tokenization Operations

The LLamaContext provides tokenization methods that convert between text and token sequences by wrapping the model's vocabulary.

Text to Tokens

The Tokenize method converts a string into an array of LLamaToken.

Parameters:
- text: The string to tokenize.
- addBos: Whether to add the Beginning of Sentence token LLama/LLamaContext.cs107
- special: Whether to parse special/control tokens LLama/LLamaContext.cs107
Returns: Array of LLamaToken values.

Sources: LLama/LLamaContext.cs107-110 LLama/LLamaWeights.cs165-168

Tokens to Text (Detokenization)

To convert tokens back to text, use StreamingTokenDecoder LLama/LLamaContext.cs123. The legacy DeTokenize method is marked obsolete because it does not handle UTF-8 sequences that are split across calls as well as the decoder does.

Sources: LLama/LLamaContext.cs117-126
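
Why partial UTF-8 sequences matter can be shown without any model. The sketch below uses only System.Text: it feeds the bytes of "hé" in two chunks that split the two-byte character é (0xC3 0xA9) down the middle. Utf8StreamingDemo is a hypothetical helper for illustration. It is not how StreamingTokenDecoder is implemented, only the same general idea of carrying decoder state across calls.

```csharp
using System;
using System.Text;

public static class Utf8StreamingDemo
{
    // Decodes every chunk on its own. A character split across two chunks
    // cannot be reassembled and turns into replacement characters (U+FFFD).
    public static string DecodeIndependently(byte[][] chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
            sb.Append(Encoding.UTF8.GetString(chunk));
        return sb.ToString();
    }

    // Uses one stateful Decoder for all chunks. Trailing bytes of an
    // incomplete character are buffered until the next chunk arrives.
    public static string DecodeStreaming(byte[][] chunks)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            var buffer = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, buffer, 0, flush: false);
            sb.Append(buffer, 0, written);
        }
        return sb.ToString();
    }
}
```

With the chunks { 0x68, 0xC3 } and { 0xA9 }, DecodeStreaming yields "hé", while DecodeIndependently produces replacement characters in place of the é.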

State Management

LLamaContext supports saving and loading the complete state of the context or specific sequences.

Full Context State Persistence

The SaveState method writes the context state directly to a file using a MemoryMappedFile to avoid copying large buffers into managed memory.

- Size Query: Calls NativeHandle.GetStateSize() to determine required buffer size LLama/LLamaContext.cs140
- Memory Mapping: Creates a MemoryMappedFile and acquires a pointer to the view LLama/LLamaContext.cs144-150
- Native Save: Calls NativeHandle.GetState(ptr, stateSize) to write data directly to the mapped file LLama/LLamaContext.cs153

Sources: LLama/LLamaContext.cs129-163 LLama/Native/NativeApi.cs107-122
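
The size-query, map, and write-in-place sequence can be sketched with the .NET MemoryMappedFile API alone. In this illustration QuerySize and WriteState are hypothetical stand-ins for the native size query and the native writer, and MappedSaveDemo is not part of LLamaSharp. The sketch also uses DangerousGetHandle to obtain the view address, which is an assumption made to avoid unsafe code. The library's own pointer acquisition may differ.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

public static class MappedSaveDemo
{
    // Stand-in for the native size query (hypothetical fixed size).
    public static long QuerySize() => 4096;

    // Stand-in for the native writer: fills `size` bytes starting at `dest`.
    public static void WriteState(IntPtr dest, long size)
    {
        for (int i = 0; i < size; i++)
            Marshal.WriteByte(dest, i, (byte)(i % 251));
    }

    public static void Save(string path)
    {
        // Step 1: ask how many bytes are needed.
        long size = QuerySize();

        // Step 2: create a file of exactly that size and map it into memory.
        using var file = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, size);
        using var view = file.CreateViewAccessor(0, size);
        IntPtr ptr = view.SafeMemoryMappedViewHandle.DangerousGetHandle();

        // Step 3: let the writer fill the mapped region directly,
        // so no managed byte[] of `size` bytes is ever allocated.
        WriteState(ptr, size);
        view.Flush();
    }
}
```

The point of the design is that the state never passes through a managed array: the writer targets the mapped pages, and the operating system persists them to the file.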

Sequence-Specific State

LLamaSharp supports saving state for individual sequences, which is useful for managing multiple parallel conversations in a single context.

Methods:
- SaveState(string filename, LLamaSeqId sequence) LLama/LLamaContext.cs170
- LoadState(string filename, LLamaSeqId sequence) LLama/LLamaContext.cs243

Native support for sequence persistence is provided by llama_state_seq_save_file and llama_state_seq_load_file.

Sources: LLama/LLamaContext.cs166-274 LLama/Native/NativeApi.cs123-133

KV Cache and Memory Management

Low-level KV cache management is performed via native functions wrapped in SafeLLamaContextHandle or accessed through NativeApi.Memory.cs.

Memory Operations

Method	Native Function	Purpose
`llama_memory_clear`	`NativeApi.llama_memory_clear`	Clears all metadata and optionally data buffers LLama/Native/NativeApi.Memory.cs13
`llama_memory_seq_rm`	`NativeApi.llama_memory_seq_rm`	Removes tokens for a sequence in a position range LLama/Native/NativeApi.Memory.cs25
`llama_memory_seq_cp`	`NativeApi.llama_memory_seq_cp`	Copies tokens from one sequence ID to another LLama/Native/NativeApi.Memory.cs37
`llama_memory_seq_keep`	`NativeApi.llama_memory_seq_keep`	Removes all tokens except those in the specified sequence LLama/Native/NativeApi.Memory.cs45
`llama_memory_seq_add`	`NativeApi.llama_memory_seq_add`	Adds a delta to positions of tokens in a sequence LLama/Native/NativeApi.Memory.cs56

Sources: LLama/Native/NativeApi.Memory.cs7-94
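
To make the sequence operations in the table concrete, here is a toy in-memory model of a cache whose cells carry a position and a set of sequence IDs. ToyKvCache and all of its members are hypothetical and exist only for illustration. The sketch assumes half-open position ranges [p0, p1) with a negative bound meaning "unbounded", and it does not model the native memory layout or every edge case of the native functions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ToyKvCache
{
    private sealed class Cell
    {
        public int Pos;
        public HashSet<int> Seqs = new HashSet<int>();
    }

    private readonly List<Cell> _cells = new List<Cell>();

    // Assumed convention: [p0, p1), where a negative bound is unbounded.
    private static bool InRange(int pos, int p0, int p1) =>
        (p0 < 0 || pos >= p0) && (p1 < 0 || pos < p1);

    public void Add(int seq, int pos)
    {
        var cell = new Cell { Pos = pos };
        cell.Seqs.Add(seq);
        _cells.Add(cell);
    }

    // Analogue of a full clear.
    public void Clear() => _cells.Clear();

    // Analogue of seq_rm: drop a sequence's tokens within a position range.
    public void SeqRm(int seq, int p0, int p1)
    {
        foreach (var c in _cells)
            if (InRange(c.Pos, p0, p1)) c.Seqs.Remove(seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Analogue of seq_cp: cells of `src` in range also become part of `dst`.
    public void SeqCp(int src, int dst, int p0, int p1)
    {
        foreach (var c in _cells)
            if (c.Seqs.Contains(src) && InRange(c.Pos, p0, p1)) c.Seqs.Add(dst);
    }

    // Analogue of seq_keep: discard everything not belonging to `seq`.
    public void SeqKeep(int seq)
    {
        foreach (var c in _cells) c.Seqs.RemoveWhere(s => s != seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Analogue of seq_add: shift positions of a sequence's tokens by `delta`.
    public void SeqAdd(int seq, int p0, int p1, int delta)
    {
        foreach (var c in _cells)
            if (c.Seqs.Contains(seq) && InRange(c.Pos, p0, p1)) c.Pos += delta;
    }

    public int[] Positions(int seq) =>
        _cells.Where(c => c.Seqs.Contains(seq)).Select(c => c.Pos).OrderBy(p => p).ToArray();
}
```

A typical pattern this models: copy a shared prompt from sequence 0 to sequence 1, then trim or shift one of the sequences independently. Note that in this toy model a cell shared by two sequences has a single position, so shifting it through one sequence shifts it for both.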

Thread Configuration and Safety

Thread Configuration

The context allows runtime control over thread allocation for generation (single token) and batch (multiple tokens) processing.

- GenerationThreads: Maps to llama_n_threads LLama/Native/SafeLLamaContextHandle.cs47
- BatchThreads: Maps to llama_n_threads_batch LLama/Native/SafeLLamaContextHandle.cs56
- Setting: Both property setters call llama_set_n_threads to update the native context LLama/Native/SafeLLamaContextHandle.cs48-57

Sources: LLama/LLamaContext.cs50-65 LLama/Native/SafeLLamaContextHandle.cs43-58
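
Because the native llama_set_n_threads call takes the generation and batch thread counts together, a setter for one property has to pass along the current value of the other. The following sketch shows that shape with a hypothetical ThreadConfigDemo class. SetThreads is a stand-in for the native call, and the actual property implementations in SafeLLamaContextHandle may be written differently.

```csharp
using System;

public sealed class ThreadConfigDemo
{
    private int _nThreads;
    private int _nThreadsBatch;

    public ThreadConfigDemo(int generationThreads, int batchThreads)
    {
        SetThreads(generationThreads, batchThreads);
    }

    // Stand-in for the native setter, which always receives both values.
    private void SetThreads(int nThreads, int nThreadsBatch)
    {
        _nThreads = nThreads;
        _nThreadsBatch = nThreadsBatch;
    }

    public int GenerationThreads
    {
        get => _nThreads;
        // Change one value, re-send the other unchanged.
        set => SetThreads(value, _nThreadsBatch);
    }

    public int BatchThreads
    {
        get => _nThreadsBatch;
        set => SetThreads(_nThreads, value);
    }
}
```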

Global Inference Lock

To ensure thread safety across context operations, the library uses internal synchronization. Contexts are created by calling llama_init_from_model through SafeLLamaContextHandle.Create. Separately, NativeApi ensures that the backend is initialized only once via llama_backend_init LLama/Native/NativeApi.cs87.

Advanced Operations

Embeddings and Logits

The context provides access to the output of the model:

- Logits: Prediction scores for tokens are managed through the native API.
- Embeddings: llama_get_embeddings retrieves vector representations if the context was initialized with embedding support LLama/Native/NativeApi.cs153

Batching and Sequence Management

The LLamaBatch class is used to submit multiple tokens across multiple sequences simultaneously. It manages the pinning of memory for tokens, positions, and sequence IDs to ensure safe P/Invoke calls.

Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Native/SafeLLamaContextHandle.cs180-210
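
Pinning matters because the garbage collector may otherwise move a managed array while native code is still reading from its address. The sketch below shows the general idea with GCHandle from the .NET runtime. PinningDemo is hypothetical, SumNative merely simulates a native reader, and LLamaBatch may pin its buffers through a different mechanism.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinningDemo
{
    // Stand-in for a native function that reads `count` 32-bit values from `tokens`.
    public static int SumNative(IntPtr tokens, int count)
    {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += Marshal.ReadInt32(tokens, i * sizeof(int));
        return sum;
    }

    public static int Submit(int[] tokens)
    {
        // Pin the array so its address stays fixed for the duration of the call.
        var pin = GCHandle.Alloc(tokens, GCHandleType.Pinned);
        try
        {
            return SumNative(pin.AddrOfPinnedObject(), tokens.Length);
        }
        finally
        {
            // Always unpin, otherwise the array can never be moved or compacted.
            pin.Free();
        }
    }
}
```

A batch with several parallel arrays (tokens, positions, sequence IDs) would pin each of them and release all pins together once the native call returns.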


URL: https://deepwiki.com/SciSharp/LLamaSharp/2.3-context-management-and-llamacontext