
URL: https://deepwiki.com/SciSharp/LLamaSharp/2.3-context-management-and-llamacontext



Last indexed: 18 May 2026 (ecd184)

Context Management and LLamaContext

Purpose and Scope

This page documents the LLamaContext class and its role in managing inference contexts for language models. A context holds all the runtime state required to interact with a model, including the KV cache, tokenization state, and inference configuration. This page covers context creation, lifecycle management, tokenization, encoding/decoding operations, state persistence, and memory management.

For information about loading model weights, see Model Loading and LLamaWeights. For information about high-level inference patterns using executors, see Executors and Inference.


Context Architecture Overview

The LLamaContext class serves as the managed wrapper around a native llama.cpp context (llama_context). It provides a safe, idiomatic C# API for all context-related operations while managing native resources through the SafeLLamaContextHandle.

Context Class Hierarchy

The following diagram associates the managed system entities with their underlying native handles and P/Invoke targets.


Sources: LLama/LLamaContext.cs18-98 LLama/Native/SafeLLamaContextHandle.cs13-122 LLama/LLamaWeights.cs152-155

Key Responsibilities

| Component | Responsibilities |
| --- | --- |
| LLamaContext | High-level managed API, resource disposal, state serialization |
| SafeLLamaContextHandle | Native resource lifecycle, P/Invoke operations, reference counting |
| Native llama_context | KV cache storage, attention computation, token processing |

Sources: LLama/LLamaContext.cs15-98 LLama/Native/SafeLLamaContextHandle.cs13-122


Context Creation and Lifecycle

Creating a Context

Contexts are created from a LLamaWeights instance using context parameters that define the runtime configuration. The LLamaContext constructor initializes the native handle using SafeLLamaContextHandle.Create.


Sources: LLama/LLamaWeights.cs152-155 LLama/LLamaContext.cs78-98 LLama/Native/SafeLLamaContextHandle.cs109-122

Context Properties

The context exposes key configuration and state properties, mostly by delegating to the NativeHandle:

| Property | Type | Description | Source |
| --- | --- | --- | --- |
| ContextSize | uint | Total number of tokens in the context window | LLama/LLamaContext.cs26 |
| EmbeddingSize | int | Dimension of embedding vectors | LLama/LLamaContext.cs31 |
| BatchSize | uint | Maximum batch size for this context | LLama/LLamaContext.cs70 |
| GenerationThreads | int | Number of threads for single-token generation | LLama/LLamaContext.cs52-56 |
| BatchThreads | int | Number of threads for batch processing | LLama/LLamaContext.cs61-65 |
| Vocab | Vocabulary | Special tokens for the model | LLama/LLamaContext.cs75 |
| NativeHandle | SafeLLamaContextHandle | The underlying native handle | LLama/LLamaContext.cs42 |

Sources: LLama/LLamaContext.cs24-75 LLama/Native/SafeLLamaContextHandle.cs16-73

Resource Management

The context implements IDisposable and follows the SafeHandle pattern for deterministic resource cleanup. When SafeLLamaContextHandle.ReleaseHandle is called, it invokes llama_free and decrements the reference count on the associated model.


Sources: LLama/LLamaContext.cs19-20 LLama/Native/SafeLLamaContextHandle.cs80-90
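
The release sequence described above can be illustrated with a minimal sketch. It is not the LLamaSharp implementation: ToyModel and ToyContextHandle are hypothetical names, Marshal.AllocHGlobal stands in for the native context allocation, and Marshal.FreeHGlobal stands in for llama_free. The point is only the ordering inside ReleaseHandle: free the native resource, then drop the reference held on the model.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical stand-in for a loaded model that counts the contexts using it.
public sealed class ToyModel
{
    private int _refCount;
    public int RefCount => Volatile.Read(ref _refCount);
    public void AddRef() => Interlocked.Increment(ref _refCount);
    public void Release() => Interlocked.Decrement(ref _refCount);
}

// Hypothetical context handle: owns one native allocation and keeps the model referenced.
public sealed class ToyContextHandle : SafeHandle
{
    private readonly ToyModel _model;

    public ToyContextHandle(ToyModel model, int nativeBytes)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        _model = model ?? throw new ArgumentNullException(nameof(model));
        // Allocate first so a failed allocation cannot leak a model reference.
        SetHandle(Marshal.AllocHGlobal(nativeBytes));
        _model.AddRef();
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called exactly once by the SafeHandle machinery when the handle is released.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);   // stand-in for the native free call
        SetHandle(IntPtr.Zero);
        _model.Release();              // the model may now be freed if unused
        return true;
    }
}
```

Creating a ToyContextHandle raises the model's RefCount to 1, and disposing it (directly or via a using statement) returns the count to 0. Because the cleanup lives in ReleaseHandle, it also runs from the finalizer if the owner forgets to dispose.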


Tokenization Operations

The LLamaContext provides tokenization methods that convert between text and token sequences by wrapping the model's vocabulary.

Text to Tokens

The Tokenize method converts a string into an array of LLamaToken.

Sources: LLama/LLamaContext.cs107-110 LLama/LLamaWeights.cs165-168

Tokens to Text (Detokenization)

To convert tokens back to text, use StreamingTokenDecoder (LLama/LLamaContext.cs123). The legacy DeTokenize method is marked as obsolete because it handles partial UTF-8 sequences that span multiple calls less well than the decoder.

Sources: LLama/LLamaContext.cs117-126
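
Why partial UTF-8 sequences matter can be shown without any model at all. A single token may carry only part of a multi-byte character, so decoding each chunk independently corrupts the text. The sketch below uses only the .NET System.Text.Decoder type as an analogy; Utf8ChunkDemo is a hypothetical name, and nothing here reflects how StreamingTokenDecoder is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf8ChunkDemo
{
    // Decodes every chunk on its own: no state survives between calls,
    // so a character split across two chunks cannot be reassembled.
    public static string DecodeStateless(IEnumerable<byte[]> chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
            sb.Append(Encoding.UTF8.GetString(chunk));
        return sb.ToString();
    }

    // Uses one Decoder instance, which buffers an incomplete byte sequence
    // until the remaining bytes arrive in a later chunk.
    public static string DecodeStateful(IEnumerable<byte[]> chunks)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            int needed = decoder.GetCharCount(chunk, 0, chunk.Length, flush: false);
            var chars = new char[needed];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, flush: false);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }
}
```

For example, the character "é" is encoded as the two bytes 0xC3 0xA9. Passing those bytes as two separate one-byte chunks to DecodeStateful yields "é", while DecodeStateless cannot recover the character.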


State Management

LLamaContext supports saving and loading the complete state of the context or specific sequences.

Full Context State Persistence

The SaveState method writes the context state directly to a file using a MemoryMappedFile to avoid copying large buffers into managed memory.

  1. Size Query: Calls NativeHandle.GetStateSize() to determine required buffer size LLama/LLamaContext.cs140
  2. Memory Mapping: Creates a MemoryMappedFile and acquires a pointer to the view LLama/LLamaContext.cs144-150
  3. Native Save: Calls NativeHandle.GetState(ptr, stateSize) to write data directly to the mapped file LLama/LLamaContext.cs153

Sources: LLama/LLamaContext.cs129-163 LLama/Native/NativeApi.cs107-122
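
The three steps above can be sketched with the standard .NET memory-mapped file API. This is an illustration under stated assumptions, not the code in LLamaContext.cs: MappedStateWriter is a hypothetical name, the size argument plays the role of the value returned by the size query, and the fillNative callback is a stand-in for the native call that writes the state to a raw pointer.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

public static class MappedStateWriter
{
    // 'size' is assumed to come from a prior size query (step 1).
    // 'fillNative' is a hypothetical stand-in for the native writer (step 3).
    public static void Save(string path, long size, Action<IntPtr, long> fillNative)
    {
        // Step 2: create a file of exactly 'size' bytes and map it into memory.
        using var file = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, size);
        using var view = file.CreateViewAccessor(0, size);

        var buffer = view.SafeMemoryMappedViewHandle;
        bool addedRef = false;
        try
        {
            // Keep the view alive while a raw pointer to it is in use.
            buffer.DangerousAddRef(ref addedRef);
            var ptr = new IntPtr(buffer.DangerousGetHandle().ToInt64() + view.PointerOffset);

            // Step 3: the producer writes straight into the mapped file,
            // so no intermediate managed byte[] of 'size' bytes is allocated.
            fillNative(ptr, size);
        }
        finally
        {
            if (addedRef) buffer.DangerousRelease();
        }
    }
}
```

A caller could pass a callback such as (p, n) => Marshal.Copy(data, 0, p, (int)n) to simulate the native writer. The design choice this illustrates is that the destination buffer is the file itself, which matters when the state is large.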

Sequence-Specific State

LLamaSharp supports saving state for individual sequences, which is useful for managing multiple parallel conversations in a single context.

Native support for sequence persistence is provided by llama_state_seq_save_file and llama_state_seq_load_file.

Sources: LLama/LLamaContext.cs166-274 LLama/Native/NativeApi.cs123-133


KV Cache and Memory Management

Low-level KV cache management is performed through native functions that are either wrapped by SafeLLamaContextHandle or declared in NativeApi.Memory.cs.

Memory Operations

| Method | Native Function | Purpose | Source |
| --- | --- | --- | --- |
| llama_memory_clear | NativeApi.llama_memory_clear | Clears all metadata and optionally data buffers | LLama/Native/NativeApi.Memory.cs13 |
| llama_memory_seq_rm | NativeApi.llama_memory_seq_rm | Removes tokens for a sequence in a position range | LLama/Native/NativeApi.Memory.cs25 |
| llama_memory_seq_cp | NativeApi.llama_memory_seq_cp | Copies tokens from one sequence ID to another | LLama/Native/NativeApi.Memory.cs37 |
| llama_memory_seq_keep | NativeApi.llama_memory_seq_keep | Removes all tokens except those in the specified sequence | LLama/Native/NativeApi.Memory.cs45 |
| llama_memory_seq_add | NativeApi.llama_memory_seq_add | Adds a delta to positions of tokens in a sequence | LLama/Native/NativeApi.Memory.cs56 |

Sources: LLama/Native/NativeApi.Memory.cs7-94
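
To make the semantics of these operations concrete, the following toy model tracks only which positions belong to which sequence. It is a pure in-memory illustration and does not call the native functions. ToyKvCache and its method names are hypothetical, and two behaviors are assumptions of this sketch: position ranges are half-open [p0, p1), and a negative bound means "unbounded".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy bookkeeping model of sequence operations; not LLamaSharp's implementation.
public sealed class ToyKvCache
{
    // One cell per cached token: its position and the sequences that share it.
    private sealed class Cell
    {
        public int Pos;
        public HashSet<int> Seqs = new HashSet<int>();
    }

    private readonly List<Cell> _cells = new List<Cell>();

    public void Add(int seq, int pos)
    {
        var cell = new Cell { Pos = pos };
        cell.Seqs.Add(seq);
        _cells.Add(cell);
    }

    // Positions currently visible to a sequence, in ascending order.
    public int[] Positions(int seq) =>
        _cells.Where(c => c.Seqs.Contains(seq)).Select(c => c.Pos).OrderBy(p => p).ToArray();

    // Assumed convention: [p0, p1), with a negative bound meaning unbounded.
    private static bool InRange(int pos, int p0, int p1) =>
        (p0 < 0 || pos >= p0) && (p1 < 0 || pos < p1);

    public void Clear() => _cells.Clear();

    // Remove a sequence's tokens in the range; drop cells no sequence uses.
    public void SeqRm(int seq, int p0, int p1)
    {
        foreach (var c in _cells)
            if (InRange(c.Pos, p0, p1)) c.Seqs.Remove(seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Share the source sequence's cells in the range with the destination.
    public void SeqCp(int src, int dst, int p0, int p1)
    {
        foreach (var c in _cells)
            if (c.Seqs.Contains(src) && InRange(c.Pos, p0, p1)) c.Seqs.Add(dst);
    }

    // Keep only the given sequence; everything else is discarded.
    public void SeqKeep(int seq)
    {
        foreach (var c in _cells) c.Seqs.RemoveWhere(s => s != seq);
        _cells.RemoveAll(c => c.Seqs.Count == 0);
    }

    // Shift positions of a sequence's tokens in the range by delta.
    public void SeqAdd(int seq, int p0, int p1, int delta)
    {
        foreach (var c in _cells)
            if (c.Seqs.Contains(seq) && InRange(c.Pos, p0, p1)) c.Pos += delta;
    }
}
```

A typical pattern this models is prompt sharing: fill sequence 0 with a common prefix, call SeqCp to expose that prefix to sequence 1, and then let the two sequences diverge. In this toy, copied cells are shared rather than duplicated, so SeqAdd on a shared cell shifts it for every sequence that references it.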


Thread Configuration and Safety

Thread Configuration

The context allows runtime control over thread allocation for generation (single token) and batch (multiple tokens) processing.

Sources: LLama/LLamaContext.cs50-65 LLama/Native/SafeLLamaContextHandle.cs43-58

Global Inference Lock

The library uses internal synchronization to keep context operations thread-safe. SafeLLamaContextHandle.Create calls llama_init_from_model to create each context, while NativeApi ensures that the backend is initialized only once via llama_backend_init (LLama/Native/NativeApi.cs87).


Advanced Operations

Embeddings and Logits

The context provides access to the output of the model:

  • Logits: Prediction scores for tokens are managed through the native API.
  • Embeddings: llama_get_embeddings retrieves vector representations if the context was initialized with embedding support (LLama/Native/NativeApi.cs153).

Batching and Sequence Management

The LLamaBatch class is used to submit multiple tokens across multiple sequences simultaneously. It manages the pinning of memory for tokens, positions, and sequence IDs to ensure safe P/Invoke calls.


Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Native/SafeLLamaContextHandle.cs180-210
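
The role of pinning can be sketched as follows. This is a simplified illustration, not the LLamaBatch implementation: ToyBatch is a hypothetical name, each token carries a single sequence ID, GCHandle is used as one possible pinning mechanism, and the nativeCall callback is a stand-in for the P/Invoke target that would receive the raw pointers.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Toy batch: parallel lists of token, position, and sequence ID.
public sealed class ToyBatch
{
    private readonly List<int> _tokens = new List<int>();
    private readonly List<int> _positions = new List<int>();
    private readonly List<int> _seqIds = new List<int>();

    public int Count => _tokens.Count;

    public void Add(int token, int pos, int seq)
    {
        _tokens.Add(token);
        _positions.Add(pos);
        _seqIds.Add(seq);
    }

    // Pins the three arrays, hands stable pointers to the callback,
    // and always unpins afterwards, even if the callback throws.
    public T Submit<T>(Func<IntPtr, IntPtr, IntPtr, int, T> nativeCall)
    {
        int[] tokens = _tokens.ToArray();
        int[] positions = _positions.ToArray();
        int[] seqIds = _seqIds.ToArray();

        GCHandle hTok = default, hPos = default, hSeq = default;
        try
        {
            // While pinned, the GC will not move these arrays, so the
            // addresses stay valid for the duration of the call.
            hTok = GCHandle.Alloc(tokens, GCHandleType.Pinned);
            hPos = GCHandle.Alloc(positions, GCHandleType.Pinned);
            hSeq = GCHandle.Alloc(seqIds, GCHandleType.Pinned);

            return nativeCall(hTok.AddrOfPinnedObject(),
                              hPos.AddrOfPinnedObject(),
                              hSeq.AddrOfPinnedObject(),
                              tokens.Length);
        }
        finally
        {
            if (hSeq.IsAllocated) hSeq.Free();
            if (hPos.IsAllocated) hPos.Free();
            if (hTok.IsAllocated) hTok.Free();
        }
    }
}
```

The pointers passed to the callback must not be stored or used after Submit returns, because the arrays are unpinned in the finally block. Keeping the pin scoped to the call limits how long the garbage collector is prevented from compacting around these arrays.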