Last indexed: 18 May 2026 (ecd184)

Batched Execution

Purpose and Scope

The BatchedExecutor manages multiple independent conversation threads that share a single LLamaContext and KV cache. Unlike the stateful executors, which maintain a single linear conversation (3.3), the batched executor processes several conversations at once, scheduling their tokens into shared batches to make better use of the GPU/CPU.

This system uses an epoch-based synchronization mechanism to coordinate prompt submission and inference across conversations. Each Conversation object represents an independent dialogue thread with its own sequence ID in the KV cache.

For basic single-conversation inference, see the stateful executors (3.3). For text embeddings, see 5.1. For multimodal capabilities used with the batched executor, see 5.2.

Sources: LLama/Batched/BatchedExecutor.cs14-414 LLama/Batched/Conversation.cs14-778

Core Concepts

BatchedExecutor and Conversation Relationship

The BatchedExecutor manages a shared LLamaContext and coordinates multiple Conversation instances. Each conversation is assigned a unique LLamaSeqId that identifies its sequence in the KV cache. The executor maintains a pool of sequence IDs so that released IDs are reused and the number in use stays within the backend's limits.

[Diagram: Code Entity Space Mapping]

Sources: LLama/Batched/BatchedExecutor.cs14-97 LLama/Batched/BatchedExecutor.cs21-48 LLama/Batched/Conversation.cs14-73 LLama/Batched/Conversation.cs181-184

Epoch Synchronization System

The executor uses a monotonically increasing Epoch counter to synchronize conversation states. Each conversation tracks a _requiredEpoch value that determines whether it needs inference or is ready for sampling.

Conversation State	Condition	Allowed Operations
Ready for Prompting	`_requiredEpoch <= Epoch`	`Prompt()`, `Modify()`, `Fork()`
Waiting for Inference	`_requiredEpoch > Epoch`	None (blocked)
Ready for Sampling	`_requiredEpoch == Epoch`	`Sample()`, `GetSampleIndex()`

The epoch advances twice per Infer() call:

Once before decoding (to block sampling during inference). LLama/Batched/BatchedExecutor.cs167-168 (approximate logic in the Infer loop)
Once after successful decoding (to enable sampling). LLama/Batched/BatchedExecutor.cs182 (approximate logic in the Infer loop)
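The state rules above can be illustrated with a small toy model. The Python sketch below is not LLamaSharp code: `ToyExecutor` and `ToyConversation` are hypothetical names, decoding is replaced by an optional callback, and the sketch assumes that the epoch starts at 1 and that prompting sets the required epoch to the current epoch plus two (one step for each advance made by a single `infer()` call).

```python
class ToyExecutor:
    """Hypothetical stand-in for the executor: holds only the epoch counter."""

    def __init__(self):
        self.epoch = 1  # assumption of this sketch: epochs start at 1

    def infer(self, on_decode=None):
        # First advance: prompted conversations stay blocked while decoding runs.
        self.epoch += 1
        if on_decode is not None:
            on_decode()  # placeholder for the real decode step
        # Second advance: prompted conversations become ready for sampling.
        self.epoch += 1


class ToyConversation:
    """Hypothetical stand-in for a conversation: tracks only its required epoch."""

    def __init__(self, executor):
        self.executor = executor
        self.required_epoch = 0  # a fresh conversation is not waiting on anything

    @property
    def requires_inference(self):
        return self.required_epoch > self.executor.epoch

    @property
    def requires_sampling(self):
        return self.required_epoch == self.executor.epoch

    def prompt(self):
        if self.requires_inference:
            raise RuntimeError("already prompted; call infer() first")
        # Assumption: one infer() call advances the epoch by exactly two.
        self.required_epoch = self.executor.epoch + 2

    def state(self):
        return (self.requires_inference, self.requires_sampling)


executor = ToyExecutor()
conversation = ToyConversation(executor)

states = [conversation.state()]                                   # fresh
conversation.prompt()
states.append(conversation.state())                               # prompted
executor.infer(on_decode=lambda: states.append(conversation.state()))  # mid-decode
states.append(conversation.state())                               # after infer

for label, state in zip(["fresh", "prompted", "decoding", "done"], states):
    print(label, state)
```

Each tuple is `(requires_inference, requires_sampling)`. In this model the conversation is blocked both after prompting and during the simulated decode, and only becomes ready for sampling after the second epoch advance.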

Sources: LLama/Batched/BatchedExecutor.cs79-195 LLama/Batched/Conversation.cs17-65

Batch Queue Architecture

The BatchedExecutor maintains a queue of pending work. When a conversation is prompted, it adds its requirements to an IBatch in the _batchQueue.
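A rough way to picture the queue is as a list of fixed-capacity batches. The Python sketch below is a toy model, not the LLamaSharp implementation: `ToyBatchQueue`, its `capacity` parameter, the first-fit placement rule and the "epoch plus two per queue slot" formula are all assumptions made for illustration, and the two epoch advances of a real inference call are collapsed into one step.

```python
class ToyBatchQueue:
    """Hypothetical model of a queue of pending batches."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed maximum number of tokens per batch
        self.epoch = 1
        self.batches = []         # each batch is a list of (seq_id, token) pairs

    def add_tokens(self, seq_id, tokens):
        """Queue tokens and return the epoch at which they will have been decoded."""
        if len(tokens) > self.capacity:
            raise ValueError("prompt does not fit in a single batch")
        # First fit: reuse the earliest queued batch that still has room.
        for index, batch in enumerate(self.batches):
            if len(batch) + len(tokens) <= self.capacity:
                break
        else:
            self.batches.append([])
            index = len(self.batches) - 1
            batch = self.batches[index]
        batch.extend((seq_id, token) for token in tokens)
        # Assumption: each queued batch costs one infer() call, i.e. two epochs.
        return self.epoch + 2 * (index + 1)

    def infer(self):
        """Process the batch at the head of the queue, if there is one."""
        if not self.batches:
            return None
        batch = self.batches.pop(0)
        self.epoch += 2  # simplified: both epoch advances in one step
        return batch


queue = ToyBatchQueue(capacity=4)
ready_a = queue.add_tokens(seq_id=0, tokens=[10, 11, 12])
ready_b = queue.add_tokens(seq_id=1, tokens=[20, 21])  # does not fit, new batch
ready_c = queue.add_tokens(seq_id=2, tokens=[30])      # fits in the first batch

first = queue.infer()
epoch_after_first = queue.epoch
second = queue.infer()
epoch_after_second = queue.epoch
print(first, epoch_after_first)
print(second, epoch_after_second)
```

In this model, sequences 0 and 2 share the first batch and become ready together, while sequence 1 has to wait for a second inference call.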

[Diagram: Inference Data Flow]

Sources: LLama/Batched/BatchedExecutor.cs65-236 LLama/Batched/Conversation.cs319-456

Creating and Managing Conversations

Creating and Loading

A Conversation is created via the executor, which allocates the lowest available sequence ID from an internal pool using a linear search and a HashSet<int>. LLama/Batched/BatchedExecutor.cs21-48

Conversations can also be loaded from a Conversation.State object or a file path, restoring their position in the KV cache via Load. LLama/Batched/BatchedExecutor.cs162-186

Disposal

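Disposing a conversation is essential: it calls MemorySequenceRemove on the native handle to clear that sequence from the KV cache, and it releases the sequence ID back to the pool for reuse. A conversation that is never disposed keeps both its cache entries and its sequence ID occupied. LLama/Batched/Conversation.cs126-143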
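The allocate-on-create and release-on-dispose behaviour described in this section can be modelled in a few lines. The Python sketch below is illustrative only: `SequenceIdPool` and `max_sequences` are hypothetical names, and it assumes the set tracks the IDs currently in use (the source states only that a linear search and a `HashSet<int>` are involved).

```python
class SequenceIdPool:
    """Hypothetical pool that always hands out the lowest free sequence ID."""

    def __init__(self, max_sequences):
        self.max_sequences = max_sequences  # assumed backend limit
        self.in_use = set()                 # assumption: the set holds IDs in use

    def acquire(self):
        # Linear search from zero finds the lowest ID that is not in use.
        for seq_id in range(self.max_sequences):
            if seq_id not in self.in_use:
                self.in_use.add(seq_id)
                return seq_id
        raise RuntimeError("no free sequence IDs")

    def release(self, seq_id):
        # Called when a conversation is disposed; the ID becomes reusable.
        self.in_use.remove(seq_id)


pool = SequenceIdPool(max_sequences=3)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()

try:
    pool.acquire()
    exhausted = False
except RuntimeError:
    exhausted = True  # all three IDs are taken

pool.release(b)        # "dispose" the second conversation
d = pool.acquire()     # the freed ID is handed out again
print(a, b, c, d, exhausted)
```

The model also shows why disposal matters: until an ID is released, a pool with a fixed upper bound cannot serve new conversations.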

Sources: LLama/Batched/BatchedExecutor.cs28-186 LLama/Batched/Conversation.cs126-143

Prompting and Inference Workflow

The Prompt-Infer-Sample Cycle

Prompt: Tokens are added to the conversation. This requests a batch from the executor. LLama/Batched/Conversation.cs319-352
Infer: The application calls executor.Infer(). This processes the next IBatch in the queue. LLama/Batched/BatchedExecutor.cs194-195
Sample: Once inference is complete (RequiresSampling == true), the application retrieves the chosen token using a sampler. LLama/Batched/ConversationExtensions.cs19-36

Prompting Variants

Method	Input	Description
`Prompt(string)`	Text	Tokenizes text and adds to batch. LLama/Batched/Conversation.cs251-257
`Prompt(ReadOnlySpan<LLamaToken>)`	Tokens	Adds tokens to batch; can generate logits for all tokens if `allLogits` is true. LLama/Batched/Conversation.cs319-352
`Prompt(ReadOnlyMemory<float>)`	Embeddings	Adds raw embeddings to the batch. LLama/Batched/Conversation.cs412-456

Sources: LLama/Batched/Conversation.cs251-456 LLama/Batched/BatchedExecutor.cs194-195

Advanced Features

Forking Conversations

Fork() creates a new conversation that shares the same KV cache history as the parent. It uses MemorySequenceCopy at the native level. LLama/Batched/Conversation.cs184

To prevent forked conversations from corrupting each other's logits (since sampling can be destructive), both conversations set a _forked flag. This ensures they copy the logits to a private buffer before the next sampling run. LLama/Batched/Conversation.cs165-181
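The copy-before-sampling idea can be shown with a toy model. The Python sketch below is not the LLamaSharp implementation: `ToyConversation`, the shared list of logits and the "ban one token" step (standing in for any destructive sampler stage) are hypothetical, and the flag is simply left set rather than being cleared at the point where the real code would clear it.

```python
class ToyConversation:
    """Hypothetical conversation that reads logits from a shared buffer."""

    def __init__(self, shared_logits):
        self.shared_logits = shared_logits
        self.forked = False

    def fork(self):
        child = ToyConversation(self.shared_logits)
        # Both sides now depend on the same logits, so both must copy.
        self.forked = True
        child.forked = True
        return child

    def sample(self, banned_token):
        logits = self.shared_logits
        if self.forked:
            logits = list(logits)  # private copy protects the other conversation
        # Destructive step: modifies whichever buffer it was handed.
        logits[banned_token] = float("-inf")
        # Greedy pick: index of the largest remaining logit.
        return max(range(len(logits)), key=logits.__getitem__)


shared = [0.1, 2.0, 1.5]
parent = ToyConversation(shared)
child = parent.fork()

parent_pick = parent.sample(banned_token=1)
child_pick = child.sample(banned_token=2)
print(parent_pick, child_pick, shared)

# Without a fork there is nothing to protect, so the buffer is edited in place.
solo_buffer = [0.1, 2.0, 1.5]
solo_pick = ToyConversation(solo_buffer).sample(banned_token=1)
print(solo_pick, solo_buffer)
```

In this model the child still sees the original value for token 1 even though the parent banned it first, because the parent only modified its private copy.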

KV Cache Manipulation

The Modify() method allows direct manipulation of the conversation's tokens in the KV cache using SafeLLamaContextHandle methods. The ConversationExtensions class provides high-level wrappers:

Rewind: Removes tokens from the end of the conversation. LLama/Batched/ConversationExtensions.cs44-57
ShiftLeft: Removes tokens from the middle/start while preserving a "keep" prefix, shifting remaining tokens left. LLama/Batched/ConversationExtensions.cs67-86
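The effect of the two wrappers on a single sequence can be sketched with an ordinary dictionary. The Python model below is a simplification, not the native KV cache API: `rewind`, `shift_left` and the `cells` mapping from position to token are hypothetical, and each function returns the new end position of the conversation.

```python
def rewind(cells, end, count):
    """Drop the last `count` positions; return the new end position."""
    if count < 0 or count > end:
        raise ValueError("cannot rewind past the start of the conversation")
    new_end = end - count
    for position in range(new_end, end):
        del cells[position]
    return new_end


def shift_left(cells, end, count, keep):
    """Remove `count` positions after the first `keep`, then close the gap."""
    if keep < 0 or count < 0 or keep + count > end:
        raise ValueError("range to remove lies outside the conversation")
    # Remove the block that directly follows the kept prefix.
    for position in range(keep, keep + count):
        del cells[position]
    # Move every later entry `count` positions to the left.
    for position in range(keep + count, end):
        cells[position - count] = cells.pop(position)
    return end - count


def render(cells, end):
    return "".join(cells[position] for position in range(end))


# A = prefix to keep, B = block to remove, C = tail that gets shifted.
cells = dict(enumerate("AAABBBBBCCCC"))
end = len(cells)

end = shift_left(cells, end, count=5, keep=3)
after_shift = render(cells, end)

end = rewind(cells, end, count=2)
after_rewind = render(cells, end)
print(after_shift, after_rewind, end)
```

After the shift, the positions are contiguous again, which is what allows generation to continue from the new end position.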

Sources: LLama/Batched/Conversation.cs459-612 LLama/Batched/ConversationExtensions.cs44-86

Multimodal Execution (MTMD)

If an MtmdWeights instance (CLIP model) is provided to the BatchedExecutor, conversations can handle multimodal inputs.

Loading Media: An MtmdChunkSequence is created from media data via SafeMtmdInputChunks. LLama/Batched/Conversation.cs88-106
Queuing: The conversation is prompted with multimodal content. The executor wraps it in an MtmdChunkSequence and adds it to the queue. LLama/Batched/Conversation.cs74-112
Evaluation: During Infer(), the executor processes the queue, which performs the multimodal encoding. LLama/Batched/BatchedExecutor.cs194-195

Sources: LLama/Batched/BatchedExecutor.cs135-141 LLama/Batched/Conversation.cs74-112

Implementation Details

LLamaBatch vs LLamaNativeBatch

LLamaBatch is a managed wrapper that handles memory pinning for the native LLamaNativeBatch struct used by llama_decode.

LLamaBatch: Manages arrays for tokens, positions, and sequence IDs. LLama/Native/LLamaBatch.cs12-70
ToNativeBatch: Pins managed memory and populates the LLamaNativeBatch struct. LLama/Native/LLamaBatch.cs104-152
LLamaNativeBatch: The raw struct passed to llama_decode. LLama/Native/LLamaNativeBatch.cs9-52

[Diagram: Native Data Mapping]

Inference Locking

To prevent race conditions, BatchedExecutor uses an _inferenceLock (via Interlocked.CompareExchange). This ensures that while llama_decode is running, no other thread can start another inference. LLama/Batched/BatchedExecutor.cs76-205
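The guard can be approximated outside .NET as follows. Python has no direct equivalent of Interlocked.CompareExchange, so this sketch substitutes a non-blocking `threading.Lock.acquire`, which gives the same "take the flag only if it is free" behaviour. `ToyExecutor` is a hypothetical name, and raising an error on an overlapping call is an assumption of the sketch.

```python
import threading


class ToyExecutor:
    """Hypothetical executor that allows only one inference at a time."""

    def __init__(self):
        self._inference_lock = threading.Lock()
        self.completed = 0

    def infer(self, decode):
        # Try to take the lock without waiting; fail fast if it is already held.
        if not self._inference_lock.acquire(blocking=False):
            raise RuntimeError("inference is already running")
        try:
            decode()  # placeholder for the slow native decode call
            self.completed += 1
        finally:
            # Always release, even if decoding raised.
            self._inference_lock.release()


executor = ToyExecutor()
errors = []


def decode_with_overlap():
    # Simulates a second caller arriving while the first decode is in progress.
    try:
        executor.infer(lambda: None)
    except RuntimeError as exc:
        errors.append(str(exc))


executor.infer(decode_with_overlap)  # the nested call is rejected
executor.infer(lambda: None)         # a later call succeeds
print(executor.completed, errors)
```

The overlapping call is simulated from inside the decode callback so that the example is deterministic; with real threads the same check rejects whichever caller arrives second.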

Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Batched/BatchedExecutor.cs194-205

URL: https://deepwiki.com/SciSharp/LLamaSharp/5.3-batched-execution