URL: https://deepwiki.com/SciSharp/LLamaSharp/5.3-batched-execution

Batched Execution | SciSharp/LLamaSharp | DeepWiki


Last indexed: 18 May 2026 (ecd184)

Batched Execution

Purpose and Scope

The BatchedExecutor enables efficient management of multiple independent conversation threads that share a single LLamaContext and KV cache. Unlike stateful executors that maintain a single linear conversation (3.3), the batched executor can simultaneously process multiple conversations, scheduling their tokens into batches for efficient GPU/CPU utilization.

This system uses an epoch-based synchronization mechanism to coordinate prompt submission and inference across conversations. Each Conversation object represents an independent dialogue thread with its own sequence ID in the KV cache.

For basic single-conversation inference, see the stateful executors (3.3). For text embeddings, see 5.1. For multimodal capabilities used with the batched executor, see 5.2.

Sources: LLama/Batched/BatchedExecutor.cs14-414 LLama/Batched/Conversation.cs14-778

Core Concepts

BatchedExecutor and Conversation Relationship

The BatchedExecutor manages a shared LLamaContext and coordinates multiple Conversation instances. Each conversation is assigned a unique LLamaSeqId that identifies its position in the KV cache. The executor maintains a pool of available sequence IDs to ensure they are reused and stay within the backend's limits.

[Diagram: Code Entity Space Mapping]


Sources: LLama/Batched/BatchedExecutor.cs14-97 LLama/Batched/BatchedExecutor.cs21-48 LLama/Batched/Conversation.cs14-73 LLama/Batched/Conversation.cs181-184

Epoch Synchronization System

The executor uses a monotonically increasing Epoch counter to synchronize conversation states. Each conversation tracks a _requiredEpoch value that determines whether it needs inference or is ready for sampling.

| Conversation State | Condition | Allowed Operations |
|---|---|---|
| Ready for Prompting | _requiredEpoch <= Epoch | Prompt(), Modify(), Fork() |
| Waiting for Inference | _requiredEpoch > Epoch | None (blocked) |
| Ready for Sampling | _requiredEpoch == Epoch | Sample(), GetSampleIndex() |

The epoch advances twice per Infer() call:

  1. Once before decoding (to block sampling during inference). LLama/Batched/BatchedExecutor.cs167-168 (approximate logic in the Infer loop)
  2. Once after successful decoding (to enable sampling). LLama/Batched/BatchedExecutor.cs182 (approximate logic in the Infer loop)
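
The state table and the two epoch bumps can be modelled in a few lines. The following is a minimal Python sketch, not LLamaSharp code: `ToyExecutor` and `ToyConversation` are hypothetical names, the starting epoch of 1 and the "target epoch = current + 2" rule are assumptions made so that one successful `infer()` call lands exactly on the conversation's required epoch, and decoding is assumed to always succeed.

```python
class ToyExecutor:
    """Toy stand-in for the executor: it only tracks the epoch counter."""

    def __init__(self):
        self.epoch = 1  # assumed starting value

    def infer(self):
        # First bump, before decoding: nobody can sample while the batch runs.
        self.epoch += 1
        # ... a real executor would decode the next batch here ...
        # Second bump, after a successful decode: waiting conversations can sample.
        self.epoch += 1


class ToyConversation:
    """Toy conversation that only tracks which epoch it is waiting for."""

    def __init__(self, executor):
        self.executor = executor
        self.required_epoch = 0  # nothing pending yet

    @property
    def requires_inference(self):
        return self.required_epoch > self.executor.epoch

    @property
    def requires_sampling(self):
        return self.required_epoch == self.executor.epoch

    def prompt(self):
        if self.requires_inference:
            raise RuntimeError("already waiting for inference")
        # Assumption: one infer() call advances the epoch by exactly two.
        self.required_epoch = self.executor.epoch + 2

    def sample(self):
        if not self.requires_sampling:
            raise RuntimeError("no fresh logits to sample")
        return "token"  # placeholder for a sampled token
```

In this model a freshly created conversation is neither waiting nor sampleable, a prompted one is blocked until `infer()` completes, and a second `infer()` without a new prompt leaves the conversation's logits stale, so `sample()` is rejected again.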

Sources: LLama/Batched/BatchedExecutor.cs79-195 LLama/Batched/Conversation.cs17-65

Batch Queue Architecture

The BatchedExecutor maintains a queue of pending work. When a conversation is prompted, its input is added to an IBatch in the _batchQueue, and each Infer() call processes the next batch in that queue.
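
As a rough illustration of a queue of batches, the Python sketch below appends tokens to the newest batch while it has room and opens a new batch otherwise. `ToyBatchQueue`, its fixed `capacity`, and the fill policy are assumptions for illustration; the source does not describe how the real queue decides when to start a new batch.

```python
from collections import deque


class ToyBatchQueue:
    """Toy queue of batches; each batch is a list of (seq_id, token) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed per-batch token limit
        self.batches = deque()

    def add(self, seq_id, tokens):
        """Queue tokens for one conversation and return the index of the batch used."""
        if len(tokens) > self.capacity:
            raise ValueError("prompt does not fit in a single batch")
        # Open a new batch when the queue is empty or the newest batch is too full.
        if not self.batches or len(self.batches[-1]) + len(tokens) > self.capacity:
            self.batches.append([])
        self.batches[-1].extend((seq_id, token) for token in tokens)
        return len(self.batches) - 1

    def next_batch(self):
        """Pop the oldest batch, as one infer step would, or None if the queue is empty."""
        return self.batches.popleft() if self.batches else None
```

The index returned by `add` tells a caller how many inference steps must run before its tokens have been processed, which is the kind of information the epoch mechanism encodes.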

[Diagram: Inference Data Flow]


Sources: LLama/Batched/BatchedExecutor.cs65-236 LLama/Batched/Conversation.cs319-456

Creating and Managing Conversations

Creating and Loading

A Conversation is created via the executor, which allocates the lowest available sequence ID from an internal pool using a linear search and a HashSet<int>. LLama/Batched/BatchedExecutor.cs21-48
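
The allocation strategy described above (linear search over a set of used IDs) can be sketched as follows. This is a Python illustration only: `SeqIdPool`, `max_ids`, and the error raised on exhaustion are hypothetical, and a Python `set` stands in for the `HashSet<int>`.

```python
class SeqIdPool:
    """Hands out the lowest free sequence ID, tracking used IDs in a set."""

    def __init__(self, max_ids):
        self.max_ids = max_ids  # assumed upper bound imposed by the backend
        self.in_use = set()

    def acquire(self):
        # Linear search from 0 upwards for the first ID that is not in use.
        for candidate in range(self.max_ids):
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                return candidate
        raise RuntimeError("no free sequence IDs")

    def release(self, seq_id):
        # Called when a conversation is disposed, so the ID can be reused.
        self.in_use.remove(seq_id)
```

Always picking the lowest free ID keeps the IDs dense, so a long-running process that creates and disposes many conversations never drifts past the limit.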


Conversations can also be loaded from a Conversation.State object or a file path, restoring their position in the KV cache via Load. LLama/Batched/BatchedExecutor.cs162-186

Disposal

Disposing a conversation is essential: it calls MemorySequenceRemove on the native handle to remove the conversation's sequence from the KV cache, then releases the sequence ID back to the pool for reuse. LLama/Batched/Conversation.cs126-143

Sources: LLama/Batched/BatchedExecutor.cs28-186 LLama/Batched/Conversation.cs126-143

Prompting and Inference Workflow

The Prompt-Infer-Sample Cycle

  1. Prompt: Tokens are added to the conversation. This requests a batch from the executor. LLama/Batched/Conversation.cs319-352
  2. Infer: The application calls executor.Infer(). This processes the next IBatch in the queue. LLama/Batched/BatchedExecutor.cs194-195
  3. Sample: Once inference is complete (RequiresSampling == true), the application retrieves the chosen token using a sampler. LLama/Batched/ConversationExtensions.cs19-36
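
The cycle above can be sketched as a loop over several conversations that share one executor. The Python toy below is not the LLamaSharp API: the classes are hypothetical, the "model" simply favours the token after the last one it saw, and greedy (argmax) sampling is an assumption chosen to keep the result deterministic.

```python
VOCAB = 8  # size of the toy vocabulary


class ToyExecutor:
    def __init__(self):
        self.pending = {}  # seq_id -> tokens queued by prompt()
        self.logits = {}   # seq_id -> logits produced by the last infer()

    def infer(self):
        # One call processes every queued conversation together.
        for seq_id, tokens in self.pending.items():
            favoured = (tokens[-1] + 1) % VOCAB
            self.logits[seq_id] = [1.0 if t == favoured else 0.0 for t in range(VOCAB)]
        self.pending.clear()


class ToyConversation:
    def __init__(self, executor, seq_id):
        self.executor = executor
        self.seq_id = seq_id
        self.history = []

    def prompt(self, tokens):
        self.executor.pending.setdefault(self.seq_id, []).extend(tokens)
        self.history.extend(tokens)

    def sample(self):
        # Greedy sampling: take the highest-scoring token, consuming the logits.
        logits = self.executor.logits.pop(self.seq_id)
        return max(range(VOCAB), key=logits.__getitem__)


def generate(steps):
    executor = ToyExecutor()
    a = ToyConversation(executor, 0)
    b = ToyConversation(executor, 1)
    a.prompt([1, 2])
    b.prompt([5])
    for _ in range(steps):
        executor.infer()                  # Infer: one decode for both conversations
        for conv in (a, b):
            conv.prompt([conv.sample()])  # Sample, then Prompt with the new token
    return a.history, b.history
```

Each loop iteration performs a single `infer()` regardless of how many conversations are active, which is the point of batching: the per-step cost is shared.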

Prompting Variants

| Method | Input | Description |
|---|---|---|
| Prompt(string) | Text | Tokenizes text and adds to batch. LLama/Batched/Conversation.cs251-257 |
| Prompt(ReadOnlySpan<LLamaToken>) | Tokens | Adds tokens to batch; can generate logits for all tokens if allLogits is true. LLama/Batched/Conversation.cs319-352 |
| Prompt(ReadOnlyMemory<float>) | Embeddings | Adds raw embeddings to the batch. LLama/Batched/Conversation.cs412-456 |

Sources: LLama/Batched/Conversation.cs251-456 LLama/Batched/BatchedExecutor.cs194-195

Advanced Features

Forking Conversations

Fork() creates a new conversation that shares the same KV cache history as the parent. It uses MemorySequenceCopy at the native level. LLama/Batched/Conversation.cs184

To prevent forked conversations from corrupting each other's logits (since sampling can be destructive), both conversations set a _forked flag. This ensures they copy the logits to a private buffer before the next sampling run. LLama/Batched/Conversation.cs165-181
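
A small Python model shows why the private copy matters. Everything here is illustrative: `ToyConversation`, the `forked` attribute, and the deliberately destructive sampler are stand-ins, and the real buffer handling in LLamaSharp may differ.

```python
def destructive_greedy_sample(logits):
    """Pick the argmax, then wipe the buffer in place to mimic in-place sampler edits."""
    best = max(range(len(logits)), key=logits.__getitem__)
    for i in range(len(logits)):
        logits[i] = float("-inf")
    return best


class ToyConversation:
    def __init__(self, shared_logits, forked=False):
        self.shared_logits = shared_logits  # buffer shared with the executor
        self.forked = forked

    def fork(self):
        # Both parent and child are flagged, since both now read the same logits.
        self.forked = True
        return ToyConversation(self.shared_logits, forked=True)

    def sample(self):
        # Forked conversations sample from a private copy; others sample in place.
        logits = list(self.shared_logits) if self.forked else self.shared_logits
        return destructive_greedy_sample(logits)
```

Without the copy, whichever conversation sampled first would leave the other with a wiped buffer. With it, parent and child both see the original logits and can diverge by making different sampling choices.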

KV Cache Manipulation

The Modify() method allows direct manipulation of the conversation's tokens in the KV cache using SafeLLamaContextHandle methods. The ConversationExtensions class provides high-level wrappers around it.
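
To make "manipulating tokens in the KV cache" concrete, the Python toy below models one sequence as a list of (position, token) cells and builds two typical edits from two primitives, removing a position range and shifting a position range. The names `ToySequence`, `rewind`, and `shift_left` are hypothetical and chosen for illustration; they are not taken from this page and should not be read as the actual wrapper signatures.

```python
class ToySequence:
    """One conversation's cache entries as (position, token) cells."""

    def __init__(self, tokens):
        self.cells = list(enumerate(tokens))

    @property
    def end(self):
        # Position that the next appended token would occupy.
        return self.cells[-1][0] + 1 if self.cells else 0

    def remove(self, start, stop):
        """Drop every cell whose position lies in [start, stop)."""
        self.cells = [(p, t) for p, t in self.cells if not (start <= p < stop)]

    def shift(self, start, stop, delta):
        """Add delta to the position of every cell in [start, stop)."""
        self.cells = [(p + delta, t) if start <= p < stop else (p, t)
                      for p, t in self.cells]


def rewind(seq, count):
    """Forget the last `count` tokens."""
    if count > seq.end:
        raise ValueError("cannot rewind past the start")
    seq.remove(seq.end - count, seq.end)


def shift_left(seq, count, keep):
    """Keep the first `keep` tokens, drop the next `count`, and close the gap."""
    old_end = seq.end
    seq.remove(keep, keep + count)
    seq.shift(keep + count, old_end, -count)
```

Rewinding supports retrying a generation from an earlier point, while a left shift frees room in a full context without discarding the opening tokens.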

Sources: LLama/Batched/Conversation.cs459-612 LLama/Batched/ConversationExtensions.cs44-86

Multimodal Execution (MTMD)

If an MtmdWeights instance (the CLIP model) is provided to the BatchedExecutor, conversations can handle multimodal inputs.

  1. Loading Media: An MtmdChunkSequence is created from media data via SafeMtmdInputChunks. LLama/Batched/Conversation.cs88-106
  2. Queuing: The conversation is prompted with multimodal content. The executor wraps this content in an MtmdChunkSequence and adds it to the queue. LLama/Batched/Conversation.cs74-112
  3. Evaluation: During Infer(), the executor processes the queue, which performs the multimodal encoding. LLama/Batched/BatchedExecutor.cs194-195

Sources: LLama/Batched/BatchedExecutor.cs135-141 LLama/Batched/Conversation.cs74-112

Implementation Details

LLamaBatch vs LLamaNativeBatch

LLamaBatch is a managed wrapper that handles memory pinning for the native LLamaNativeBatch struct used by llama_decode.
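
The shape of the data handed to llama_decode can be sketched as a set of parallel arrays, one entry per token. The Python toy below borrows the field names of llama.cpp's `llama_batch` struct (`n_tokens`, `token`, `pos`, `n_seq_id`, `seq_id`, `logits`); `ToyBatch` itself is hypothetical, Python has no memory pinning, and how LLamaBatch stores these arrays internally is an assumption not confirmed by this page.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatch:
    """Managed-side builder that keeps one entry per token in parallel lists."""

    tokens: list = field(default_factory=list)
    positions: list = field(default_factory=list)
    seq_ids: list = field(default_factory=list)  # one list of sequence IDs per token
    logits: list = field(default_factory=list)   # True where output logits are wanted

    def add(self, token, pos, seq_ids, want_logits):
        """Append one token and return its index within the batch."""
        self.tokens.append(token)
        self.positions.append(pos)
        self.seq_ids.append(list(seq_ids))
        self.logits.append(want_logits)
        return len(self.tokens) - 1

    def to_native(self):
        """Flatten into the array layout a native decode call would receive."""
        return {
            "n_tokens": len(self.tokens),
            "token": list(self.tokens),
            "pos": list(self.positions),
            "n_seq_id": [len(ids) for ids in self.seq_ids],
            "seq_id": [list(ids) for ids in self.seq_ids],
            "logits": [1 if flag else 0 for flag in self.logits],
        }
```

The index returned by `add` is what a conversation would remember in order to find its logits after decoding, and a token tagged with several sequence IDs is evaluated once but becomes visible to every listed sequence.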

[Diagram: Native Data Mapping]


Inference Locking

To prevent race conditions, BatchedExecutor uses an _inferenceLock (via Interlocked.CompareExchange). This ensures that while llama_decode is running, no other thread can start another inference. LLama/Batched/BatchedExecutor.cs76-205
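
The guard can be approximated in Python with a non-blocking lock acquisition, which plays the role of the compare-and-exchange on a flag. This is a sketch under assumptions: `ToyExecutor` is hypothetical, `threading.Lock` is a substitute for `Interlocked.CompareExchange`, and raising an error on contention (rather than waiting) is an assumed behaviour.

```python
import threading


class ToyExecutor:
    def __init__(self):
        self._inference_lock = threading.Lock()
        self.decodes = 0

    def infer(self, decode):
        # Try to take the lock without waiting; fail fast if inference is running.
        if not self._inference_lock.acquire(blocking=False):
            raise RuntimeError("inference is already running")
        try:
            result = decode()  # stands in for the native decode call
            self.decodes += 1
            return result
        finally:
            # Always release, even if decoding raised, so later calls can proceed.
            self._inference_lock.release()
```

Failing fast instead of blocking surfaces accidental concurrent use as an error at the call site, rather than silently serialising two callers that each believe they own the batch.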

Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Batched/BatchedExecutor.cs194-205