The BatchedExecutor enables efficient management of multiple independent conversation threads that share a single LLamaContext and KV cache. Unlike stateful executors that maintain a single linear conversation (3.3), the batched executor can simultaneously process multiple conversations, scheduling their tokens into batches for efficient GPU/CPU utilization.
This system uses an epoch-based synchronization mechanism to coordinate prompt submission and inference across conversations. Each Conversation object represents an independent dialogue thread with its own sequence ID in the KV cache.
For basic single-conversation inference, see the stateful executors (3.3). For text embeddings, see 5.1. For multimodal capabilities used with the batched executor, see 5.2.
Sources: LLama/Batched/BatchedExecutor.cs14-414 LLama/Batched/Conversation.cs14-778
The BatchedExecutor manages a shared LLamaContext and coordinates multiple Conversation instances. Each conversation is assigned a unique LLamaSeqId that identifies its position in the KV cache. The executor maintains a pool of available sequence IDs to ensure they are reused and stay within the backend's limits.
Code Entity Space Mapping
Sources: LLama/Batched/BatchedExecutor.cs14-97 LLama/Batched/BatchedExecutor.cs21-48 LLama/Batched/Conversation.cs14-73 LLama/Batched/Conversation.cs181-184
The executor uses a monotonically increasing Epoch counter to synchronize conversation states. Each conversation tracks a _requiredEpoch value that determines whether it needs inference or is ready for sampling.
| Conversation State | Condition | Allowed Operations |
|---|---|---|
| Ready for Prompting | _requiredEpoch <= Epoch | Prompt(), Modify(), Fork() |
| Waiting for Inference | _requiredEpoch > Epoch | None (blocked) |
| Ready for Sampling | _requiredEpoch == Epoch | Sample(), GetSampleIndex() |
The epoch advances twice per Infer() call, both times inside the Infer loop.
Sources: LLama/Batched/BatchedExecutor.cs79-195 LLama/Batched/Conversation.cs17-65
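The state table above can be illustrated with a small toy model. This is a sketch in Python rather than the LLamaSharp API: `ToyExecutor`, `ToyConversation`, the starting epoch of 1, and the rule that a prompt targets the epoch two steps ahead are all assumptions made for illustration. Only the three comparisons against `Epoch` and the "advances twice per Infer()" behaviour come from the text.

```python
class ToyExecutor:
    """Hypothetical stand-in for the executor; not the LLamaSharp class."""

    def __init__(self):
        self.epoch = 1          # assumed starting value
        self.pending = False    # True once any conversation has queued work

    def infer(self):
        if not self.pending:
            return False        # assumption: nothing queued, epoch unchanged
        self.epoch += 1         # first advance
        # ... the real executor would decode the batch here ...
        self.epoch += 1         # second advance
        self.pending = False
        return True


class ToyConversation:
    """Hypothetical stand-in for a conversation; not the LLamaSharp class."""

    def __init__(self, executor):
        self.executor = executor
        self.required_epoch = 0  # assumed: below any executor epoch

    @property
    def requires_inference(self):
        return self.required_epoch > self.executor.epoch

    @property
    def requires_sampling(self):
        return self.required_epoch == self.executor.epoch

    def prompt(self):
        if self.requires_inference:
            raise RuntimeError("already waiting for inference")
        # Assumption: results become available two epoch steps from now.
        self.required_epoch = self.executor.epoch + 2
        self.executor.pending = True

    def sample(self):
        if not self.requires_sampling:
            raise RuntimeError("not ready for sampling")
        return self.executor.epoch


executor = ToyExecutor()
conversation = ToyConversation(executor)
conversation.prompt()                       # now waiting for inference
executor.infer()                            # epoch advances twice
sampled_at = conversation.sample()          # allowed: epochs match
```

In this model a conversation that has just been sampled still satisfies `_requiredEpoch <= Epoch`, so it can be prompted again immediately, which matches the first row of the table.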
The BatchedExecutor maintains a queue of pending work. When a conversation is prompted, it adds its requirements to an IBatch in the _batchQueue.
Inference Data Flow
Sources: LLama/Batched/BatchedExecutor.cs65-236 LLama/Batched/Conversation.cs319-456
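The queueing behaviour described above can be sketched with a toy model. This is Python, not the LLamaSharp implementation: `ToyBatchQueue`, the fixed `capacity`, and the "first batch with enough room" placement policy are assumptions chosen to keep the example small. The only behaviour taken from the text is that prompts add work to a batch in a queue and that inference consumes the next batch.

```python
from collections import deque


class ToyBatchQueue:
    """Hypothetical queue of pending batches; each batch is a list of
    (conversation_name, token) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity    # assumed per-batch token limit
        self.batches = deque()

    def add(self, conversation_name, tokens):
        if len(tokens) > self.capacity:
            raise ValueError("prompt does not fit in a single batch")
        entries = [(conversation_name, t) for t in tokens]
        # Assumed policy: reuse the first queued batch with enough room.
        for batch in self.batches:
            if len(batch) + len(entries) <= self.capacity:
                batch.extend(entries)
                return
        # Otherwise start a new batch at the end of the queue.
        self.batches.append(entries)

    def take_next(self):
        """Return the batch that one inference call would process."""
        return self.batches.popleft() if self.batches else None


queue = ToyBatchQueue(capacity=4)
queue.add("A", [1, 2, 3])   # starts the first batch
queue.add("B", [4, 5])      # does not fit, starts a second batch
queue.add("C", [6])         # fits into the first batch
first = queue.take_next()
```

Packing several conversations into one batch is what lets a single decode call serve all of them.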
A Conversation is created via the executor, which allocates the lowest available sequence ID from an internal pool using a linear search and a HashSet<int>. LLama/Batched/BatchedExecutor.cs21-48
Conversations can also be loaded from a Conversation.State object or a file path, restoring their position in the KV cache via Load. LLama/Batched/BatchedExecutor.cs162-186
Disposing a conversation is essential: it calls MemorySequenceRemove on the native handle and returns the sequence ID to the pool for reuse. LLama/Batched/Conversation.cs126-143
Sources: LLama/Batched/BatchedExecutor.cs28-186 LLama/Batched/Conversation.cs126-143
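The allocation scheme described above (lowest free ID, found by linear search over a hash set) can be sketched as follows. This is a Python toy model, not the LLamaSharp code: `SeqIdPool`, `max_sequences`, and the choice to store the IDs currently in use (rather than the free ones) are assumptions made for illustration.

```python
class SeqIdPool:
    """Hypothetical pool that hands out the lowest unused sequence ID."""

    def __init__(self, max_sequences):
        self.max_sequences = max_sequences  # assumed backend limit
        self.in_use = set()                 # plays the role of HashSet<int>

    def acquire(self):
        # Linear search from 0 upward for the first ID not in use.
        for seq_id in range(self.max_sequences):
            if seq_id not in self.in_use:
                self.in_use.add(seq_id)
                return seq_id
        raise RuntimeError("no free sequence IDs")

    def release(self, seq_id):
        # Called when a conversation is disposed.
        self.in_use.discard(seq_id)


pool = SeqIdPool(max_sequences=3)
first = pool.acquire()      # lowest free ID
second = pool.acquire()     # next lowest
pool.release(first)         # simulate disposing the first conversation
reused = pool.acquire()     # the released ID is handed out again
```

Handing out the lowest free ID keeps the IDs compact, so a long-running application that creates and disposes many conversations stays within the fixed limit.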
- Call executor.Infer(). This processes the next IBatch in the queue. LLama/Batched/BatchedExecutor.cs194-195
- Once the conversation is ready for sampling (RequiresSampling == true), the application retrieves the chosen token using a sampler. LLama/Batched/ConversationExtensions.cs19-36

| Method | Input | Description |
|---|---|---|
| Prompt(string) | Text | Tokenizes text and adds it to the batch. LLama/Batched/Conversation.cs251-257 |
| Prompt(ReadOnlySpan<LLamaToken>) | Tokens | Adds tokens to the batch; can generate logits for all tokens if allLogits is true. LLama/Batched/Conversation.cs319-352 |
| Prompt(ReadOnlyMemory<float>) | Embeddings | Adds raw embeddings to the batch. LLama/Batched/Conversation.cs412-456 |
Sources: LLama/Batched/Conversation.cs251-456 LLama/Batched/BatchedExecutor.cs194-195
Fork() creates a new conversation that shares the same KV cache history as the parent. It uses MemorySequenceCopy at the native level. LLama/Batched/Conversation.cs184
To prevent forked conversations from corrupting each other's logits (since sampling can be destructive), both conversations set a _forked flag. This ensures they copy the logits to a private buffer before the next sampling run. LLama/Batched/Conversation.cs165-181
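The reason for the private copy can be shown with a toy model. This is a Python sketch, not the LLamaSharp implementation: `ToyConversation`, `logits_for_sampling`, and `destructive_argmax` are hypothetical names, and a plain list stands in for the shared logits buffer. The only behaviour taken from the text is that both sides of a fork set a flag and copy the logits before sampling.

```python
class ToyConversation:
    """Hypothetical conversation holding a reference to shared logits."""

    def __init__(self, shared_logits):
        self._shared = shared_logits
        self._forked = False

    def fork(self):
        child = ToyConversation(self._shared)  # same underlying buffer
        self._forked = True                    # both sides are marked
        child._forked = True
        return child

    def logits_for_sampling(self):
        # A forked conversation works on a private copy so that in-place
        # edits by a sampler cannot leak into the other conversation.
        if self._forked:
            return list(self._shared)
        return self._shared


def destructive_argmax(logits):
    """Toy sampler that picks the largest logit and overwrites the rest."""
    best = max(range(len(logits)), key=logits.__getitem__)
    for i in range(len(logits)):
        if i != best:
            logits[i] = float("-inf")
    return best


parent = ToyConversation([0.1, 0.7, 0.2])
child = parent.fork()
picked = destructive_argmax(parent.logits_for_sampling())
child_view = child.logits_for_sampling()   # still the original values
```

An unforked conversation skips the copy in this model, since no other conversation can observe its buffer.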
The Modify() method allows direct manipulation of the conversation's tokens in the KV cache using SafeLLamaContextHandle methods. The ConversationExtensions class provides high-level wrappers around it.
Sources: LLama/Batched/Conversation.cs459-612 LLama/Batched/ConversationExtensions.cs44-86
If an MtmdWeights instance (CLIP model) is provided to the BatchedExecutor, conversations can handle multimodal inputs.
- A MtmdChunkSequence is created from media data via SafeMtmdInputChunks. LLama/Batched/Conversation.cs88-106
- The conversation takes the MtmdChunkSequence and adds it to the queue. LLama/Batched/Conversation.cs74-112
- On Infer(), the executor processes the queue, which performs multimodal encoding. LLama/Batched/BatchedExecutor.cs194-195

Sources: LLama/Batched/BatchedExecutor.cs135-141 LLama/Batched/Conversation.cs74-112
LLamaBatch is a managed wrapper that handles memory pinning for the native LLamaNativeBatch struct used by llama_decode.
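- LLamaBatch: the managed wrapper that pins memory for the LLamaNativeBatch struct. LLama/Native/LLamaBatch.cs104-152
- LLamaNativeBatch: the native struct passed to llama_decode. LLama/Native/LLamaNativeBatch.cs9-52

Native Data Mapping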
To prevent race conditions, BatchedExecutor uses an _inferenceLock (via Interlocked.CompareExchange). This ensures that while llama_decode is running, no other thread can start another inference. LLama/Batched/BatchedExecutor.cs76-205
Sources: LLama/Native/LLamaBatch.cs12-152 LLama/Batched/BatchedExecutor.cs194-205
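The locking behaviour can be approximated with a short sketch. Python has no direct equivalent of Interlocked.CompareExchange, so this example swaps in a non-blocking `threading.Lock` acquire as an analogue. `InferenceGate` is a hypothetical name, and raising an error when the lock is already held is an assumption of the sketch, not a statement about what BatchedExecutor does in that case.

```python
import threading


class InferenceGate:
    """Hypothetical guard that lets only one inference run at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def run(self, decode):
        # Try to take the lock without waiting, similar in spirit to a
        # compare-and-swap that succeeds only if no one holds it.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("inference is already running")
        try:
            return decode()
        finally:
            self._lock.release()    # always release, even on failure


gate = InferenceGate()


def decode_that_tries_to_reenter():
    # Simulates a second inference starting while the first is running.
    try:
        gate.run(lambda: "second")
    except RuntimeError:
        return "rejected"
    return "ran"


outcome = gate.run(decode_that_tries_to_reenter)
later = gate.run(lambda: "ok")   # the gate is free again afterwards
```

Releasing in a `finally` block matters here: if decoding fails, the gate must still open so that a retry can proceed.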