Last indexed: 18 May 2026 (ecd184)

Executor API

This page provides detailed API reference documentation for the executor classes and interfaces in LLamaSharp. Executors are the primary abstraction for text generation, managing the interaction between user prompts, model inference, and token generation.

For high-level usage patterns and examples, see Chat Sessions For configuration of sampling behavior, see Sampling API (9.3). For the underlying context management, see Core API Classes (9.1).

Overview

The executor API is built around the ILLamaExecutor interface, which defines the contract for all text generation operations. LLamaSharp provides several primary executor implementations:

InteractiveExecutor: Stateful executor for conversational interactions with KV cache reuse across turns. LLama/LLamaInteractExecutor.cs23-24
InstructExecutor: Stateful executor for instruction-following patterns with prefix/suffix templating. LLama/LLamaInstructExecutor.cs22-23
StatelessExecutor: Independent inference with fresh context creation per request. LLama/LLamaStatelessExecutor.cs19-21
BatchedExecutor: Advanced executor for managing multiple concurrent "conversations" with manual control over decoding steps. LLama/Batched/BatchedExecutor.cs14-16

Sources: LLama/LLamaInteractExecutor.cs23-24 LLama/LLamaInstructExecutor.cs22-23 LLama/LLamaStatelessExecutor.cs19-21 LLama/Batched/BatchedExecutor.cs14-16

ILLamaExecutor Interface

The ILLamaExecutor interface defines the core contract for all executor implementations.

Interface Members

Member	Type	Description
`Context`	`LLamaContext`	The loaded context for inference operations. LLama/LLamaExecutorBase.cs65
`IsMultiModal`	`bool`	Indicates whether the executor supports multimodal inputs. LLama/LLamaExecutorBase.cs75-81
`ClipModel`	`MtmdWeights?`	Multimodal projection weights (MTMD) for image/audio processing. LLama/LLamaExecutor.cs84
`Embeds`	`List<SafeMtmdEmbed>`	Collection of media embeddings for multimodal models. LLama/LLamaExecutorBase.cs87
`InferAsync`	Method	Asynchronously generates text from a prompt. LLama/LLamaStatelessExecutor.cs73

Sources: LLama/LLamaExecutorBase.cs65-87 LLama/LLamaStatelessExecutor.cs73

Executor Architecture

The following diagram maps the relationship between the high-level executor abstractions and the concrete implementations used for inference.

Natural Language Space to Code Entity Space

Sources: LLama/LLamaExecutorBase.cs20-21 LLama/LLamaStatelessExecutor.cs19-21 LLama/LLamaInteractExecutor.cs23-24 LLama/LLamaInstructExecutor.cs22-23 LLama/Batched/BatchedExecutor.cs14-16

StatefulExecutorBase

Abstract base class providing common functionality for stateful executors that maintain conversation context across multiple turns. LLama/LLamaExecutorBase.cs20-21

Constructor

Sources: LLama/LLamaExecutorBase.cs103-126

Protected State Fields

Field	Type	Purpose
`_pastTokensCount`	`int`	Number of tokens already processed (`n_past`). LLama/LLamaExecutorBase.cs29
`_consumedTokensCount`	`int`	Tokens consumed during current inference (`n_consume`). LLama/LLamaExecutorBase.cs33
`_embed_inps`	`List<LLamaToken>`	Input tokens to be processed. LLama/LLamaExecutorBase.cs53
`_embeds`	`List<LLamaToken>`	Tokens pending evaluation. LLama/LLamaExecutorBase.cs49
`_last_n_tokens`	`FixedSizeQueue<LLamaToken>`	Recent token history for sampling. LLama/LLamaExecutorBase.cs61
`_session_tokens`	`List<LLamaToken>`	Tokens from loaded session file. LLama/LLamaExecutorBase.cs57
`_pathSession`	`string?`	Path to session cache file. LLama/LLamaExecutorBase.cs45

Sources: LLama/LLamaExecutorBase.cs25-61

Session File Management

The WithSessionFile method attaches a session cache file to the executor. If the file exists, it loads the KV cache state and attempts to reuse matching token prefixes to accelerate subsequent inference using NativeApi.llama_state_load_file. LLama/LLamaExecutorBase.cs147-152

Sources: LLama/LLamaExecutorBase.cs135-170

InteractiveExecutor

Stateful executor designed for conversational interactions. Maintains context across multiple turns and supports antiprompts for turn-taking. LLama/LLamaInteractExecutor.cs23-24

Execution Flow

Sources: LLama/LLamaInteractExecutor.cs131-174 LLama/LLamaExecutorBase.cs508-556

Multimodal Support (MTMD)

The InteractiveExecutor supports multimodal inputs via MtmdWeights. Media inputs (images/audio) are identified in text prompts and processed into SafeMtmdInputChunks before being evaluated by the model. LLama/LLamaInteractExecutor.cs49-52 LLama/LLamaInteractExecutor.cs28-30

Sources: LLama/LLamaInteractExecutor.cs49-52 LLama/LLamaInteractExecutor.cs28-30

InstructExecutor

Stateful executor for instruction-following patterns. Automatically wraps user input with configurable prefix and suffix templates. LLama/LLamaInstructExecutor.cs22-23

Template Wrapping

The executor pre-tokenizes the instruction prefix and suffix during construction:

During preprocessing, user input is wrapped: [_inp_pfx] + [user_text_tokens] + [_inp_sfx]. LLama/LLamaInstructExecutor.cs170-176

Sources: LLama/LLamaInstructExecutor.cs44-45 LLama/LLamaInstructExecutor.cs170-176

StatelessExecutor

Independent inference executor that creates a fresh LLamaContext for each InferAsync call. No state is preserved between calls. LLama/LLamaStatelessExecutor.cs19-21

It supports the ApplyTemplate property, which uses LLamaTemplate to apply model-specific chat templates (like ChatML or Llama-3) to the prompt before inference. LLama/LLamaStatelessExecutor.cs44-49 LLama/LLamaStatelessExecutor.cs97-104

Sources: LLama/LLamaStatelessExecutor.cs72-81 LLama/LLamaStatelessExecutor.cs97-104

BatchedExecutor

The BatchedExecutor manages multiple concurrent Conversation threads within a single context. It uses an Epoch system to synchronize inference across multiple sequences. LLama/Batched/BatchedExecutor.cs14-16

Key Components

BatchedExecutor: The central manager that executes Infer() to process all pending tokens across conversations. LLama/Batched/BatchedExecutor.cs194-210
Conversation: A single thread of dialogue. Supports Fork() to create branches that share KV cache history efficiently using NativeApi.llama_kv_cache_seq_cp. LLama/Batched/Conversation.cs14-15 LLama/Batched/Conversation.cs160-187

Batched Inference Lifecycle

Sources: LLama/Batched/BatchedExecutor.cs147-153 LLama/Batched/BatchedExecutor.cs194-210 LLama/Batched/Conversation.cs160-187

Chat Sessions

The ChatSession class provides a high-level API for managing conversations, including history management and text transformations. LLama/ChatSession.cs21-22

ChatSession Components

Component	Interface	Role
Executor	`ILLamaExecutor`	Handles the underlying model inference. LLama/ChatSession.cs51
History	`ChatHistory`	Stores the list of messages in the conversation. LLama/ChatSession.cs56
HistoryTransform	`IHistoryTransform`	Converts `ChatHistory` to/from prompt strings. LLama/ChatSession.cs61
InputTransform	`List<ITextTransform>`	Processes user text before it is sent to the executor. LLama/ChatSession.cs66
OutputTransform	`ITextStreamTransform`	Processes generated token streams. LLama/ChatSession.cs71

Sources: LLama/ChatSession.cs49-71

Session Persistence

ChatSession can serialize its entire state, including the KV cache and history, into a directory. LLama/ChatSession.cs163-166

ModelState.st: Serialized model state (KV cache) via LLamaContext.GetState(). LLama/ChatSession.cs26
ExecutorState.json: Serialized executor metadata via GetStateData(). LLama/ChatSession.cs30
ChatHistory.json: Serialized conversation history. LLama/ChatSession.cs34

Sources: LLama/ChatSession.cs23-46 LLama/ChatSession.cs163-183

Page Sources:

Refresh this wiki

URL: https://deepwiki.com/SciSharp/LLamaSharp/9.2-executor-api

⇱ Executor API | SciSharp/LLamaSharp | DeepWiki