Last indexed: 18 May 2026 (ecd184)

Inference Parameters

Purpose and Scope

Inference parameters control the behavior of text generation during inference. They are passed to executor methods (such as InferAsync) and determine token generation limits, stopping conditions, sampling strategies, and how the context is managed once the context window is exhausted. This page documents the IInferenceParams interface and the InferenceParams class, as well as web-specific options.

For model loading configuration, see Model Parameters (IModelParams). For context initialization settings, see Context Parameters (IContextParams). For detailed sampling configuration within the SamplingPipeline property, see Sampling Pipeline Overview and DefaultSamplingPipeline.

Overview

The InferenceParams class implements IInferenceParams and provides runtime control over the generation process. Unlike model and context parameters, which are set during initialization, inference parameters can vary between calls to InferAsync on the same executor instance.

[Diagram: Inference Parameter Data Flow]

Sources: LLama/Abstractions/IInferenceParams.cs10-54 LLama/Common/InferenceParams.cs12-51 LLama.Web/Common/InferenceOptions.cs10-42

Configuration Properties

MaxTokens

Controls the maximum number of tokens to generate during inference.

Property	Type	Default	Description
`MaxTokens`	`int`	`-1`	Maximum tokens to generate. `-1` means unlimited generation until completion.

Behavior:

Also referred to as n_predict in underlying implementations LLama/Common/InferenceParams.cs21-24
Set to -1 to generate without a token limit until the response completes LLama/Abstractions/IInferenceParams.cs18-21
Used by executors to determine the loop bound for token generation.

Sources: LLama/Common/InferenceParams.cs24 LLama/Abstractions/IInferenceParams.cs21
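
The loop bound described above can be sketched in a few lines. This is a language-neutral illustration written in Python, not LLamaSharp code: `generate` and `next_token` are hypothetical names, and `next_token` stands in for one decode-and-sample step that returns `None` when the model signals completion.

```python
from typing import Callable, Iterator, Optional


def generate(next_token: Callable[[], Optional[str]],
             max_tokens: int = -1) -> Iterator[str]:
    """Yield tokens until the budget is spent or the model finishes.

    next_token is a hypothetical stand-in for one decode-and-sample step;
    it returns None when the model emits its end-of-sequence token.
    """
    produced = 0
    # -1 disables the limit, so only the completion signal ends the loop.
    while max_tokens == -1 or produced < max_tokens:
        token = next_token()
        if token is None:
            break
        produced += 1
        yield token


# Usage: a fake model that produces three tokens and then finishes.
fake = iter(["a", "b", "c", None])
print(list(generate(lambda: next(fake), max_tokens=2)))
```

With a limit of 2 the loop stops after two tokens even though the fake model has more to say; with the default of `-1` it runs until `next_token` returns `None`.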

AntiPrompts

A collection of sequences that trigger the model to stop generating further tokens.

Property	Type	Default	Description
`AntiPrompts`	`IReadOnlyList<string>`	`[]`	Sequences where the model will stop generating further tokens.

Behavior:

The AntipromptProcessor tracks past tokens and checks if the current output buffer ends with any of these strings.
In StatelessExecutor, an anti-prompt marks where generation should stop, which keeps the model from continuing past its own turn when simulating a multi-turn conversation.

Sources: LLama/Common/InferenceParams.cs29 LLama/Abstractions/IInferenceParams.cs26
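
The suffix check described for AntipromptProcessor can be illustrated with a minimal Python sketch. It is not the library's implementation: the function names are hypothetical, and whether the matched anti-prompt text is itself emitted to the caller is an assumption made here for simplicity.

```python
from typing import Iterable, Iterator, Sequence


def ends_with_antiprompt(buffer: str, anti_prompts: Sequence[str]) -> bool:
    """Return True if the buffer ends with any non-empty anti-prompt."""
    return any(ap and buffer.endswith(ap) for ap in anti_prompts)


def stream_until_antiprompt(pieces: Iterable[str],
                            anti_prompts: Sequence[str]) -> Iterator[str]:
    """Yield decoded text pieces, stopping once an anti-prompt is matched.

    The accumulated buffer is checked rather than the latest piece,
    because an anti-prompt may be split across several tokens.
    """
    buffer = ""
    for piece in pieces:
        buffer += piece
        yield piece  # assumption: the matching piece is still emitted
        if ends_with_antiprompt(buffer, anti_prompts):
            break


# Usage: "User:" arrives split across two pieces and still stops the stream.
pieces = ["Hi", " there", "\nUs", "er:", " ignored"]
print(list(stream_until_antiprompt(pieces, ["User:"])))
```

Checking the whole buffer is the important design point: comparing only the most recent piece would miss an anti-prompt that the tokenizer splits into several tokens.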

TokensKeep

Specifies the number of tokens to preserve from the initial prompt when context shifting or eviction occurs.

Property	Type	Default	Description
`TokensKeep`	`int`	`0`	Number of tokens to keep from initial prompt during context shifting.

Behavior:

Ensures that vital system prompts or instructions at the beginning of the context are not lost when the model reaches its context limit LLama/Common/InferenceParams.cs15-18

Sources: LLama/Common/InferenceParams.cs18 LLama/Abstractions/IInferenceParams.cs15

Context Overflow Management

These properties define how the executor handles situations where the KV cache is full, especially for models that do not support native memory shifting (e.g., 2D RoPE models).

Property	Type	Default	Description
`OverflowStrategy`	`ContextOverflowStrategy`	`ThrowException`	Strategy to use when the context window is full LLama/Common/InferenceParams.cs43
`ContextTruncationPercentage`	`float`	`0.1f`	Percentage of past tokens to discard during truncation LLama/Common/InferenceParams.cs50

Strategies (ContextOverflowStrategy):

ThrowException: Triggers a ContextOverflowException. Useful for manual management in the application layer LLama/Common/ContextOverflowStrategy.cs18
TruncateAndReprefill: Silently drops the oldest tokens (preserving the system prompt via TokensKeep) and re-prefills the context LLama/Common/ContextOverflowStrategy.cs24

Sources: LLama/Common/InferenceParams.cs37-50 LLama/Common/ContextOverflowStrategy.cs7-25 LLama/Exceptions/ContextOverflowException.cs10-12
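
The two strategies can be modelled with a short Python sketch. It rests on several assumptions that the page does not confirm: that `0.1f` means 10% of the tokens after the preserved prefix, that the discard count is rounded up, and that the oldest non-preserved tokens are dropped first. The names `handle_overflow` and `ContextOverflowError` are hypothetical stand-ins, and the real executor would also re-prefill the KV cache with the surviving tokens.

```python
import math
from typing import List


class ContextOverflowError(Exception):
    """Hypothetical stand-in for ContextOverflowException."""


def handle_overflow(tokens: List[int], tokens_keep: int, strategy: str,
                    truncation_fraction: float = 0.1) -> List[int]:
    """Return the token list that should remain after a context overflow."""
    if strategy == "ThrowException":
        raise ContextOverflowError("context window is full")
    if strategy != "TruncateAndReprefill":
        raise ValueError(f"unknown strategy: {strategy}")

    # Preserve the first tokens_keep tokens (e.g. the system prompt).
    keep = max(0, min(tokens_keep, len(tokens)))
    past = tokens[keep:]

    # Assumption: discard a fraction of the remaining tokens, rounded up,
    # always at least one so that the overflow actually frees space.
    n_discard = 0
    if past:
        n_discard = min(len(past),
                        max(1, math.ceil(len(past) * truncation_fraction)))

    # Drop the oldest non-preserved tokens; the caller would re-prefill.
    return tokens[:keep] + past[n_discard:]


# Usage: keep 4 prompt tokens, drop a quarter of the remaining 16.
print(handle_overflow(list(range(20)), tokens_keep=4,
                      strategy="TruncateAndReprefill",
                      truncation_fraction=0.25))
```

In this model `ThrowException` leaves the decision to the caller, while `TruncateAndReprefill` trades some conversation history for the ability to keep generating.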

SamplingPipeline

The sampling strategy used to select the next token from the model's logits.

Property	Type	Default	Description
`SamplingPipeline`	`ISamplingPipeline`	`DefaultSamplingPipeline`	The pipeline used for logit transformation and token selection.

Behavior:

Defaults to a DefaultSamplingPipeline instance LLama/Common/InferenceParams.cs32
During the generation loop, the executor uses the pipeline to obtain the next token based on current context and logits.

Sources: LLama/Common/InferenceParams.cs32 LLama/Abstractions/IInferenceParams.cs31

DecodeSpecialTokens

Controls the behavior of token-to-text decoders regarding special tokens (like BOS, EOS, or EOT).

Property	Type	Default	Description
`DecodeSpecialTokens`	`bool`	`false`	If true, special tokens are converted to text; otherwise, they are omitted from the output.

Behavior:

Directly influences the behavior of decoders like StreamingTokenDecoder LLama/Abstractions/IInferenceParams.cs34-39
When true, special tokens that are usually filtered out (like control tokens) will be rendered into the output string stream.

Sources: LLama/Common/InferenceParams.cs35 LLama/Abstractions/IInferenceParams.cs39

Implementation and Data Flow

The following diagram illustrates how InferenceParams are consumed by the StreamingTokenDecoder and AntipromptProcessor during the inference loop within an executor.

[Diagram: Inference Parameter Lifecycle]

Sources: LLama/Abstractions/IInferenceParams.cs10-54 LLama/Common/InferenceParams.cs12-51 LLama/Common/ContextOverflowStrategy.cs11-25

Web-Specific Inference Options

In web applications (specifically within the LLama.Web project), the InferenceOptions class is used to map configuration to the IInferenceParams interface for use in ASP.NET Core services. This allows parameters to be bound from appsettings.json or received via API requests.

Class	Purpose
`InferenceOptions`	Implements `IInferenceParams` for use in ASP.NET Core dependency injection and configuration binding LLama.Web/Common/InferenceOptions.cs10-12

The LLama.Web project also utilizes an AsyncLock to manage concurrent access to models and contexts, ensuring that inference parameters are applied safely in a multi-user environment.

Sources: LLama.Web/Common/InferenceOptions.cs10-42

Summary of Mirostat Types

Although Mirostat is configured through the SamplingPipeline, the MirostatType enum is defined in the same file as InferenceParams. It identifies which version of the Mirostat algorithm is used.

Type	Value	Description
`Disable`	`0`	Disable Mirostat sampling LLama/Common/InferenceParams.cs62
`Mirostat`	`1`	Original Mirostat algorithm LLama/Common/InferenceParams.cs67
`Mirostat2`	`2`	Mirostat 2.0 algorithm LLama/Common/InferenceParams.cs72

Sources: LLama/Common/InferenceParams.cs57-73


URL: https://deepwiki.com/SciSharp/LLamaSharp/6.3-inference-parameters