Inference Parameters control the behavior of text generation during inference operations. These parameters are passed to executor methods (such as InferAsync) and determine token generation limits, stopping conditions, sampling strategies, and context management strategies when the window is exhausted. This page documents the IInferenceParams interface and InferenceParams class, as well as web-specific options.
For model loading configuration, see Model Parameters (IModelParams). For context initialization settings, see Context Parameters (IContextParams). For detailed sampling configuration within the SamplingPipeline property, see Sampling Pipeline Overview and DefaultSamplingPipeline.
The InferenceParams class implements IInferenceParams and provides runtime control over the generation process. Unlike model and context parameters, which are set during initialization, inference parameters can vary between calls to InferAsync on the same executor instance.
"Inference Parameter Data Flow"
Sources: LLama/Abstractions/IInferenceParams.cs10-54 LLama/Common/InferenceParams.cs12-51 LLama.Web/Common/InferenceOptions.cs10-42
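As a rough illustration of inference parameters varying between calls on one executor, the sketch below reuses a single StatelessExecutor with two different InferenceParams instances. The model path, prompts, and numeric values are placeholders chosen for the example rather than values taken from the repository, and the code assumes the LLamaSharp package plus a local GGUF model are available.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Hypothetical model path: replace with a real GGUF file on disk.
var modelParams = new ModelParams("models/example.gguf")
{
    ContextSize = 2048
};

using var weights = LLamaWeights.LoadFromFile(modelParams);
var executor = new StatelessExecutor(weights, modelParams);

// First call: a short answer that stops at the first newline.
var shortAnswer = new InferenceParams
{
    MaxTokens = 32,
    AntiPrompts = new List<string> { "\n" }
};

// Second call on the same executor: a longer budget and a different stop sequence.
var longAnswer = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "Question:" }
};

await foreach (var piece in executor.InferAsync("Question: What is a KV cache?\nAnswer:", shortAnswer))
{
    Console.Write(piece);
}

Console.WriteLine();

await foreach (var piece in executor.InferAsync("Question: Explain context shifting.\nAnswer:", longAnswer))
{
    Console.Write(piece);
}
```

Model and context settings are fixed when the weights and executor are created; only the InferenceParams argument changes between the two calls.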
Controls the maximum number of tokens to generate during inference.
| Property | Type | Default | Description |
|---|---|---|---|
| MaxTokens | int | -1 | Maximum tokens to generate. -1 means unlimited generation until completion. |
Behavior:
- Corresponds to n_predict in underlying implementations LLama/Common/InferenceParams.cs21-24
- Set to -1 to generate indefinitely until the response completes LLama/Abstractions/IInferenceParams.cs18-21

Sources: LLama/Common/InferenceParams.cs24 LLama/Abstractions/IInferenceParams.cs21
A collection of sequences that trigger the model to stop generating further tokens.
| Property | Type | Default | Description |
|---|---|---|---|
| AntiPrompts | IReadOnlyList<string> | [] | Sequences where the model will stop generating further tokens. |
Behavior:
- AntipromptProcessor tracks past tokens and checks if the current output buffer ends with any of these strings.
- With StatelessExecutor, providing an anti-prompt allows the model to decide where to stop generating during a multi-turn simulation.

Sources: LLama/Common/InferenceParams.cs29 LLama/Abstractions/IInferenceParams.cs26
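The suffix check described above can be sketched in a few lines. The AntiPromptBuffer class below is a hypothetical stand-in written for illustration only; it is not the library's AntipromptProcessor, and it assumes a plain ordinal string comparison against the accumulated output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration class; not part of LLamaSharp.
public sealed class AntiPromptBuffer
{
    private readonly IReadOnlyList<string> _antiPrompts;
    private readonly StringBuilder _buffer = new StringBuilder();

    public AntiPromptBuffer(IReadOnlyList<string> antiPrompts)
    {
        _antiPrompts = antiPrompts ?? throw new ArgumentNullException(nameof(antiPrompts));
    }

    // Appends newly decoded text and reports whether the accumulated
    // output now ends with any configured anti-prompt.
    public bool Add(string text)
    {
        _buffer.Append(text);
        var current = _buffer.ToString();

        foreach (var antiPrompt in _antiPrompts)
        {
            // Empty strings are skipped so they never trigger a stop.
            if (antiPrompt.Length > 0 && current.EndsWith(antiPrompt, StringComparison.Ordinal))
            {
                return true;
            }
        }

        return false;
    }
}
```

Because the check runs against the accumulated buffer rather than the latest piece, an anti-prompt such as "User:" is still detected when it arrives split across two decoded pieces ("Us" followed by "er:").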
Specifies the number of tokens to preserve from the initial prompt when context shifting or eviction occurs.
| Property | Type | Default | Description |
|---|---|---|---|
| TokensKeep | int | 0 | Number of tokens to keep from initial prompt during context shifting. |
Behavior:
Sources: LLama/Common/InferenceParams.cs18 LLama/Abstractions/IInferenceParams.cs15
These properties define how the executor handles situations where the KV cache is full, especially for models that do not support native memory shifting (e.g., 2D RoPE models).
| Property | Type | Default | Description |
|---|---|---|---|
| OverflowStrategy | ContextOverflowStrategy | ThrowException | Strategy to use when the context window is full LLama/Common/InferenceParams.cs43 |
| ContextTruncationPercentage | float | 0.1f | Percentage of past tokens to discard during truncation LLama/Common/InferenceParams.cs50 |
Strategies (ContextOverflowStrategy):
- ThrowException: Triggers a ContextOverflowException. Useful for manual management in the application layer LLama/Common/ContextOverflowStrategy.cs18
- TruncateAndReprefill: Silently drops the oldest tokens (preserving the system prompt via TokensKeep) and re-prefills the context LLama/Common/ContextOverflowStrategy.cs24

Sources: LLama/Common/InferenceParams.cs37-50 LLama/Common/ContextOverflowStrategy.cs7-25 LLama/Exceptions/ContextOverflowException.cs10-12
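To make the interaction between TokensKeep and ContextTruncationPercentage concrete, here is a minimal arithmetic sketch. It rests on two assumptions that the text above does not confirm: that the value is a fraction (0.1f meaning 10%), and that it applies to the tokens following the preserved prefix. The TruncationSketch helper is hypothetical and does not reproduce the executor's actual logic.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLamaSharp.
public static class TruncationSketch
{
    // Returns how many of the oldest non-preserved tokens would be dropped.
    // Assumption: truncationPercentage is a fraction in (0, 1], applied to the
    // tokens that come after the first tokensKeep tokens.
    public static int TokensToDiscard(int pastTokens, int tokensKeep, float truncationPercentage)
    {
        if (pastTokens < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pastTokens));
        }

        if (truncationPercentage <= 0f || truncationPercentage > 1f)
        {
            throw new ArgumentOutOfRangeException(nameof(truncationPercentage));
        }

        // Never preserve more tokens than currently exist.
        var keep = Math.Clamp(tokensKeep, 0, pastTokens);
        var evictable = pastTokens - keep;
        if (evictable == 0)
        {
            return 0;
        }

        // Round to the nearest whole token, but always free at least one slot.
        var discard = (int)Math.Round(evictable * (double)truncationPercentage);
        return Math.Clamp(discard, 1, evictable);
    }
}
```

Under these assumptions, a context holding 2048 past tokens with 48 preserved and a value of 0.1f would drop 200 of the oldest evictable tokens before the remainder is re-prefilled.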
The sampling strategy used to select the next token from the model's logits.
| Property | Type | Default | Description |
|---|---|---|---|
| SamplingPipeline | ISamplingPipeline | DefaultSamplingPipeline | The pipeline used for token selection and logit transformation. |
Behavior:
- Defaults to a new DefaultSamplingPipeline instance LLama/Common/InferenceParams.cs32

Sources: LLama/Common/InferenceParams.cs32 LLama/Abstractions/IInferenceParams.cs31
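A custom pipeline can be supplied in place of the default. The fragment below is a configuration sketch only: the Temperature, TopK, and TopP values are illustrative rather than recommendations, and it assumes the LLamaSharp package is referenced.

```csharp
using LLama.Common;
using LLama.Sampling;

// Illustrative values; tune them for the model and task at hand.
var inferenceParams = new InferenceParams
{
    MaxTokens = 128,
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.7f,
        TopK = 40,
        TopP = 0.9f
    }
};
```

Since the pipeline is a property of InferenceParams, different calls to InferAsync can use different sampling settings without reloading the model.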
Controls the behavior of token-to-text decoders regarding special tokens (like BOS, EOS, or EOT).
| Property | Type | Default | Description |
|---|---|---|---|
| DecodeSpecialTokens | bool | false | If true, special characters are converted to text; otherwise, they are invisible. |
Behavior:
- Applied by the StreamingTokenDecoder LLama/Abstractions/IInferenceParams.cs34-39
- When true, special tokens that are usually filtered out (like control tokens) are rendered into the output string stream.

Sources: LLama/Common/InferenceParams.cs35 LLama/Abstractions/IInferenceParams.cs39
The following diagram illustrates how InferenceParams are consumed by the StreamingTokenDecoder and AntipromptProcessor during the inference loop within an executor.
"Inference Parameter Lifecycle"
Sources: LLama/Abstractions/IInferenceParams.cs10-54 LLama/Common/InferenceParams.cs12-51 LLama/Common/ContextOverflowStrategy.cs11-25
In web applications (specifically within the LLama.Web project), the InferenceOptions class is used to map configuration to the IInferenceParams interface for use in ASP.NET Core services. This allows parameters to be bound from appsettings.json or received via API requests.
| Class | Purpose |
|---|---|
| InferenceOptions | Implements IInferenceParams for use in ASP.NET Core dependency injection and configuration binding LLama.Web/Common/InferenceOptions.cs10-12 |
The LLama.Web project also utilizes an AsyncLock to manage concurrent access to models and contexts, ensuring that inference parameters are applied safely in a multi-user environment.
Sources: LLama.Web/Common/InferenceOptions.cs10-42
While configured via the SamplingPipeline, the MirostatType enum is defined alongside inference parameters to categorize the version of the Mirostat algorithm used.
| Type | Value | Description |
|---|---|---|
| Disable | 0 | Disable Mirostat sampling LLama/Common/InferenceParams.cs62 |
| Mirostat | 1 | Original Mirostat algorithm LLama/Common/InferenceParams.cs67 |
| Mirostat2 | 2 | Mirostat 2.0 algorithm LLama/Common/InferenceParams.cs72 |
Sources: LLama/Common/InferenceParams.cs57-73