
URL: https://deepwiki.com/SciSharp/LLamaSharp/6.3-inference-parameters

Last indexed: 18 May 2026 (ecd184)

Inference Parameters

Purpose and Scope

Inference Parameters control the behavior of text generation during inference operations. These parameters are passed to executor methods (such as InferAsync) and determine token generation limits, stopping conditions, sampling strategies, and context management strategies when the window is exhausted. This page documents the IInferenceParams interface and InferenceParams class, as well as web-specific options.

For model loading configuration, see Model Parameters (IModelParams). For context initialization settings, see Context Parameters (IContextParams). For detailed sampling configuration within the SamplingPipeline property, see Sampling Pipeline Overview and DefaultSamplingPipeline.

Overview

The InferenceParams class implements IInferenceParams and provides runtime control over the generation process. Unlike model and context parameters, which are set during initialization, inference parameters can vary between calls to InferAsync on the same executor instance.
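
As a rough sketch of varying parameters per call, the following reuses one executor with two different InferenceParams instances. It assumes the StatelessExecutor, ModelParams, and LLamaWeights types from LLamaSharp; the model path and the prompts are placeholders, not values taken from this page.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Placeholder path (assumption): point this at a GGUF model file on your machine.
var modelParams = new ModelParams("path/to/model.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);
var executor = new StatelessExecutor(weights, modelParams);

// Short, bounded answer that stops when the text ends with "User:".
var shortAnswer = new InferenceParams
{
    MaxTokens = 64,
    AntiPrompts = new List<string> { "User:" }
};

// Longer completion with no anti-prompts.
var longAnswer = new InferenceParams { MaxTokens = 512 };

// The same executor instance is used for both calls; only the parameters differ.
await foreach (var text in executor.InferAsync("User: What is a KV cache?\nAssistant:", shortAnswer))
    Console.Write(text);

await foreach (var text in executor.InferAsync("Write a short story about a robot.", longAnswer))
    Console.Write(text);
```

Keeping the parameters in separate objects makes it easy to define a few named presets once and pick one per request.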

[Diagram: Inference Parameter Data Flow]


Sources: LLama/Abstractions/IInferenceParams.cs:10-54, LLama/Common/InferenceParams.cs:12-51, LLama.Web/Common/InferenceOptions.cs:10-42

Configuration Properties

MaxTokens

Controls the maximum number of tokens to generate during inference.

Property | Type | Default | Description
MaxTokens | int | -1 | Maximum tokens to generate. -1 means unlimited generation until completion.

Behavior:

Sources: LLama/Common/InferenceParams.cs:24, LLama/Abstractions/IInferenceParams.cs:21

AntiPrompts

A collection of sequences that trigger the model to stop generating further tokens.

Property | Type | Default | Description
AntiPrompts | IReadOnlyList<string> | [] | Sequences where the model will stop generating further tokens.

Behavior:

  • The AntipromptProcessor tracks past tokens and checks if the current output buffer ends with any of these strings.
  • In StatelessExecutor, providing an anti-prompt allows the model to decide where to stop the generation during a multi-turn simulation.

Sources: LLama/Common/InferenceParams.cs:29, LLama/Abstractions/IInferenceParams.cs:26
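
The "ends with any of these strings" check described above can be illustrated with a small self-contained sketch. This is not the library's AntipromptProcessor; the AntiPromptSketch class and its EndsWithAny method are hypothetical names used only to show the idea of suffix matching on the output buffer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of LLamaSharp) that mimics a suffix check
// against a list of anti-prompts.
public static class AntiPromptSketch
{
    // Returns true when the generated text ends with any non-empty anti-prompt.
    public static bool EndsWithAny(string generated, IReadOnlyList<string> antiPrompts)
    {
        foreach (var stop in antiPrompts)
        {
            // Empty strings are skipped; otherwise every buffer would match.
            if (stop.Length > 0 && generated.EndsWith(stop, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

A generation loop built on this idea would append each decoded piece of text to a buffer and stop as soon as EndsWithAny returns true for that buffer.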

TokensKeep

Specifies the number of tokens to preserve from the initial prompt when context shifting or eviction occurs.

Property | Type | Default | Description
TokensKeep | int | 0 | Number of tokens to keep from initial prompt during context shifting.

Behavior:

Sources: LLama/Common/InferenceParams.cs:18, LLama/Abstractions/IInferenceParams.cs:15

Context Overflow Management

These properties define how the executor handles situations where the KV cache is full, especially for models that do not support native memory shifting (e.g., 2D RoPE models).

Property | Type | Default | Description
OverflowStrategy | ContextOverflowStrategy | ThrowException | Strategy to use when the context window is full (LLama/Common/InferenceParams.cs:43).
ContextTruncationPercentage | float | 0.1f | Percentage of past tokens to discard during truncation (LLama/Common/InferenceParams.cs:50).

Strategies (ContextOverflowStrategy):

Sources: LLama/Common/InferenceParams.cs:37-50, LLama/Common/ContextOverflowStrategy.cs:7-25, LLama/Exceptions/ContextOverflowException.cs:10-12
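
A minimal sketch of the default strategy in use might look like the following. Several points here are assumptions rather than facts from this page: the namespaces LLama.Common and LLama.Exceptions are inferred from the file paths cited above, the exception is assumed to surface while enumerating InferAsync, and the model path and prompt are placeholders.

```csharp
using System;
using LLama;
using LLama.Common;
using LLama.Exceptions; // assumed namespace, inferred from the cited file path

// Placeholder path (assumption): point this at a GGUF model file on your machine.
var modelParams = new ModelParams("path/to/model.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);
var executor = new StatelessExecutor(weights, modelParams);

var inferenceParams = new InferenceParams
{
    // Keep the first 32 prompt tokens if the context is ever shifted.
    TokensKeep = 32,
    // Stated explicitly for clarity; this page lists it as the default.
    OverflowStrategy = ContextOverflowStrategy.ThrowException
};

try
{
    await foreach (var text in executor.InferAsync("A very long prompt goes here.", inferenceParams))
        Console.Write(text);
}
catch (ContextOverflowException ex)
{
    // Assumption: the overflow is reported during enumeration of the results.
    Console.Error.WriteLine($"Context window exhausted: {ex.Message}");
}
```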

SamplingPipeline

The sampling strategy used to select the next token from the model's logits.

Property | Type | Default | Description
SamplingPipeline | ISamplingPipeline | DefaultSamplingPipeline | The pipeline used for token selection and logit transformation.

Behavior:

  • Defaults to a DefaultSamplingPipeline instance (LLama/Common/InferenceParams.cs:32).
  • During the generation loop, the executor uses the pipeline to obtain the next token based on current context and logits.

Sources: LLama/Common/InferenceParams.cs:32, LLama/Abstractions/IInferenceParams.cs:31
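
For illustration, the default pipeline can be replaced with an explicitly configured one. The sketch below assumes the Temperature, TopP, and TopK properties of DefaultSamplingPipeline in the LLama.Sampling namespace; the specific values are arbitrary examples, not recommendations from this page.

```csharp
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    // Replace the default pipeline with one that has explicit sampling settings.
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.7f, // lower values make output more deterministic
        TopP = 0.9f,        // nucleus sampling threshold
        TopK = 40           // sample only from the 40 most likely tokens
    }
};
```

Because the pipeline is a property of the inference parameters, different calls on the same executor can use different sampling settings.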

DecodeSpecialTokens

Controls the behavior of token-to-text decoders regarding special tokens (like BOS, EOS, or EOT).

Property | Type | Default | Description
DecodeSpecialTokens | bool | false | If true, special tokens are converted to text; otherwise, they are omitted from the output.

Behavior:

  • Directly influences the behavior of decoders like StreamingTokenDecoder (LLama/Abstractions/IInferenceParams.cs:34-39).
  • When true, special tokens that are usually filtered out (like control tokens) will be rendered into the output string stream.

Sources: LLama/Common/InferenceParams.cs:35, LLama/Abstractions/IInferenceParams.cs:39

Implementation and Data Flow

The following diagram illustrates how InferenceParams are consumed by the StreamingTokenDecoder and AntipromptProcessor during the inference loop within an executor.

[Diagram: Inference Parameter Lifecycle]


Sources: LLama/Abstractions/IInferenceParams.cs:10-54, LLama/Common/InferenceParams.cs:12-51, LLama/Common/ContextOverflowStrategy.cs:11-25

Web-Specific Inference Options

In web applications (specifically within the LLama.Web project), the InferenceOptions class is used to map configuration to the IInferenceParams interface for use in ASP.NET Core services. This allows parameters to be bound from appsettings.json or received via API requests.

Class | Purpose
InferenceOptions | Implements IInferenceParams for use in ASP.NET Core dependency injection and configuration binding (LLama.Web/Common/InferenceOptions.cs:10-12).

The LLama.Web project also utilizes an AsyncLock to manage concurrent access to models and contexts, ensuring that inference parameters are applied safely in a multi-user environment.

Sources: LLama.Web/Common/InferenceOptions.cs:10-42

Summary of Mirostat Types

While configured via the SamplingPipeline, the MirostatType enum is defined alongside inference parameters to categorize the version of the Mirostat algorithm used.

Type | Value | Description
Disable | 0 | Disable Mirostat sampling (LLama/Common/InferenceParams.cs:62).
Mirostat | 1 | Original Mirostat algorithm (LLama/Common/InferenceParams.cs:67).
Mirostat2 | 2 | Mirostat 2.0 algorithm (LLama/Common/InferenceParams.cs:72).

Sources: LLama/Common/InferenceParams.cs:57-73