Last indexed: 18 May 2026 (ecd184)

Sampling and Token Selection

This document provides an overview of how LLamaSharp selects tokens during text generation. It covers the sampling pipeline architecture, the token selection process, and the data structures used to transform raw model outputs (logits) into selected tokens.

For detailed information about specific sampling implementations and parameters, see Sampling Pipeline Overview and DefaultSamplingPipeline. For creating custom sampling logic, see Custom Samplers. For constraining output with grammars, see Grammar-Constrained Generation. For streaming token output, see Token Streaming and Decoding.

Overview of Token Sampling

Token sampling is the process of selecting the next token from a probability distribution produced by a language model. During inference, the model generates a vector of logits (raw, unnormalized scores), one for each token in the vocabulary. The sampling pipeline transforms these logits into a single selected token using strategies that trade off randomness against determinism.
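To make the logits-to-token step concrete, here is a minimal sketch in Python (not LLamaSharp code). The function names `softmax` and `sample_token` and the toy three-token vocabulary are assumptions made for illustration only.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng):
    """Draw one token id according to the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical 3-token vocabulary
probs = softmax(logits)

# Greedy selection always returns the index of the largest logit.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Random selection depends on the seed of the generator passed in.
sampled = sample_token(logits, random.Random(42))
```

Greedy selection ignores the shape of the distribution entirely, while `sample_token` can return any token with non-zero probability.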

The quality of text generation depends heavily on the sampling strategy. Different approaches produce different characteristics:

Greedy sampling: Always selects the highest-probability token (deterministic, but prone to repetition).
Temperature sampling: Divides logits by a temperature value before normalization to control randomness.
Top-K sampling: Limits selection to the K most likely tokens.
Top-P (nucleus) sampling: Limits selection to the smallest set of tokens whose cumulative probability reaches P, so the candidate pool grows or shrinks with the model's confidence.
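The two candidate-limiting strategies listed above can be sketched as follows. This is an illustrative Python version under simple assumptions (candidates are `(token_id, logit)` pairs, and the helper names are hypothetical); it does not reproduce the native sampler implementations.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(candidates, k):
    """Keep the k highest-logit (token_id, logit) pairs (at least one)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:max(1, k)]

def top_p(candidates, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    probs = softmax([logit for _, logit in ranked])
    kept, cumulative = [], 0.0
    for cand, prob in zip(ranked, probs):
        kept.append(cand)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

candidates = list(enumerate([3.0, 2.0, 1.0, 0.0, -1.0]))
after_k = top_k(candidates, 3)    # fixed-size pool
after_p = top_p(candidates, 0.9)  # pool size depends on the distribution
narrow = top_p(candidates, 0.5)   # a confident distribution needs fewer tokens
```

Top-K always keeps the same number of candidates, whereas the top-P pool shrinks when one token dominates the distribution.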

LLamaSharp provides a flexible pipeline architecture that allows combining multiple sampling techniques and adding custom logic.

Sources: LLama/Sampling/ISamplingPipeline.cs9-37 LLama/Sampling/DefaultSamplingPipeline.cs8-120

Sampling Pipeline Architecture

Diagram: Sampling System Architecture

The sampling architecture consists of three layers:

Interface Layer: ISamplingPipeline defines the contract for all sampling implementations LLama/Sampling/ISamplingPipeline.cs9-11
Pipeline Layer: Concrete implementations like DefaultSamplingPipeline and GreedySamplingPipeline configure sampling behavior LLama/Sampling/DefaultSamplingPipeline.cs11-13 LLama/Sampling/GreedySamplingPipeline.cs8-10
Native Layer: SafeLLamaSamplerChainHandle wraps the native llama_sampler_chain LLama/Native/SafeLLamaSamplerHandle.cs12-15

The BaseSamplingPipeline base class manages the lifecycle of the native sampler chain: it creates the chain on first use, caches it, and disposes it with the pipeline LLama/Sampling/BaseSamplingPipeline.cs10-32. Implementations override CreateChain() to configure the chain with specific samplers LLama/Sampling/DefaultSamplingPipeline.cs171-206.
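The create-once, cache, and dispose pattern described here can be sketched in Python as below. The class and method names (`BasePipeline`, `create_chain`, `chain`, `dispose`) are hypothetical stand-ins, and the "chain" is just a list of strings rather than a native handle.

```python
from abc import ABC, abstractmethod

class BasePipeline(ABC):
    """Hypothetical base class that owns a lazily built sampler chain."""

    def __init__(self):
        self._chain = None

    @abstractmethod
    def create_chain(self):
        """Subclasses return their configured chain of sampler steps."""

    def chain(self):
        # Build the chain on first use, then reuse the cached instance.
        if self._chain is None:
            self._chain = self.create_chain()
        return self._chain

    def dispose(self):
        # Drop the cached chain; a native wrapper would free memory here.
        self._chain = None

class GreedyPipeline(BasePipeline):
    def __init__(self):
        super().__init__()
        self.builds = 0  # counts how often create_chain actually runs

    def create_chain(self):
        self.builds += 1
        return ["greedy"]

pipeline = GreedyPipeline()
first = pipeline.chain()
second = pipeline.chain()  # served from the cache, no rebuild
```

Keeping construction in a single overridable method means each concrete pipeline only has to describe which samplers it wants, not how the chain's lifetime is managed.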

Sources: LLama/Sampling/ISamplingPipeline.cs6-37 LLama/Sampling/BaseSamplingPipeline.cs7-72 LLama/Sampling/DefaultSamplingPipeline.cs11-207 LLama/Native/SafeLLamaSamplerHandle.cs13-159 LLama/Sampling/GreedySamplingPipeline.cs8-27

ISamplingPipeline Interface

The ISamplingPipeline interface defines the core operations for token sampling:

Method	Purpose
`Sample(SafeLLamaContextHandle, int)`	Select a single token from logits at a given position LLama/Sampling/ISamplingPipeline.cs18
`Apply(SafeLLamaContextHandle, LLamaTokenDataArray)`	Apply the sampling pipeline to a token data array without selecting LLama/Sampling/ISamplingPipeline.cs25
`Reset()`	Reset all internal state (e.g., grammar state) LLama/Sampling/ISamplingPipeline.cs30
`Accept(LLamaToken)`	Update the pipeline after a token has been accepted LLama/Sampling/ISamplingPipeline.cs36
`Dispose()`	Clean up native resources LLama/Sampling/ISamplingPipeline.cs10

The Sample method is the primary entry point: it extracts logits from the context and applies the sampling chain to return the selected token LLama/Sampling/BaseSamplingPipeline.cs35-40. The Accept method updates stateful components, such as grammar parsers, after a token is generated LLama/Native/SafeLLamaSamplerHandle.cs106-112.
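A rough Python sketch of how the sample, accept, and reset operations might interact during generation is shown below. The toy pipeline, its method names, and the `generate` helper are hypothetical; in particular, whether the caller or the pipeline itself triggers the accept step is an assumption of this sketch, not a statement about LLamaSharp.

```python
class CountingGreedyPipeline:
    """Toy pipeline: picks the highest logit and remembers accepted tokens."""

    def __init__(self):
        self.history = []

    def sample(self, logits):
        return max(range(len(logits)), key=lambda i: logits[i])

    def accept(self, token):
        # Stateful components (for example a grammar parser) would advance here.
        self.history.append(token)

    def reset(self):
        self.history.clear()

def generate(pipeline, logits_per_step):
    """Run one sample/accept cycle for each step's logits."""
    output = []
    for logits in logits_per_step:
        token = pipeline.sample(logits)
        pipeline.accept(token)
        output.append(token)
    return output

pipeline = CountingGreedyPipeline()
tokens = generate(pipeline, [[0.1, 0.9], [0.7, 0.2], [0.0, 1.5]])
history_before_reset = list(pipeline.history)
pipeline.reset()  # start the next generation with clean state
```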

Sources: LLama/Sampling/ISamplingPipeline.cs9-37 LLama/Sampling/BaseSamplingPipeline.cs35-71

Token Data Structures

Diagram: Logit to Token Transformation

The token selection process uses two primary data structures:

LLamaTokenDataArray

LLamaTokenDataArray is a managed wrapper around token data LLama/Native/LLamaTokenDataArray.cs12-13. It stores an array of LLamaTokenData structs, each containing an ID, Logit, and Probability. The Softmax() method sorts the array and computes probabilities using TensorPrimitives.SoftMax LLama/Native/LLamaTokenDataArray.cs95-118.
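The shape of this structure, and the sort-then-normalize step, can be approximated in Python as follows. The `TokenData` dataclass and `softmax_in_place` function are illustrative names only, and the descending sort order is an assumption of this sketch.

```python
import math
from dataclasses import dataclass

@dataclass
class TokenData:
    id: int
    logit: float
    probability: float = 0.0

def softmax_in_place(items):
    """Sort by logit (descending, assumed) and fill in probabilities."""
    if not items:
        return
    items.sort(key=lambda t: t.logit, reverse=True)
    peak = items[0].logit  # largest logit after sorting
    exps = [math.exp(t.logit - peak) for t in items]
    total = sum(exps)
    for token, e in zip(items, exps):
        token.probability = e / total

tokens = [TokenData(id=i, logit=l) for i, l in enumerate([0.5, 2.5, 1.5])]
softmax_in_place(tokens)
ordered_ids = [t.id for t in tokens]
```

Storing the ID alongside the logit matters because sorting and filtering reorder the array, so the array index no longer identifies the token.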

LLamaTokenDataArrayNative

LLamaTokenDataArrayNative is a native structure that matches the llama_token_data_array layout LLama/Native/LLamaTokenDataArray.cs136-137. It contains a pointer to pinned token data (_data), the size of the array (_size), and the index of the selected token (_selected) LLama/Native/LLamaTokenDataArray.cs142-155.

Sources: LLama/Native/LLamaTokenDataArray.cs12-204

Common Sampling Parameters

The DefaultSamplingPipeline exposes parameters that control token selection behavior:

Penalty Parameters

Parameter	Type	Default	Purpose
`RepeatPenalty`	float	1.0	Reduces probability of tokens that already appear LLama/Sampling/DefaultSamplingPipeline.cs22
`FrequencyPenalty`	float	0.0	Penalizes tokens based on their existing frequency LLama/Sampling/DefaultSamplingPipeline.cs29-41
`PresencePenalty`	float	0.0	Penalizes tokens based on whether they appear at all LLama/Sampling/DefaultSamplingPipeline.cs48-60
`PenaltyCount`	int	64	Number of tokens to consider for penalties LLama/Sampling/DefaultSamplingPipeline.cs65
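As a rough illustration of how these four parameters could combine, the Python sketch below follows a formulation commonly used for repetition penalties (divide positive logits by the repeat penalty, multiply negative ones, then subtract count-based and flat penalties). Treat the exact formula as an assumption; the function name and arguments are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, recent_tokens, repeat=1.0, frequency=0.0,
                    presence=0.0, penalty_count=64):
    """Return a penalized copy of logits based on the last penalty_count tokens."""
    window = recent_tokens[-penalty_count:] if penalty_count > 0 else []
    counts = Counter(window)
    adjusted = list(logits)
    for token, count in counts.items():
        value = adjusted[token]
        # Repeat penalty: shrink positive logits, push negative ones further down.
        value = value / repeat if value > 0 else value * repeat
        # Frequency grows with the count; presence is a flat cost per seen token.
        value -= count * frequency + presence
        adjusted[token] = value
    return adjusted

logits = [2.0, -1.0, 0.5]
recent = [0, 0, 1]  # token 0 seen twice, token 1 once, token 2 never
unchanged = apply_penalties(logits, recent)  # default values are neutral
penalized = apply_penalties(logits, recent, repeat=2.0,
                            frequency=0.1, presence=0.5)
```

With the defaults from the table (repeat penalty 1.0, frequency and presence penalties 0.0), the logits pass through unmodified, so penalties are effectively opt-in.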

Sampling Strategy Parameters

Parameter	Type	Default	Purpose
`Temperature`	float	0.75	Controls randomness (higher values flatten the distribution and produce more varied output) LLama/Sampling/DefaultSamplingPipeline.cs80
`TopK`	int	40	Limits selection to K most likely tokens LLama/Sampling/DefaultSamplingPipeline.cs85
`TopP`	float	0.9	Nucleus sampling: cumulative probability threshold LLama/Sampling/DefaultSamplingPipeline.cs95
`MinP`	float	0.1	Discards tokens whose probability is below this fraction of the most likely token's probability LLama/Sampling/DefaultSamplingPipeline.cs100
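The effect of Temperature and MinP can be shown with a small Python sketch. The helper names are hypothetical and the logits are made up; the example uses the default values from the table (0.75 and 0.1).

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (must be greater than 0)."""
    return [x / temperature for x in logits]

def min_p_filter(logits, min_p):
    """Return ids of tokens with probability >= min_p * (top probability)."""
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]

logits = [4.0, 3.0, 0.0]
base = softmax(logits)
sharper = softmax(apply_temperature(logits, 0.75))  # below 1.0 sharpens
flatter = softmax(apply_temperature(logits, 2.0))   # above 1.0 flattens
survivors = min_p_filter(logits, 0.1)                # token 2 falls below the cutoff
```

Because the MinP cutoff scales with the top token's probability, it removes more candidates when the model is confident and fewer when the distribution is flat.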

Sources: LLama/Sampling/DefaultSamplingPipeline.cs14-120

Sampler Chain Construction

The DefaultSamplingPipeline.CreateChain() method builds the sampler chain in a specific order to ensure proper filtering and transformation of logits:

Logit Bias: Applied first to manually adjust specific token probabilities LLama/Sampling/DefaultSamplingPipeline.cs175-193
Penalties: Repetition, frequency, and presence penalties LLama/Sampling/DefaultSamplingPipeline.cs195
Filters: TopK, TypicalP, TopP, and MinP filters are applied in sequence LLama/Sampling/DefaultSamplingPipeline.cs197-200
Temperature: Scales logits to control randomness LLama/Sampling/DefaultSamplingPipeline.cs201
Distribution Sampler: Final selection from remaining tokens using the random seed LLama/Sampling/DefaultSamplingPipeline.cs203
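The idea of running an ordered chain of steps and finishing with a random draw can be sketched in Python as follows. This is a simplified illustration with only three of the stages (bias, top-K, temperature); the factory functions and `run_chain` are hypothetical names, and candidates are modeled as a `{token_id: logit}` dictionary.

```python
import math
import random

def logit_bias(bias):
    """Step that adds a per-token offset to selected logits."""
    def step(cands):
        return {t: l + bias.get(t, 0.0) for t, l in cands.items()}
    return step

def top_k(k):
    """Step that keeps only the k highest-logit candidates."""
    def step(cands):
        best = sorted(cands, key=cands.get, reverse=True)[:k]
        return {t: cands[t] for t in best}
    return step

def temperature(temp):
    """Step that divides the surviving logits by the temperature."""
    def step(cands):
        return {t: l / temp for t, l in cands.items()}
    return step

def run_chain(steps, logits, rng):
    """Apply each step in order, then sample from what remains."""
    cands = dict(enumerate(logits))
    for step in steps:
        cands = step(cands)
    tokens = list(cands)
    peak = max(cands.values())
    weights = [math.exp(cands[t] - peak) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Bias first, filter next, temperature last before the final draw.
chain = [logit_bias({3: -100.0}), top_k(2), temperature(0.75)]
token = run_chain(chain, [1.0, 2.0, 0.5, 5.0], random.Random(0))
```

Order matters: because the bias runs before the filter, the heavily penalized token 3 is removed by top-K even though it had the highest raw logit.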

Sources: LLama/Sampling/DefaultSamplingPipeline.cs171-206

Grammar-Constrained Generation

Grammar-constrained generation uses GBNF (GGML BNF) to restrict the tokens that can be sampled at any given step. This is useful for ensuring the model outputs structured data like JSON or follows specific formatting rules.

Grammar Application: Grammars are integrated into the pipeline via SafeLLamaSamplerChainHandle.AddGrammar LLama/Sampling/DefaultSamplingPipeline.cs166
Optimization: GrammarOptimizationMode allows for performance tuning when applying constraints LLama/Sampling/DefaultSamplingPipeline.cs120
State Management: The pipeline must Accept tokens to update the internal state of the grammar parser LLama/Sampling/DefaultSamplingPipeline.cs153-158
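The interplay between masking and state updates can be illustrated with a toy example in Python. The `ToyGrammar` class below is a hand-written state machine, not a GBNF parser, and every name in it is hypothetical; it only shows why each accepted token must be fed back before the next step is constrained.

```python
NEG_INF = float("-inf")

class ToyGrammar:
    """Hypothetical grammar accepting exactly: '{', then 'yes' or 'no', then '}'."""

    TRANSITIONS = {
        "start": {"{": "value"},
        "value": {"yes": "close", "no": "close"},
        "close": {"}": "done"},
        "done": {},
    }

    def __init__(self):
        self.state = "start"

    def allowed(self):
        return set(self.TRANSITIONS[self.state])

    def accept(self, piece):
        # Advance the parser; later masks depend on this update.
        self.state = self.TRANSITIONS[self.state][piece]

    def reset(self):
        self.state = "start"

def constrain(vocab, logits, grammar):
    """Mask every token the grammar does not allow in its current state."""
    ok = grammar.allowed()
    return [l if piece in ok else NEG_INF for piece, l in zip(vocab, logits)]

vocab = ["{", "}", "yes", "no", "maybe"]
steps = [
    [0.1, 0.2, 3.0, 0.0, 0.5],  # model prefers "yes", grammar forces "{"
    [0.0, 0.0, 1.0, 0.5, 9.0],  # model prefers "maybe", grammar allows yes/no
    [5.0, 0.1, 0.0, 0.0, 0.0],  # model prefers "{", grammar forces "}"
]
grammar = ToyGrammar()
output = []
for logits in steps:
    masked = constrain(vocab, logits, grammar)
    best = max(range(len(vocab)), key=lambda i: masked[i])
    grammar.accept(vocab[best])
    output.append(vocab[best])
```

If the accept step were skipped, the grammar would stay in its start state and keep forcing the opening brace on every step.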

For details, see Grammar-Constrained Generation.

Sources: LLama/Sampling/DefaultSamplingPipeline.cs103-125 LLama/Sampling/DefaultSamplingPipeline.cs160-168 LLama.Examples/Examples/BatchedExecutorBoolQ.cs13 LLama/Sampling/GreedySamplingPipeline.cs14-22


URL: https://deepwiki.com/SciSharp/LLamaSharp/4-sampling-and-token-selection