URL: https://deepwiki.com/SciSharp/LLamaSharp/4-sampling-and-token-selection

Last indexed: 18 May 2026 (ecd184)

Sampling and Token Selection

This document provides an overview of how LLamaSharp selects tokens during text generation. It covers the sampling pipeline architecture, the token selection process, and the data structures used to transform raw model outputs (logits) into selected tokens.

For detailed information about specific sampling implementations and parameters, see Sampling Pipeline Overview and DefaultSamplingPipeline. For creating custom sampling logic, see Custom Samplers. For constraining output with grammars, see Grammar-Constrained Generation. For streaming token output, see Token Streaming and Decoding.

Overview of Token Sampling

Token sampling is the process of selecting the next token from a probability distribution produced by a language model. During inference, the model produces a vector of logits (raw, unnormalized scores), one per token in the vocabulary. The sampling pipeline turns these logits into a single selected token, using strategies that trade off randomness against determinism.

The quality of generated text depends heavily on the sampling strategy. Common strategies behave differently:

  • Greedy sampling: Always selects the highest-probability token (deterministic, but prone to repetition).
  • Temperature sampling: Scales the logits before normalization to control randomness.
  • Top-K sampling: Limits selection to the K most likely tokens.
  • Top-P (nucleus) sampling: Dynamically adjusts the candidate pool based on cumulative probability.

LLamaSharp provides a flexible pipeline architecture that allows combining multiple sampling techniques and adding custom logic.
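
The difference between greedy selection and temperature sampling can be sketched in a few lines. This is an illustrative Python sketch of the general technique, not LLamaSharp code; the function names (softmax, greedy, sample) and the example logits are assumptions made for the illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Return the index of the largest logit (fully deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]          # hypothetical three-token vocabulary
rng = random.Random(0)            # seeded so repeated runs agree

print(greedy(logits))                    # always index 0
print(softmax(logits, temperature=0.5))  # sharper: more mass on index 0
print(softmax(logits, temperature=2.0))  # flatter: mass spreads out
print(sample(logits, 0.75, rng))         # one random draw
```

Lower temperatures push the distribution toward the greedy choice, while higher temperatures make less likely tokens more competitive.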

Sources: LLama/Sampling/ISamplingPipeline.cs9-37 LLama/Sampling/DefaultSamplingPipeline.cs8-120

Sampling Pipeline Architecture

Diagram: Sampling System Architecture


The sampling architecture consists of three layers:

  1. Interface Layer: ISamplingPipeline defines the contract for all sampling implementations LLama/Sampling/ISamplingPipeline.cs9-11
  2. Pipeline Layer: Concrete implementations like DefaultSamplingPipeline and GreedySamplingPipeline configure sampling behavior LLama/Sampling/DefaultSamplingPipeline.cs11-13 LLama/Sampling/GreedySamplingPipeline.cs8-10
  3. Native Layer: SafeLLamaSamplerChainHandle wraps the native llama_sampler_chain LLama/Native/SafeLLamaSamplerHandle.cs12-15

The BaseSamplingPipeline base class manages the lifecycle of the native sampler chain, caching it after creation and disposing it properly LLama/Sampling/BaseSamplingPipeline.cs10-32. Implementations override CreateChain() to configure the chain with specific samplers LLama/Sampling/DefaultSamplingPipeline.cs171-206.
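
The create-once, cache, and dispose pattern described above can be illustrated with a small Python sketch. All names here (BasePipeline, create_chain, BiasedGreedyPipeline, builds) are hypothetical stand-ins, and the "chain" is a plain list of functions rather than a native handle; the sketch shows only the shape of the pattern, not the library's implementation.

```python
class BasePipeline:
    """Hypothetical base class: builds its chain lazily and caches it."""

    def __init__(self):
        self._chain = None  # built on first use, then reused
        self.builds = 0     # illustration only: counts chain constructions

    def create_chain(self):
        """Subclasses return a list of callables that transform a logit list."""
        raise NotImplementedError

    def _chain_or_create(self):
        if self._chain is None:
            self._chain = self.create_chain()
            self.builds += 1
        return self._chain

    def sample(self, logits):
        for step in self._chain_or_create():
            logits = step(logits)
        # Final selection is greedy to keep the sketch short.
        return max(range(len(logits)), key=lambda i: logits[i])

    def dispose(self):
        self._chain = None  # stands in for releasing the native chain


class BiasedGreedyPipeline(BasePipeline):
    """Hypothetical subclass that configures the chain with one bias step."""

    def __init__(self, bias):
        super().__init__()
        self.bias = bias  # {token index: logit offset}

    def create_chain(self):
        def add_bias(logits):
            return [x + self.bias.get(i, 0.0) for i, x in enumerate(logits)]
        return [add_bias]


pipeline = BiasedGreedyPipeline({2: 5.0})
first = pipeline.sample([1.0, 3.0, 0.5])   # chain is built here
second = pipeline.sample([4.0, 0.0, 0.0])  # cached chain is reused
print(first, second, pipeline.builds)
pipeline.dispose()
```

Keeping construction in one overridable method means subclasses only describe which samplers go into the chain, while the base class owns creation and cleanup.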

Sources: LLama/Sampling/ISamplingPipeline.cs6-37 LLama/Sampling/BaseSamplingPipeline.cs7-72 LLama/Sampling/DefaultSamplingPipeline.cs11-207 LLama/Native/SafeLLamaSamplerHandle.cs13-159 LLama/Sampling/GreedySamplingPipeline.cs8-27

ISamplingPipeline Interface

The ISamplingPipeline interface defines the core operations for token sampling:

Method | Purpose | Source
Sample(SafeLLamaContextHandle, int) | Select a single token from logits at a given position | LLama/Sampling/ISamplingPipeline.cs18
Apply(SafeLLamaContextHandle, LLamaTokenDataArray) | Apply the sampling pipeline to a token data array without selecting | LLama/Sampling/ISamplingPipeline.cs25
Reset() | Reset all internal state (e.g., grammar state) | LLama/Sampling/ISamplingPipeline.cs30
Accept(LLamaToken) | Update the pipeline after a token has been accepted | LLama/Sampling/ISamplingPipeline.cs36
Dispose() | Clean up native resources | LLama/Sampling/ISamplingPipeline.cs10

The Sample method is the primary entry point, extracting logits from the context and applying the sampling chain to return the selected token LLama/Sampling/BaseSamplingPipeline.cs35-40. The Accept method updates stateful components like grammar parsers after a token is generated LLama/Native/SafeLLamaSamplerHandle.cs106-112.
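
How sample, apply, accept, and reset interact in a generation loop can be sketched with a toy stateful pipeline. This Python sketch is not the LLamaSharp interface: the class NoImmediateRepeatPipeline and its rule (never pick the token accepted last) are invented for illustration, and the methods take a plain list of logits instead of a context handle.

```python
class NoImmediateRepeatPipeline:
    """Hypothetical stateful pipeline: never picks the token accepted last."""

    def __init__(self):
        self._last = None

    def apply(self, logits):
        """Transform the logits without selecting a token."""
        return [float("-inf") if i == self._last else x
                for i, x in enumerate(logits)]

    def sample(self, logits):
        """Apply the transformation, then select the best remaining token."""
        adjusted = self.apply(logits)
        return max(range(len(adjusted)), key=lambda i: adjusted[i])

    def accept(self, token):
        """Record the token that was actually emitted."""
        self._last = token

    def reset(self):
        """Forget all state, e.g. before starting a new sequence."""
        self._last = None


pipeline = NoImmediateRepeatPipeline()
logits = [0.2, 1.5, 0.9]  # hypothetical, reused at every step for simplicity
generated = []
for _ in range(4):
    token = pipeline.sample(logits)
    pipeline.accept(token)  # state changes only once the token is accepted
    generated.append(token)

print(generated)  # [1, 2, 1, 2]
pipeline.reset()
print(pipeline.sample(logits))  # 1, because the state was cleared
```

Separating sample from accept lets a caller inspect or override a selection before the pipeline's state is advanced.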

Sources: LLama/Sampling/ISamplingPipeline.cs9-37 LLama/Sampling/BaseSamplingPipeline.cs35-71

Token Data Structures

Diagram: Logit to Token Transformation


The token selection process uses two primary data structures:

LLamaTokenDataArray

LLamaTokenDataArray is a managed wrapper around token data LLama/Native/LLamaTokenDataArray.cs12-13. It stores an array of LLamaTokenData structs, each containing an ID, Logit, and Probability. The Softmax() method sorts the array and computes probabilities using TensorPrimitives.SoftMax LLama/Native/LLamaTokenDataArray.cs95-118.
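
The sort-then-normalize behavior can be illustrated with a minimal Python sketch. The TokenData class and softmax_in_place function are hypothetical analogues written for this example; sorting by descending logit is an assumption about the sort order, and the real type is a C# struct array.

```python
import math

class TokenData:
    """Hypothetical analogue of one entry: token id, logit, probability."""

    def __init__(self, token_id, logit):
        self.id = token_id
        self.logit = logit
        self.probability = 0.0  # filled in by softmax_in_place

def softmax_in_place(tokens):
    """Sort by descending logit, then fill in each token's probability."""
    tokens.sort(key=lambda t: t.logit, reverse=True)
    peak = tokens[0].logit  # largest logit, used for numerical stability
    exps = [math.exp(t.logit - peak) for t in tokens]
    total = sum(exps)
    for token, e in zip(tokens, exps):
        token.probability = e / total

logits = [0.5, 2.0, -1.0]  # hypothetical logits for token ids 0, 1, 2
tokens = [TokenData(i, x) for i, x in enumerate(logits)]
softmax_in_place(tokens)

for t in tokens:
    print(t.id, round(t.probability, 3))
```

After the call, the array order no longer matches token ids, which is why each entry carries its own id.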

LLamaTokenDataArrayNative

LLamaTokenDataArrayNative is a native structure that matches the llama_token_data_array layout LLama/Native/LLamaTokenDataArray.cs136-137. It contains a pointer to pinned token data (_data), the size of the array (_size), and the index of the selected token (_selected) LLama/Native/LLamaTokenDataArray.cs142-155.
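
To make the idea of a layout-compatible structure concrete, here is a Python ctypes sketch. The field types are an assumption based on the llama_token_data_array struct as declared in llama.cpp's public header at the time of writing (including a sorted flag that the text above does not mention); treat them as illustrative and check the header for the version in use.

```python
import ctypes

class TokenData(ctypes.Structure):
    # Assumed layout of one entry: 32-bit token id, logit, probability.
    _fields_ = [
        ("id", ctypes.c_int32),
        ("logit", ctypes.c_float),
        ("p", ctypes.c_float),
    ]

class TokenDataArray(ctypes.Structure):
    # Assumed layout: pointer to the entries, count, selected index, sorted flag.
    _fields_ = [
        ("data", ctypes.POINTER(TokenData)),
        ("size", ctypes.c_size_t),
        ("selected", ctypes.c_int64),
        ("sorted", ctypes.c_bool),
    ]

logits = [0.5, 2.0, -1.0]  # hypothetical logits for token ids 0, 1, 2
# The buffer must stay alive for as long as the array points at it,
# which is the role pinning plays on the managed side.
buffer = (TokenData * len(logits))(
    *[TokenData(i, x, 0.0) for i, x in enumerate(logits)]
)
array = TokenDataArray(
    ctypes.cast(buffer, ctypes.POINTER(TokenData)), len(logits), -1, False
)

# Stand-in for a native sampler: record the index of the best entry.
array.selected = max(range(array.size), key=lambda i: array.data[i].logit)
chosen = array.data[array.selected].id  # an index is stored, not a token id
print(chosen)
```

Because the selected field is an index into the entries, the token id has to be read back from the entry at that index.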

Sources: LLama/Native/LLamaTokenDataArray.cs12-204

Common Sampling Parameters

The DefaultSamplingPipeline exposes parameters that control token selection behavior:

Penalty Parameters

Parameter | Type | Default | Purpose | Source
RepeatPenalty | float | 1.0 | Reduces probability of tokens that already appear | LLama/Sampling/DefaultSamplingPipeline.cs22
FrequencyPenalty | float | 0.0 | Penalizes tokens based on their existing frequency | LLama/Sampling/DefaultSamplingPipeline.cs29-41
PresencePenalty | float | 0.0 | Penalizes tokens based on whether they appear at all | LLama/Sampling/DefaultSamplingPipeline.cs48-60
PenaltyCount | int | 64 | Number of tokens to consider for penalties | LLama/Sampling/DefaultSamplingPipeline.cs65
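
The following Python sketch shows one common formulation of these three penalties, using the defaults from the table. It is an assumption that the native sampler uses exactly this arithmetic; apply_penalties and its parameter names are hypothetical and exist only for this illustration.

```python
from collections import Counter

def apply_penalties(logits, history, repeat_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0,
                    penalty_count=64):
    """Return a penalized copy of logits based on recently generated tokens."""
    # Only the last penalty_count tokens are considered.
    recent = history[-penalty_count:] if penalty_count > 0 else []
    counts = Counter(recent)
    out = list(logits)
    for token, count in counts.items():
        # Repeat penalty: shrink positive logits, push negative ones lower.
        if out[token] > 0:
            out[token] /= repeat_penalty
        else:
            out[token] *= repeat_penalty
        # Frequency scales with the count; presence is a flat cost.
        out[token] -= count * frequency_penalty + presence_penalty
    return out

logits = [2.0, 1.0, -1.0, 0.5]  # hypothetical four-token vocabulary
history = [0, 0, 2]             # token 0 appeared twice, token 2 once

print(apply_penalties(logits, history))  # defaults leave logits unchanged
print(apply_penalties(logits, history, repeat_penalty=2.0,
                      frequency_penalty=0.1, presence_penalty=0.5))
```

With the default values (1.0, 0.0, 0.0) every step is a no-op, so penalties only take effect once a parameter is changed.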

Sampling Strategy Parameters

Parameter | Type | Default | Purpose | Source
Temperature | float | 0.75 | Controls randomness (higher values flatten the distribution and increase variety) | LLama/Sampling/DefaultSamplingPipeline.cs80
TopK | int | 40 | Limits selection to the K most likely tokens | LLama/Sampling/DefaultSamplingPipeline.cs85
TopP | float | 0.9 | Nucleus sampling: cumulative probability threshold | LLama/Sampling/DefaultSamplingPipeline.cs95
MinP | float | 0.1 | Minimum probability threshold, relative to the most likely token | LLama/Sampling/DefaultSamplingPipeline.cs100
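
The three filters can be sketched as functions over (token, probability) pairs. This Python sketch illustrates the general techniques with the default values from the table; the function names and the example distribution are assumptions, and the native implementations may differ in details such as tie handling.

```python
def top_k(probs, k=40):
    """Keep the k most likely (token, probability) pairs."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    return ranked[:k]

def top_p(probs, p=0.9):
    """Keep the shortest high-probability prefix whose mass reaches p."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return kept

def min_p(probs, threshold=0.1):
    """Keep tokens at least threshold times as likely as the best token."""
    best = max(prob for _, prob in probs)
    return [(t, q) for t, q in probs if q >= threshold * best]

def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(q for _, q in probs)
    return [(t, q / total) for t, q in probs]

# Hypothetical distribution over five tokens.
probs = [(0, 0.5), (1, 0.3), (2, 0.15), (3, 0.04), (4, 0.01)]

print(top_k(probs, k=3))
print(top_p(probs, p=0.9))
print(renormalize(min_p(probs, threshold=0.1)))
```

Top-K keeps a fixed number of candidates, while Top-P and MinP adapt the pool size to how peaked the distribution is.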

Sources: LLama/Sampling/DefaultSamplingPipeline.cs14-120

Sampler Chain Construction

The DefaultSamplingPipeline.CreateChain() method builds the sampler chain in a specific order to ensure proper filtering and transformation of logits:

  1. Logit Bias: Applied first to manually adjust specific token probabilities LLama/Sampling/DefaultSamplingPipeline.cs175-193
  2. Penalties: Repetition, frequency, and presence penalties LLama/Sampling/DefaultSamplingPipeline.cs195
  3. Filters: TopK, TypicalP, TopP, and MinP filters are applied in sequence LLama/Sampling/DefaultSamplingPipeline.cs197-200
  4. Temperature: Scales logits to control randomness LLama/Sampling/DefaultSamplingPipeline.cs201
  5. Distribution Sampler: Final selection from remaining tokens using the random seed LLama/Sampling/DefaultSamplingPipeline.cs203
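
The ordering above can be sketched as a list of steps applied in sequence, followed by a seeded draw. This Python sketch is a simplified illustration, not the CreateChain() implementation: it includes only a logit bias, one filter (Top-K), temperature, and the final distribution sampler, and all function names are hypothetical.

```python
import math
import random

def logit_bias(bias):
    """Step that adds a fixed offset to selected tokens' logits."""
    return lambda logits: {t: x + bias.get(t, 0.0) for t, x in logits.items()}

def top_k(k):
    """Step that keeps only the k tokens with the largest logits."""
    def step(logits):
        keep = sorted(logits, key=logits.get, reverse=True)[:k]
        return {t: logits[t] for t in keep}
    return step

def temperature(temp):
    """Step that divides every remaining logit by the temperature."""
    return lambda logits: {t: x / temp for t, x in logits.items()}

def sample_dist(logits, rng):
    """Final step: draw one token from the softmax of what remains."""
    peak = max(logits.values())
    weights = [math.exp(x - peak) for x in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

def run_chain(logits, steps, seed):
    for step in steps:  # order matters: bias, then filter, then temperature
        logits = step(logits)
    return sample_dist(logits, random.Random(seed))

# Hypothetical logits keyed by token id; token 3 is strongly suppressed.
logits = {0: 1.0, 1: 2.5, 2: 0.3, 3: 4.0}
chain = [logit_bias({3: -100.0}), top_k(2), temperature(0.75)]

token = run_chain(logits, chain, seed=42)
print(token)  # either 0 or 1; the same seed gives the same token
```

Applying the bias first means a suppressed token is already out of contention by the time the filters pick their candidates.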

Sources: LLama/Sampling/DefaultSamplingPipeline.cs171-206

Grammar-Constrained Generation

Grammar-constrained generation uses GBNF (GGML BNF) to restrict the tokens that can be sampled at any given step. This is useful for ensuring the model outputs structured data like JSON or follows specific formatting rules.

For details, see Grammar-Constrained Generation.
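
The core idea, masking out tokens the grammar does not allow before selection, can be shown with a toy constraint. This Python sketch does not parse GBNF: the YesNoGrammar class, its vocabulary, and the constrained_pick function are hypothetical and stand in for a real grammar engine.

```python
VOCAB = ["yes", "no", "maybe", "<end>"]  # hypothetical four-token vocabulary

class YesNoGrammar:
    """Toy constraint: exactly one of "yes" or "no", followed by "<end>"."""

    def __init__(self):
        self.answered = False

    def allowed(self):
        return {"<end>"} if self.answered else {"yes", "no"}

    def accept(self, token):
        if token not in self.allowed():
            raise ValueError(f"token {token!r} violates the grammar")
        self.answered = True

def constrained_pick(logits, grammar):
    """Mask disallowed tokens with -inf, then pick the best remaining one."""
    allowed = grammar.allowed()
    masked = [x if VOCAB[i] in allowed else float("-inf")
              for i, x in enumerate(logits)]
    return VOCAB[max(range(len(masked)), key=lambda i: masked[i])]

grammar = YesNoGrammar()
first = constrained_pick([0.1, 0.4, 3.0, 0.2], grammar)   # "maybe" is masked
grammar.accept(first)
second = constrained_pick([2.0, 1.0, 3.0, 0.5], grammar)  # only "<end>" is left
print(first, second)  # no <end>
```

Even though "maybe" has the highest logit in both steps, it is never selected because the constraint removes it from the candidate set.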

Sources: LLama/Sampling/DefaultSamplingPipeline.cs103-125 LLama/Sampling/DefaultSamplingPipeline.cs160-168 LLama.Examples/Examples/BatchedExecutorBoolQ.cs13 LLama/Sampling/GreedySamplingPipeline.cs14-22