URL: https://deepwiki.com/SciSharp/LLamaSharp/4.5-token-streaming-and-decoding

Token Streaming and Decoding | SciSharp/LLamaSharp | DeepWiki

Last indexed: 18 May 2026 (ecd184)

Token Streaming and Decoding

This page documents the token processing and streaming components in LLamaSharp. These components sit between raw model output (logits) and the final text delivered to callers.

Scope:

| Component | Key Types | Role | Source |
|---|---|---|---|
| Token-to-text decoding | StreamingTokenDecoder | Converts LLamaToken IDs incrementally to UTF-8 text, preserving state across multi-byte character boundaries | LLama/StreamingTokenDecoder.cs:14-15 |
| Output post-processing | AntipromptProcessor | Logic for detecting stop sequences in the decoded string stream | LLama/AntipromptProcessor.cs:9-10 |
| Inference Parameters | InferenceParams | Configures max tokens, antiprompts, and decoding behavior | LLama/Common/InferenceParams.cs:12-51 |
| Context Management | ContextOverflowStrategy | Defines behavior when the context window is full during streaming | LLama/Common/ContextOverflowStrategy.cs:11-25 |

Overview: Detokenization Architecture

Detokenization converts LLamaToken values into UTF-8 text. The primary challenge is that token boundaries do not align with character boundaries: a single multi-byte UTF-8 character may be split across multiple tokens. The StreamingTokenDecoder class solves this with a stateful System.Text.Decoder, which carries incomplete byte sequences over from one token addition to the next.

Diagram: Detokenization Flow from Tokens to Text

Sources: LLama/StreamingTokenDecoder.cs:14-29, LLama/StreamingTokenDecoder.cs:77-104, LLama/StreamingTokenDecoder.cs:116-135

StreamingTokenDecoder Class

The StreamingTokenDecoder class is the primary API for converting tokens to text in LLamaSharp. It maintains internal state to correctly handle multi-byte UTF-8 characters split across token boundaries.

Core API Methods

| Method | Purpose | Source |
|---|---|---|
| StreamingTokenDecoder(LLamaContext) | Constructor retrieving encoding and model weights from context | LLama/StreamingTokenDecoder.cs:46-49 |
| Add(LLamaToken token) | Add a single token to decode. Handles byte conversion and character buffering | LLama/StreamingTokenDecoder.cs:77-135 |
| AddRange(IEnumerable<LLamaToken> tokens) | Add multiple tokens in batch | LLama/StreamingTokenDecoder.cs:150-155 |
| Read() | Return all decoded text accumulated so far as a string and clear the buffer | LLama/StreamingTokenDecoder.cs:181-194 |
| Reset() | Clear internal buffers and reset the stateful decoder | LLama/StreamingTokenDecoder.cs:199-203 |

Sources: LLama/StreamingTokenDecoder.cs:31-71, LLama/StreamingTokenDecoder.cs:141-204

Internal State Management

The decoder maintains two critical pieces of state:

  1. Character Buffer (List<char> _characters): Accumulated characters that have been successfully decoded but not yet read (LLama/StreamingTokenDecoder.cs:19).
  2. Stateful Decoder (Decoder _decoder): A System.Text.Decoder instance obtained via encoding.GetDecoder() (LLama/StreamingTokenDecoder.cs:69). It preserves partial byte sequences (e.g., the first 2 bytes of a 4-byte emoji) between calls to Add().

Sources: LLama/StreamingTokenDecoder.cs:14-20, LLama/StreamingTokenDecoder.cs:67-70
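
The effect of the stateful decoder can be reproduced with the .NET base class library alone. The following is a minimal sketch, not LLamaSharp's implementation: the class and method names (SplitUtf8Demo, DecodeChunks) are hypothetical, and the byte chunks stand in for the raw bytes of consecutive tokens.

```csharp
using System;
using System.Text;

public static class SplitUtf8Demo
{
    // Feeds each chunk to ONE shared Decoder and returns the text
    // produced after each chunk. The chunks stand in for token bytes.
    public static string[] DecodeChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var results = new string[chunks.Length];

        for (int i = 0; i < chunks.Length; i++)
        {
            byte[] bytes = chunks[i];

            // Number of chars this chunk will yield given the decoder's
            // current state (this call does not change that state).
            int charCount = decoder.GetCharCount(bytes, 0, bytes.Length, false);
            var chars = new char[charCount];

            // flush: false keeps an incomplete trailing byte sequence
            // inside the decoder until the remaining bytes arrive.
            int written = decoder.GetChars(bytes, 0, bytes.Length, chars, 0, false);
            results[i] = new string(chars, 0, written);
        }

        return results;
    }
}
```

Calling DecodeChunks(new byte[] { 0xF0, 0x9F }, new byte[] { 0x98, 0x80 }) splits the four UTF-8 bytes of U+1F600 across two chunks. The first chunk produces an empty string, and the second produces the complete character, because the decoder held the first two bytes in between.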

Token-to-Byte Conversion

Before UTF-8 decoding can occur, tokens must be converted to their raw byte representation. This is handled by the private TokenToBytes helper within StreamingTokenDecoder, which calls TokenToSpan on the native model handle (LLama/StreamingTokenDecoder.cs:116-134).

Native Piece Extraction

The decoder uses SafeLlamaModelHandle.TokenToSpan to retrieve the raw bytes for a specific token ID. If the temporary buffer is too small, it obtains a larger one with ArrayPool<byte>.Shared.Rent (LLama/StreamingTokenDecoder.cs:122-126).

Diagram: Token-to-Byte Resolution

Sources: LLama/StreamingTokenDecoder.cs:116-134
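
The grow-and-retry pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: WritePiece is a hypothetical stand-in for the native call, and its contract (always return the required length, copy only when the destination is large enough) is assumed for the example.

```csharp
using System;
using System.Buffers;

public static class RentAndRetryDemo
{
    // Hypothetical stand-in for the native call. Assumed contract:
    // returns the number of bytes the piece needs, and copies them
    // only if the destination is large enough.
    public static int WritePiece(byte[] piece, Span<byte> destination)
    {
        if (piece.Length <= destination.Length)
            piece.AsSpan().CopyTo(destination);
        return piece.Length;
    }

    public static byte[] GetPieceBytes(byte[] piece, int initialSize = 4)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(initialSize);
        try
        {
            int needed = WritePiece(piece, buffer);

            if (needed > buffer.Length)
            {
                // Too small: rent a larger buffer, hand the old one back,
                // then retry. The second attempt is guaranteed to fit.
                byte[] larger = ArrayPool<byte>.Shared.Rent(needed);
                ArrayPool<byte>.Shared.Return(buffer);
                buffer = larger;
                needed = WritePiece(piece, buffer);
            }

            // Copy out only the bytes that were written; a rented array
            // may be longer than requested.
            return buffer.AsSpan(0, needed).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Note that Rent may return an array longer than the requested size, so the length check uses buffer.Length rather than the size that was asked for.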

Real-Time Token Output (Inference)

In the context of an executor, streaming is achieved using IAsyncEnumerable<string>. This pattern allows consumers to process text as it is generated rather than waiting for the entire response.
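
A minimal sketch of the consumer side of this pattern is shown below. FakeInferAsync is a hypothetical producer standing in for an executor, and ConsumeAsync is an illustrative helper; neither name comes from LLamaSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class StreamingDemo
{
    // Hypothetical producer: yields text pieces one at a time, the way
    // a streaming executor would yield decoded text.
    public static async IAsyncEnumerable<string> FakeInferAsync()
    {
        foreach (var piece in new[] { "Hel", "lo", ", ", "world" })
        {
            await Task.Yield(); // simulate asynchronous generation
            yield return piece;
        }
    }

    // Handles each piece as soon as it arrives and also returns the
    // concatenated response once the stream completes.
    public static async Task<string> ConsumeAsync(
        IAsyncEnumerable<string> stream, Action<string> onPiece)
    {
        var full = new StringBuilder();
        await foreach (var piece in stream)
        {
            onPiece(piece);
            full.Append(piece);
        }
        return full.ToString();
    }
}
```

A caller might pass Console.Write as onPiece to print text incrementally while still receiving the full response at the end.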

Stop Sequence Detection (AntipromptProcessor)

The AntipromptProcessor monitors the decoded stream for "stop sequences" (antiprompts) (LLama/AntipromptProcessor.cs:9-10).

| Feature | Description | Source |
|---|---|---|
| Add(string text) | Adds newly decoded text to an internal buffer and checks for matches | LLama/AntipromptProcessor.cs:58-75 |
| Buffer Trimming | Automatically trims the internal buffer to prevent memory growth while keeping enough context for the longest antiprompt | LLama/AntipromptProcessor.cs:65-68 |
| Case Sensitivity | Matches are performed using StringComparison.CurrentCulture, which is culture-sensitive and case-sensitive | LLama/AntipromptProcessor.cs:71-72 |

Sources: LLama/AntipromptProcessor.cs:9-76
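
The behavior listed above (buffering, trimming, and a suffix match) can be sketched roughly as follows. This is a simplified illustration, not the AntipromptProcessor source: the class name StopSequenceDetector is hypothetical, and the trimming threshold of twice the longest stop sequence is an assumption chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class StopSequenceDetector
{
    private readonly List<string> _stops;
    private readonly int _longest;
    private string _buffer = "";

    public StopSequenceDetector(IEnumerable<string> stops)
    {
        _stops = stops.ToList();
        _longest = _stops.Count == 0 ? 0 : _stops.Max(s => s.Length);
    }

    // Appends newly decoded text and reports whether the stream now
    // ends with any stop sequence.
    public bool Add(string text)
    {
        _buffer += text;

        // Assumed policy: keep only the tail of the buffer. Twice the
        // longest stop sequence is enough to match a sequence that
        // arrives split across several Add calls.
        int keep = _longest * 2;
        if (_buffer.Length > keep)
            _buffer = _buffer.Substring(_buffer.Length - keep);

        foreach (var stop in _stops)
        {
            // Culture-sensitive and case-sensitive comparison.
            if (_buffer.EndsWith(stop, StringComparison.CurrentCulture))
                return true;
        }
        return false;
    }
}
```

Because the buffer persists between calls, a stop sequence such as "User:" is still detected when it arrives as "Us" in one piece and "er:" in the next.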

Context Overflow Management

During streaming, if the KV cache fills up, the executor's behavior is governed by ContextOverflowStrategy (LLama/Common/ContextOverflowStrategy.cs:11-12).

Sources: LLama/Common/ContextOverflowStrategy.cs:11-25, LLama/Common/InferenceParams.cs:38-51, LLama/Exceptions/ContextOverflowException.cs:10-38

Performance and Best Practices

Memory Efficiency

StreamingTokenDecoder uses ArrayPool<byte>.Shared.Rent and ArrayPool<char>.Shared.Rent to minimize allocations when converting tokens to bytes and then to characters (LLama/StreamingTokenDecoder.cs:79-80). Buffers are returned to the pool in a finally block (LLama/StreamingTokenDecoder.cs:106-110), so they are released even if decoding throws.
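
As a rough sketch of the rent-and-return discipline on the character side, the helper below decodes bytes into a pooled char buffer and releases it in a finally block. The names PooledDecodeDemo and AppendDecoded are hypothetical, and the StringBuilder output is an assumption made to keep the example self-contained.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class PooledDecodeDemo
{
    // Decodes bytes with a caller-owned stateful Decoder, using a pooled
    // char buffer as scratch space instead of allocating a new array.
    public static void AppendDecoded(
        Decoder decoder, ReadOnlySpan<byte> bytes, StringBuilder output)
    {
        // Number of chars this call will produce given the decoder's
        // current state, including any bytes it is already holding.
        int charCount = decoder.GetCharCount(bytes, false);

        char[] chars = ArrayPool<char>.Shared.Rent(charCount);
        try
        {
            int written = decoder.GetChars(bytes, chars, false);
            output.Append(chars, 0, written);
        }
        finally
        {
            // Always hand the buffer back, even if decoding throws.
            ArrayPool<char>.Shared.Return(chars);
        }
    }
}
```

The decoder is passed in rather than created inside the helper so that partial byte sequences survive between calls.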

Handling Special Tokens

The DecodeSpecialTokens property, available on both InferenceParams and StreamingTokenDecoder, determines whether control tokens (such as BOS/EOS) are rendered as text or omitted from the output (LLama/Common/InferenceParams.cs:35, LLama/StreamingTokenDecoder.cs:29).

Diagram: Decoding Configuration to Native Flow

Sources: LLama/StreamingTokenDecoder.cs:79-110, LLama/StreamingTokenDecoder.cs:116-119, LLama/Common/InferenceParams.cs:35