Last indexed: 18 May 2026 (ecd184)

Token Streaming and Decoding

This page documents the token processing and streaming components in LLamaSharp. These components sit between raw model output (logits) and the final text delivered to callers.

Scope:

Component	Key Types	Role
Token-to-text decoding	`StreamingTokenDecoder`	Incrementally converts `LLamaToken` IDs to text, preserving decoder state when a multi-byte UTF-8 character spans several tokens LLama/StreamingTokenDecoder.cs14-15
Output post-processing	`AntipromptProcessor`	Detects stop sequences (antiprompts) in the decoded text stream LLama/AntipromptProcessor.cs9-10
Inference parameters	`InferenceParams`	Configures max tokens, antiprompts, and decoding behavior LLama/Common/InferenceParams.cs12-51
Context management	`ContextOverflowStrategy`	Defines behavior when the context window is full during streaming LLama/Common/ContextOverflowStrategy.cs11-25

Overview: Detokenization Architecture

Detokenization converts LLamaToken values into text. The main difficulty is that token boundaries do not align with character boundaries: a single multi-byte UTF-8 character may be split across several tokens, so decoding each token in isolation can produce invalid output. The StreamingTokenDecoder class solves this with a stateful System.Text.Decoder that carries incomplete byte sequences over from one added token to the next.

Title: Detokenization Flow from Tokens to Text

Sources: LLama/StreamingTokenDecoder.cs14-29 LLama/StreamingTokenDecoder.cs77-104 LLama/StreamingTokenDecoder.cs116-135
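
The boundary problem described in the overview can be reproduced with nothing but System.Text. The sketch below is not LLamaSharp code: SplitCharacterDemo and its two methods are hypothetical names, and the byte chunks stand in for the raw bytes of consecutive tokens. It contrasts decoding each chunk independently with decoding through one shared Decoder.

```csharp
using System;
using System.Text;

// Hypothetical demo type; only the System.Text calls are real APIs.
public static class SplitCharacterDemo
{
    // Decodes every chunk on its own, so a character split across chunks is lost.
    public static string DecodeStateless(params byte[][] chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
            sb.Append(Encoding.UTF8.GetString(chunk));
        return sb.ToString();
    }

    // Shares one Decoder across chunks, so trailing partial bytes carry over.
    public static string DecodeStateful(params byte[][] chunks)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            // flush: false tells the decoder that more bytes may follow.
            var chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length, flush: false)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, flush: false);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }
}
```

U+1F600 encodes to the four bytes F0 9F 98 80. Passing them as two chunks (F0 9F, then 98 80) to DecodeStateful yields the single emoji, while DecodeStateless on the same chunks yields U+FFFD replacement characters instead.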

StreamingTokenDecoder Class

The StreamingTokenDecoder class is the primary API for converting tokens to text in LLamaSharp. It maintains internal state to correctly handle multi-byte UTF-8 characters split across token boundaries.

Core API Methods

Method	Purpose
`StreamingTokenDecoder(LLamaContext)`	Constructor. Takes the text encoding and model weights from the supplied context LLama/StreamingTokenDecoder.cs46-49
`Add(LLamaToken token)`	Adds a single token to decode. Handles byte conversion and character buffering LLama/StreamingTokenDecoder.cs77-135
`AddRange(IEnumerable<LLamaToken> tokens)`	Adds multiple tokens in one call LLama/StreamingTokenDecoder.cs150-155
`Read()`	Returns all text decoded so far as a string and clears the buffer LLama/StreamingTokenDecoder.cs181-194
`Reset()`	Clears the internal buffers and resets the stateful decoder LLama/StreamingTokenDecoder.cs199-203

Sources: LLama/StreamingTokenDecoder.cs31-71 LLama/StreamingTokenDecoder.cs141-204

Internal State Management

The decoder maintains two critical pieces of state:

Character buffer (List<char> _characters): characters that have been successfully decoded but not yet read LLama/StreamingTokenDecoder.cs19
Stateful decoder (Decoder _decoder): a System.Text.Decoder instance obtained via encoding.GetDecoder(). It preserves partial byte sequences (e.g., the first 2 bytes of a 4-byte emoji) between calls to Add() LLama/StreamingTokenDecoder.cs69

Sources: LLama/StreamingTokenDecoder.cs14-20 LLama/StreamingTokenDecoder.cs67-70
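
To make the interplay of these two fields concrete, here is a minimal byte-level analogue. It is a sketch under assumptions, not the library's implementation: ByteStreamDecoder is a hypothetical class that accepts raw bytes instead of tokens, and only the field names mirror the ones listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical byte-level analogue of the state described above.
public sealed class ByteStreamDecoder
{
    private readonly List<char> _characters = new List<char>(); // decoded, not yet read
    private readonly Decoder _decoder;                           // holds incomplete byte sequences

    public ByteStreamDecoder(Encoding encoding)
    {
        _decoder = encoding.GetDecoder();
    }

    public int AvailableCharacters => _characters.Count;

    public void Add(byte[] bytes)
    {
        var chars = new char[_decoder.GetCharCount(bytes, 0, bytes.Length, flush: false)];
        int written = _decoder.GetChars(bytes, 0, bytes.Length, chars, 0, flush: false);
        for (int i = 0; i < written; i++)
            _characters.Add(chars[i]);
    }

    // Returns everything decoded so far and empties the character buffer.
    public string Read()
    {
        var text = new string(_characters.ToArray());
        _characters.Clear();
        return text;
    }

    // Drops buffered characters and any half-received byte sequence.
    public void Reset()
    {
        _decoder.Reset();
        _characters.Clear();
    }
}
```

Adding only the first two bytes of a four-byte character leaves AvailableCharacters at zero, because the bytes are held inside the Decoder rather than in the character buffer. Calling Reset at that point discards them, so a following ASCII byte decodes cleanly on its own.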

Token-to-Byte Conversion

Before UTF-8 decoding can occur, tokens must be converted to their raw byte representation. This is handled by the private TokenToBytes helper within StreamingTokenDecoder, which interacts with the native model handle via TokenToSpan LLama/StreamingTokenDecoder.cs116-134

Native Piece Extraction

The decoder calls SafeLlamaModelHandle.TokenToSpan to retrieve the raw bytes for a given token ID. If the temporary buffer is too small to hold them, it rents a larger one with ArrayPool<byte>.Shared.Rent LLama/StreamingTokenDecoder.cs122-126

Title: Token-to-Byte Resolution

Sources: LLama/StreamingTokenDecoder.cs116-134
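
The grow-on-demand buffer pattern can be sketched as follows. Everything here is illustrative: PieceWriter is a hypothetical delegate standing in for the native call, and its contract (return the number of bytes required, fill the destination only when it is large enough) is an assumption, not the documented behavior of TokenToSpan.

```csharp
using System;
using System.Buffers;

public static class TokenBytes
{
    // Hypothetical stand-in for a native "token to span" call (assumed contract):
    // returns the number of bytes the piece needs and fills `destination`
    // only when it is large enough.
    public delegate int PieceWriter(int token, Span<byte> destination);

    public static byte[] GetBytes(int token, PieceWriter write, int initialSize = 4)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(initialSize);
        try
        {
            int needed = write(token, buffer);
            if (needed > buffer.Length)
            {
                // Too small: rent a buffer of the required size, hand back the
                // old one, and ask for the bytes again.
                byte[] larger = ArrayPool<byte>.Shared.Rent(needed);
                ArrayPool<byte>.Shared.Return(buffer);
                buffer = larger;
                needed = write(token, buffer);
            }

            // Copy out only the bytes that were written; rented arrays may be
            // longer than requested.
            return buffer.AsSpan(0, needed).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Note that Rent may return an array longer than the requested size, which is why the result is sliced to the reported length rather than to the buffer length.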

Real-Time Token Output (Inference)

Executors stream their output as an IAsyncEnumerable<string>. Consumers can therefore process each piece of text as soon as it is generated instead of waiting for the entire response.
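
A minimal sketch of this producer/consumer shape, using only standard .NET types. StreamingDemo, GenerateAsync, and ConsumeAsync are hypothetical names; the producer simply replays a fixed list of chunks where a real executor would yield decoded text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class StreamingDemo
{
    // Hypothetical producer: yields text chunks one at a time.
    public static async IAsyncEnumerable<string> GenerateAsync(IEnumerable<string> chunks)
    {
        foreach (var chunk in chunks)
        {
            await Task.Yield(); // stands in for the work of producing the next token
            yield return chunk;
        }
    }

    // Consumer: handles each chunk as it arrives and also returns the full text.
    public static async Task<string> ConsumeAsync(
        IAsyncEnumerable<string> stream, Action<string> onChunk)
    {
        var full = new StringBuilder();
        await foreach (var chunk in stream)
        {
            onChunk(chunk);
            full.Append(chunk);
        }
        return full.ToString();
    }
}
```

A caller could pass Console.Write as onChunk to print text while generation is still in progress.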

Stop Sequence Detection (AntipromptProcessor)

The AntipromptProcessor monitors the decoded stream for "stop sequences" (antiprompts) LLama/AntipromptProcessor.cs9-10

Feature	Description
`Add(string text)`	Adds newly decoded text to an internal buffer and checks for matches LLama/AntipromptProcessor.cs58-75
Buffer Trimming	Automatically trims the internal buffer to prevent memory growth while keeping enough context for the longest antiprompt LLama/AntipromptProcessor.cs65-68
Comparison	Matches are performed using `StringComparison.CurrentCulture`, which is culture-sensitive and case-sensitive LLama/AntipromptProcessor.cs71-72

Sources: LLama/AntipromptProcessor.cs9-76
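
The three behaviors in the table fit together roughly as in the sketch below. StopSequenceDetector is a hypothetical class, and the trimming thresholds (trim once the buffer exceeds four times the longest stop sequence, keep twice its length) are assumptions chosen for illustration rather than values taken from this page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stop-sequence detector; thresholds are assumptions.
public sealed class StopSequenceDetector
{
    private readonly List<string> _stops;
    private readonly int _longest;
    private string _buffer = "";

    public StopSequenceDetector(IEnumerable<string> stops)
    {
        _stops = stops.ToList();
        _longest = _stops.Count == 0 ? 0 : _stops.Max(s => s.Length);
    }

    public int BufferLength => _buffer.Length;

    // Returns true once the accumulated text ends with any stop sequence.
    public bool Add(string text)
    {
        _buffer += text;

        // Keep only a tail of the stream: long enough to hold the longest stop
        // sequence, short enough that the buffer cannot grow without bound.
        int maxLength = Math.Max(32, _longest * 4);
        int keepLength = Math.Max(16, _longest * 2);
        if (_buffer.Length > maxLength)
            _buffer = _buffer.Substring(_buffer.Length - keepLength);

        foreach (var stop in _stops)
            if (_buffer.EndsWith(stop, StringComparison.CurrentCulture))
                return true;
        return false;
    }
}
```

Because text is appended to a buffer before matching, a stop sequence that arrives split across chunks ("Us" followed by "er:") is still detected. With a case-sensitive comparison, "user:" does not match the stop sequence "User:".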

Context Overflow Management

During streaming, if the KV cache fills up, the executor's behavior is governed by ContextOverflowStrategy LLama/Common/ContextOverflowStrategy.cs11-12

ThrowException: The engine throws a ContextOverflowException. This is equivalent to llama-cli's --no-context-shift LLama/Exceptions/ContextOverflowException.cs10-12 LLama/Common/ContextOverflowStrategy.cs18
TruncateAndReprefill: The engine silently drops a percentage (defined by ContextTruncationPercentage) of the oldest tokens and re-prefills the context LLama/Common/ContextOverflowStrategy.cs24 LLama/Common/InferenceParams.cs49-50

Sources: LLama/Common/ContextOverflowStrategy.cs11-25 LLama/Common/InferenceParams.cs38-51 LLama/Exceptions/ContextOverflowException.cs10-38
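
The difference between the two strategies can be modeled on a plain token list. This is a sketch with hypothetical names throughout (OverflowStrategy, ContextOverflow.Handle), and it assumes the truncation percentage is expressed on a 0 to 100 scale; it uses InvalidOperationException as a stand-in for the library's exception and does not model the re-prefill step itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names: this models the two behaviours, not the executor code.
public enum OverflowStrategy { ThrowException, TruncateAndReprefill }

public static class ContextOverflow
{
    // Returns the tokens to keep, i.e. the ones that would be re-evaluated
    // into an emptied context. Assumes truncationPercentage is in 0..100.
    public static List<int> Handle(
        IReadOnlyList<int> tokens, int contextSize,
        OverflowStrategy strategy, double truncationPercentage)
    {
        if (tokens.Count < contextSize)
            return new List<int>(tokens); // still room: nothing to do

        if (strategy == OverflowStrategy.ThrowException)
            throw new InvalidOperationException("Context window is full.");

        // Drop the oldest N% of tokens (at least one) and keep the rest.
        int drop = Math.Max(1, (int)(tokens.Count * truncationPercentage / 100.0));
        drop = Math.Min(drop, tokens.Count);

        var kept = new List<int>(tokens.Count - drop);
        for (int i = drop; i < tokens.Count; i++)
            kept.Add(tokens[i]);
        return kept;
    }
}
```

With ten tokens in a context of size ten and a truncation percentage of 30, the three oldest tokens are dropped and the remaining seven are kept in order.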

Performance and Best Practices

Memory Efficiency

StreamingTokenDecoder rents its temporary buffers with ArrayPool<byte>.Shared.Rent and ArrayPool<char>.Shared.Rent, which minimizes allocations while tokens are converted to bytes and then to characters. The buffers are returned to the pool in a finally block, so they are released even if decoding throws LLama/StreamingTokenDecoder.cs79-80 LLama/StreamingTokenDecoder.cs106-110

Handling Special Tokens

The DecodeSpecialTokens property, present on both InferenceParams and StreamingTokenDecoder, determines whether control tokens (such as BOS/EOS) are rendered as text or omitted from the output LLama/Common/InferenceParams.cs35 LLama/StreamingTokenDecoder.cs29

Title: Decoding Configuration to Native Flow

Sources: LLama/StreamingTokenDecoder.cs79-110 LLama/StreamingTokenDecoder.cs116-119 LLama/Common/InferenceParams.cs35
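
The effect of the DecodeSpecialTokens switch can be illustrated with a toy vocabulary. SpecialTokenDemo and its Piece type are hypothetical; in the library the decision is made while token bytes are retrieved, whereas this sketch simply filters already-rendered pieces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SpecialTokenDemo
{
    // Hypothetical vocabulary entry: the piece text plus a control-token flag.
    public readonly struct Piece
    {
        public Piece(string text, bool isControl)
        {
            Text = text;
            IsControl = isControl;
        }

        public string Text { get; }
        public bool IsControl { get; }
    }

    public static string Render(IEnumerable<Piece> pieces, bool decodeSpecialTokens)
    {
        var sb = new StringBuilder();
        foreach (var piece in pieces)
        {
            // Control tokens contribute text only when special-token decoding is on.
            if (piece.IsControl && !decodeSpecialTokens)
                continue;
            sb.Append(piece.Text);
        }
        return sb.ToString();
    }
}
```

For the pieces "<s>", "Hi", "</s>" with the first and last flagged as control tokens, Render returns "Hi" when decodeSpecialTokens is false and "<s>Hi</s>" when it is true.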


URL: https://deepwiki.com/SciSharp/LLamaSharp/4.5-token-streaming-and-decoding