Last indexed: 18 May 2026 (ecd184)

Kernel Memory Integration

The Kernel Memory integration package (LLamaSharp.kernel-memory) enables LLamaSharp to serve as a local backend for Microsoft Kernel Memory, a framework for building Retrieval-Augmented Generation (RAG) pipelines. This integration provides implementations of Kernel Memory's ITextEmbeddingGenerator and ITextGenerator interfaces, allowing LLamaSharp models to perform both embedding generation for document indexing and text generation for response synthesis.

For orchestration-focused Semantic Kernel integration, see Semantic Kernel Integration. For general RAG concepts using LLamaSharp embeddings without Kernel Memory, see Text Embeddings.

Architecture Overview

The integration layer translates between Kernel Memory's abstraction interfaces and LLamaSharp's core components. Two primary adapter classes implement the required interfaces, delegating to LLamaEmbedder and StatelessExecutor respectively.

Component Architecture

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:13-135, LLama.KernelMemory/LlamaSharpTextGenerator.cs:13-154, LLama.KernelMemory/BuilderExtensions.cs:11-94

Text Embedding Generation

The LLamaSharpTextEmbeddingGenerator class implements ITextEmbeddingGenerator to provide document and query embeddings for Kernel Memory's indexing and retrieval operations.

Implementation Architecture

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:13-135

Key Configuration Parameters

The embedding generator applies specific ModelParams configuration optimized for embedding tasks:

Parameter	Value	Purpose	Source
`PoolingType`	`Mean`	Averages token embeddings into a single document vector	LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:45
`FlashAttention`	`true`	Enables optimized attention computation	LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:43
`BatchSize`	512	Large logical batch for efficient embedding computation	LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:41
`UBatchSize`	512	Matches the logical batch size for the embedding workload	LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:42
`UseMemorymap`	`true`	Memory-maps the model file for efficient loading	LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:44

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:35-46, LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:63-74
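To make the `PoolingType` = `Mean` row concrete, the following is a minimal, self-contained sketch of mean pooling. It is an illustration only: in LLamaSharp the pooling is carried out by the native backend, and the `MeanPoolingSketch` class and `MeanPool` method are hypothetical names that do not exist in the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not part of LLamaSharp.
public static class MeanPoolingSketch
{
    // Averages per-token embedding vectors into one vector of the same dimension.
    public static float[] MeanPool(IReadOnlyList<float[]> tokenEmbeddings)
    {
        if (tokenEmbeddings.Count == 0)
            throw new ArgumentException("At least one token embedding is required.", nameof(tokenEmbeddings));

        var dim = tokenEmbeddings[0].Length;
        var pooled = new float[dim];

        // Sum each dimension across all tokens.
        foreach (var vector in tokenEmbeddings)
        {
            if (vector.Length != dim)
                throw new ArgumentException("All token embeddings must have the same dimension.", nameof(tokenEmbeddings));

            for (var i = 0; i < dim; i++)
                pooled[i] += vector[i];
        }

        // Divide by the token count to obtain the mean.
        for (var i = 0; i < dim; i++)
            pooled[i] /= tokenEmbeddings.Count;

        return pooled;
    }
}
```

The result has the same dimension as each token vector, which is why a document of any length maps to one fixed-size embedding.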

Tokenization Support

Both CountTokens and GetTokens methods are implemented to satisfy Kernel Memory's tokenization requirements for chunking and context management:

CountTokens: Uses LLamaWeights.Tokenize() with addBos=true and special=true for accurate token counting (LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:115-118)
GetTokens: Tokenizes the text and decodes each token with StreamingTokenDecoder to obtain the individual token strings (LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:129-134)

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:115-134

Text Generation

The LlamaSharpTextGenerator class implements ITextGenerator for generating responses in RAG pipelines, using StatelessExecutor for stateless request-response patterns.

Implementation Architecture

Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:13-154

Parameter Conversion

The OptionsToParams method translates Kernel Memory's TextGenerationOptions to LLamaSharp's InferenceParams:

StopSequences: Concatenated with the default AntiPrompts (LLama.KernelMemory/LlamaSharpTextGenerator.cs:100)
MaxTokens: Overrides the default if specified (LLama.KernelMemory/LlamaSharpTextGenerator.cs:101)
SamplingPipeline: A new DefaultSamplingPipeline is created that maps Temperature, FrequencyPenalty, PresencePenalty, and NucleusSampling (as TopP) (LLama.KernelMemory/LlamaSharpTextGenerator.cs:103-109)

Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:94-126
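The mapping rules above can be sketched without either library. The record types below are hypothetical stand-ins for Kernel Memory's TextGenerationOptions and LLamaSharp's InferenceParams and DefaultSamplingPipeline; they carry only the fields named in this section. The numeric types and the order in which the stop sequences are appended are assumptions, so treat this as an outline of the conversion rather than the actual OptionsToParams code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Kernel Memory's TextGenerationOptions.
public sealed record GenerationOptionsSketch(
    double Temperature,
    double NucleusSampling,
    double PresencePenalty,
    double FrequencyPenalty,
    IReadOnlyList<string> StopSequences,
    int? MaxTokens);

// Hypothetical stand-in for the sampling settings held by DefaultSamplingPipeline.
public sealed record SamplingSketch(float Temperature, float TopP, float PresencePenalty, float FrequencyPenalty);

// Hypothetical stand-in for LLamaSharp's InferenceParams.
public sealed record InferenceParamsSketch(IReadOnlyList<string> AntiPrompts, int MaxTokens, SamplingSketch Sampling);

public static class OptionsMappingSketch
{
    public static InferenceParamsSketch OptionsToParams(GenerationOptionsSketch options, InferenceParamsSketch defaults)
    {
        return new InferenceParamsSketch(
            // Default anti-prompts are kept; request stop sequences are appended (order assumed).
            AntiPrompts: defaults.AntiPrompts.Concat(options.StopSequences).ToList(),
            // A request-level limit wins; otherwise the default applies.
            MaxTokens: options.MaxTokens ?? defaults.MaxTokens,
            // A fresh sampling configuration is built per request; NucleusSampling becomes TopP.
            Sampling: new SamplingSketch(
                Temperature: (float)options.Temperature,
                TopP: (float)options.NucleusSampling,
                PresencePenalty: (float)options.PresencePenalty,
                FrequencyPenalty: (float)options.FrequencyPenalty));
    }
}
```

Building a new sampling configuration for every call keeps per-request options from leaking into the shared defaults.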

Stateless Execution Pattern

The generator uses StatelessExecutor rather than a stateful executor because Kernel Memory manages conversation state externally. Each request is independent and runs in a fresh inference context (LLama.KernelMemory/LlamaSharpTextGenerator.cs:16).

Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:31-48, LLama.KernelMemory/LlamaSharpTextGenerator.cs:57-75

Configuration

The LLamaSharpConfig class provides centralized configuration for both embedding and text generation components.

Configuration Properties

Property	Type	Description	Default	Source
`ModelPath`	`string`	Path to GGUF model file	Required	LLama.KernelMemory/LlamaSharpConfig.cs:23
`ContextSize`	`uint?`	Maximum context window size	2048	LLama.KernelMemory/LlamaSharpConfig.cs:28
`GpuLayerCount`	`int?`	Number of layers to offload to GPU	20	LLama.KernelMemory/LlamaSharpConfig.cs:33
`MainGpu`	`int`	Primary GPU device ID	0	LLama.KernelMemory/LlamaSharpConfig.cs:53
`SplitMode`	`GPUSplitMode`	Multi-GPU distribution strategy	`None`	LLama.KernelMemory/LlamaSharpConfig.cs:59
`DefaultInferenceParams`	`InferenceParams?`	Default sampling parameters for generation	`null`	LLama.KernelMemory/LlamaSharpConfig.cs:64

Sources: LLama.KernelMemory/LlamaSharpConfig.cs:9-65

GPU Split Mode Behavior

The interpretation of the MainGpu property depends on SplitMode (LLama.KernelMemory/LlamaSharpConfig.cs:36-50):

None: MainGpu is the GPU used for the entire model.
Row: MainGpu is the GPU used for small tensors and intermediate results.
Layer: MainGpu is ignored.

Sources: LLama.KernelMemory/LlamaSharpConfig.cs:36-53
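As a rough illustration of how these properties fit together, a configuration for a single-GPU setup might look like the sketch below. The property names come from the table above. The constructor that takes the model path, the `LLamaSharp.KernelMemory` and `LLama.Native` namespaces, the model path, and all values are assumptions made for this example and should be checked against the installed package version.

```csharp
using LLama.Native;
using LLamaSharp.KernelMemory;

// Hypothetical model path; point this at a real GGUF file.
var config = new LLamaSharpConfig("models/model.gguf")
{
    // Illustrative values, not recommendations.
    ContextSize = 4096,
    GpuLayerCount = 20,

    // With SplitMode = None, MainGpu selects the device that holds the entire model.
    SplitMode = GPUSplitMode.None,
    MainGpu = 0,
};
```

With `GPUSplitMode.Layer` the `MainGpu` value would be ignored, as described above, so it only needs to be set for the `None` and `Row` modes.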

Builder Extensions

The BuilderExtensions class provides fluent API extensions for IKernelMemoryBuilder to register LLamaSharp components.

Extension Methods Overview

Sources: LLama.KernelMemory/BuilderExtensions.cs:11-94

Resource Sharing with WithLLamaSharpDefaults

The WithLLamaSharpDefaults method optimizes resource usage by sharing LLamaWeights between embedding and generation components:

Creates or reuses LLamaWeights from the model file (LLama.KernelMemory/BuilderExtensions.cs:86)
Creates a shared ModelParams configuration (LLama.KernelMemory/BuilderExtensions.cs:72-82)
Instantiates a StatelessExecutor with the shared weights (LLama.KernelMemory/BuilderExtensions.cs:89)
Registers LLamaSharpTextEmbeddingGenerator, sharing the weights (LLama.KernelMemory/BuilderExtensions.cs:90)
Registers LlamaSharpTextGenerator, sharing the weights and the executor (LLama.KernelMemory/BuilderExtensions.cs:91)

Sources: LLama.KernelMemory/BuilderExtensions.cs:70-93
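From the caller's side, the steps above collapse into a single builder call. The sketch below assumes the `LLamaSharp.kernel-memory` and `Microsoft.KernelMemory` packages are referenced, that `WithLLamaSharpDefaults` accepts an `LLamaSharpConfig`, and that the default in-process Kernel Memory setup is sufficient. The model path, document text, and question are placeholders, and the exact overloads may differ between versions.

```csharp
using System;
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// Hypothetical model path; the same weights serve embeddings and generation.
var config = new LLamaSharpConfig("models/model.gguf");

// Registers both the embedding generator and the text generator on the builder.
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build();

// Index a small piece of text, then ask a question that retrieves it.
await memory.ImportTextAsync("LLamaSharp runs GGUF models locally.", documentId: "doc-1");
var answer = await memory.AskAsync("Where does LLamaSharp run models?");

Console.WriteLine(answer.Result);
```

Because the weights are loaded once and shared, this path avoids holding two copies of the model in memory, which is the main reason to prefer it over registering the two generators separately.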

Resource Management

The integration implements resource ownership tracking to prevent premature disposal when components share underlying native resources.

Ownership Tracking Pattern

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:90-100, LLama.KernelMemory/LlamaSharpTextGenerator.cs:78-84

Constructor-Based Ownership

The ownership model is determined at construction time:

Constructor Signature	Owns Weights	Owns Embedder
`LLamaSharpTextEmbeddingGenerator(LLamaSharpConfig)`	✓ (LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:50-51)	✓
`LLamaSharpTextEmbeddingGenerator(LLamaSharpConfig, LLamaWeights)`	✗	✓ (LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:77)
`LLamaSharpTextEmbeddingGenerator(LLamaEmbedder)`	✗	✗

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:28-88
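The constructor-based ownership rule in the table can be illustrated with a small, library-independent sketch. `NativeResourceSketch` and `SharedResourceUserSketch` are hypothetical types standing in for the weights or embedder and for a generator; the real classes track more than one resource, but the pattern is assumed to be the same: whoever creates a resource disposes it.

```csharp
using System;

// Hypothetical stand-in for a native resource such as model weights.
public sealed class NativeResourceSketch : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => IsDisposed = true;
}

// Hypothetical stand-in for a generator that may own or borrow its resource.
public sealed class SharedResourceUserSketch : IDisposable
{
    private readonly NativeResourceSketch _resource;
    private readonly bool _ownsResource;

    // Creates its own resource, so it is responsible for disposing it.
    public SharedResourceUserSketch()
    {
        _resource = new NativeResourceSketch();
        _ownsResource = true;
    }

    // Borrows a caller-supplied resource; the caller stays responsible for it.
    public SharedResourceUserSketch(NativeResourceSketch shared)
    {
        _resource = shared ?? throw new ArgumentNullException(nameof(shared));
        _ownsResource = false;
    }

    public NativeResourceSketch Resource => _resource;

    public void Dispose()
    {
        // Only release what this instance created.
        if (_ownsResource)
            _resource.Dispose();
    }
}
```

Disposing a borrowing instance leaves the shared resource usable by other components, which is the behavior needed when the embedding generator and the text generator share one set of weights.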

Tokenization Consistency Requirements

Kernel Memory requires consistent tokenization between the embedding generator and text generator to ensure that document chunks and prompts are handled identically by the underlying model's vocabulary.

Tokenizer Implementation Compliance

The integration implements ITextTokenizer (via the generators) and is covered by unit tests that check the following:

Reconstruction: The strings returned by GetTokens() must reproduce the original text when joined (LLama.Unittest/KernelMemory/ITextTokenizerTests.cs:37-50)
Unicode Stability: Multi-byte characters (ideograms, emojis) are handled correctly even when one character spans multiple numeric tokens (LLama.Unittest/KernelMemory/ITextTokenizerTests.cs:74-90)
BOS Handling: Both CountTokens and GetTokens use addBos=true so that token counts stay consistent (LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:117-131)

Sources: LLama.Unittest/KernelMemory/ITextTokenizerTests.cs:9-117, LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:115-134, LLama.KernelMemory/LlamaSharpTextGenerator.cs:134-153
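The reconstruction property described above can be expressed as a small check that works against any tokenizer function. The sketch below is not the project's test code: `TokenizerChecksSketch` is a hypothetical helper, and the toy tokenizer (one token per Unicode text element) only stands in for a real model tokenizer so that the example runs without a model file.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper; not part of LLamaSharp or its test suite.
public static class TokenizerChecksSketch
{
    // True when joining the token strings yields exactly the original text.
    public static bool Reconstructs(Func<string, IReadOnlyList<string>> getTokens, string text)
        => string.Concat(getTokens(text)) == text;

    // Toy tokenizer: emits one token per Unicode text element, so surrogate pairs stay intact.
    public static IReadOnlyList<string> TextElementTokens(string text)
    {
        var tokens = new List<string>();
        var enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            tokens.Add(enumerator.GetTextElement());
        return tokens;
    }
}
```

A call such as `TokenizerChecksSketch.Reconstructs(TokenizerChecksSketch.TextElementTokens, "héllo 世界 😀")` exercises the same idea as the Unicode stability check: text containing ideograms and emojis must survive the round trip unchanged.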

URL: https://deepwiki.com/SciSharp/LLamaSharp/7.2-kernel-memory-integration