URL: https://deepwiki.com/SciSharp/LLamaSharp/7.2-kernel-memory-integration

Kernel Memory Integration | SciSharp/LLamaSharp | DeepWiki


Last indexed: 18 May 2026 (ecd184)

Kernel Memory Integration

The Kernel Memory integration package (LLamaSharp.kernel-memory) enables LLamaSharp to serve as a local backend for Microsoft Kernel Memory, a framework for building Retrieval-Augmented Generation (RAG) pipelines. This integration provides implementations of Kernel Memory's ITextEmbeddingGenerator and ITextGenerator interfaces, allowing LLamaSharp models to perform both embedding generation for document indexing and text generation for response synthesis.

For orchestration-focused Semantic Kernel integration, see Semantic Kernel Integration. For general RAG concepts using LLamaSharp embeddings without Kernel Memory, see Text Embeddings.

Architecture Overview

The integration layer translates between Kernel Memory's abstraction interfaces and LLamaSharp's core components. Two primary adapter classes implement the required interfaces, delegating to LLamaEmbedder and StatelessExecutor respectively.

Component Architecture


Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:13-135, LLama.KernelMemory/LlamaSharpTextGenerator.cs:13-154, LLama.KernelMemory/BuilderExtensions.cs:11-94

Text Embedding Generation

The LLamaSharpTextEmbeddingGenerator class implements ITextEmbeddingGenerator to provide document and query embeddings for Kernel Memory's indexing and retrieval operations.

Implementation Architecture


Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:13-135

Key Configuration Parameters

The embedding generator applies specific ModelParams configuration optimized for embedding tasks:

Parameter | Value | Purpose | Source
--- | --- | --- | ---
PoolingType | Mean | Averages token embeddings into a single document vector | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:45
FlashAttention | true | Enables optimized attention computation | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:43
BatchSize | 512 | Large batch for efficient embedding computation | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:41
UBatchSize | 512 | Matches the logical batch size for the embedding workload | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:42
UseMemorymap | true | Memory-maps the model file for efficient loading | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:44

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:35-46, LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:63-74
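To make the table concrete, the following is a minimal sketch of a ModelParams instance carrying the same embedding-oriented values. It assumes the LLamaSharp NuGet package is referenced, that these properties live on ModelParams in the LLama.Common namespace, and that the pooling enum is LLamaPoolingType in LLama.Native. The model path is hypothetical, and this is not a copy of the generator's own code.

```csharp
using LLama.Common;
using LLama.Native;

// Hypothetical path: point this at a real GGUF embedding model.
var modelPath = "models/embedding-model.gguf";

// Values mirror the table above; property names are assumed to match
// the installed LLamaSharp version.
var embeddingParams = new ModelParams(modelPath)
{
    BatchSize = 512,                      // logical batch size
    UBatchSize = 512,                     // physical batch size, kept equal to BatchSize
    FlashAttention = true,                // optimized attention computation
    UseMemorymap = true,                  // memory-map the model file
    PoolingType = LLamaPoolingType.Mean,  // average token embeddings into one vector
};
```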

Tokenization Support

Both CountTokens and GetTokens are implemented to satisfy Kernel Memory's tokenization requirements for chunking and context management.

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:115-134

Text Generation

The LlamaSharpTextGenerator class implements ITextGenerator for generating responses in RAG pipelines, using StatelessExecutor for stateless request-response patterns.

Implementation Architecture


Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:13-154

Parameter Conversion

The OptionsToParams method translates Kernel Memory's TextGenerationOptions into LLamaSharp's InferenceParams.

Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:94-126
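The general shape of such a conversion can be sketched with stand-in types. RequestOptions, SamplingSettings, OptionsMapper, and the 1024-token fallback below are all hypothetical and simplified; they are not the real TextGenerationOptions, InferenceParams, or OptionsToParams, and the choice of fields is an assumption made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for Kernel Memory's per-request options (hypothetical, simplified).
public sealed record RequestOptions(
    double Temperature,
    double TopP,
    int? MaxTokens,
    IReadOnlyList<string> StopSequences);

// Stand-in for LLamaSharp's inference parameters (hypothetical, simplified).
public sealed record SamplingSettings(
    float Temperature,
    float TopP,
    int MaxTokens,
    IReadOnlyList<string> AntiPrompts);

public static class OptionsMapper
{
    // Assumed fallback when neither the request nor the defaults set a limit.
    private const int FallbackMaxTokens = 1024;

    public static SamplingSettings ToSettings(RequestOptions options, SamplingSettings? defaults)
    {
        // Start from the configured stop strings (if any), then add the request's.
        IEnumerable<string> configuredStops = defaults?.AntiPrompts ?? Array.Empty<string>();
        var stops = configuredStops.Concat(options.StopSequences).Distinct().ToList();

        return new SamplingSettings(
            Temperature: (float)options.Temperature,
            TopP: (float)options.TopP,
            // Request value wins, then the configured default, then the fallback.
            MaxTokens: options.MaxTokens ?? defaults?.MaxTokens ?? FallbackMaxTokens,
            AntiPrompts: stops);
    }
}
```

In this sketch, per-request values take precedence and the configured defaults only fill in what the request leaves unset. Whether the real method merges or replaces stop sequences should be checked against the cited source lines.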

Stateless Execution Pattern

The generator uses StatelessExecutor rather than a stateful executor because Kernel Memory manages conversation state externally. Each request is independent and runs in a fresh inference context (LLama.KernelMemory/LlamaSharpTextGenerator.cs:16).

Sources: LLama.KernelMemory/LlamaSharpTextGenerator.cs:31-48, LLama.KernelMemory/LlamaSharpTextGenerator.cs:57-75
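The stateless pattern can be illustrated by calling StatelessExecutor directly, outside Kernel Memory. This is a minimal sketch that assumes the LLamaSharp package and a native backend are installed. The model path and prompts are hypothetical, and the snippet is not taken from LlamaSharpTextGenerator.

```csharp
using System;
using LLama;
using LLama.Common;

// Hypothetical model path; any local GGUF model would do.
var parameters = new ModelParams("models/model.gguf") { ContextSize = 2048 };
using var weights = LLamaWeights.LoadFromFile(parameters);

// One executor over the loaded weights; no conversation state is kept between calls.
var executor = new StatelessExecutor(weights, parameters);
var inference = new InferenceParams { MaxTokens = 64 };

foreach (var prompt in new[] { "Name one planet.", "Name one ocean." })
{
    // Each call starts from an empty context, so the second prompt
    // cannot see the first prompt or its answer.
    await foreach (var piece in executor.InferAsync(prompt, inference))
        Console.Write(piece);
    Console.WriteLine();
}
```

Because nothing carries over between calls, any history a caller wants the model to see has to be included in the prompt text itself.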

Configuration

The LLamaSharpConfig class provides centralized configuration for both embedding and text generation components.

Configuration Properties

Property | Type | Description | Default | Source
--- | --- | --- | --- | ---
ModelPath | string | Path to GGUF model file | Required | LLama.KernelMemory/LlamaSharpConfig.cs:23
ContextSize | uint? | Maximum context window size | 2048 | LLama.KernelMemory/LlamaSharpConfig.cs:28
GpuLayerCount | int? | Number of layers to offload to GPU | 20 | LLama.KernelMemory/LlamaSharpConfig.cs:33
MainGpu | int | Primary GPU device ID | 0 | LLama.KernelMemory/LlamaSharpConfig.cs:53
SplitMode | GPUSplitMode | Multi-GPU distribution strategy | None | LLama.KernelMemory/LlamaSharpConfig.cs:59
DefaultInferenceParams | InferenceParams? | Default sampling parameters for generation | null | LLama.KernelMemory/LlamaSharpConfig.cs:64

Sources: LLama.KernelMemory/LlamaSharpConfig.cs:9-65

GPU Split Mode Behavior

How the MainGpu property is interpreted depends on SplitMode (LLama.KernelMemory/LlamaSharpConfig.cs:36-50):

  • None: The GPU that is used for the entire model.
  • Row: The GPU that is used for small tensors and intermediate results.
  • Layer: Ignored.

Sources: LLama.KernelMemory/LlamaSharpConfig.cs:36-53
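As an illustration of the properties above, a configuration for a single-GPU machine might look like the sketch below. It assumes the LLamaSharp.kernel-memory package is referenced, that LLamaSharpConfig lives in the LLamaSharp.KernelMemory namespace and takes the model path in its constructor, and that GPUSplitMode comes from LLama.Native. The path and numeric values are hypothetical.

```csharp
using LLama.Common;
using LLama.Native;
using LLamaSharp.KernelMemory;

// Hypothetical path and values, chosen only to show each property in use.
var config = new LLamaSharpConfig("models/model.gguf")
{
    ContextSize = 4096,             // overrides the 2048 default
    GpuLayerCount = 32,             // overrides the default of 20 offloaded layers
    SplitMode = GPUSplitMode.None,  // keep the whole model on one device
    MainGpu = 0,                    // with None, this GPU holds the entire model
    DefaultInferenceParams = new InferenceParams { MaxTokens = 256 },
};
```

With SplitMode set to Layer, the MainGpu value in this sketch would be ignored, as the list above states.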

Builder Extensions

The BuilderExtensions class provides fluent API extensions for IKernelMemoryBuilder to register LLamaSharp components.

Extension Methods Overview


Sources: LLama.KernelMemory/BuilderExtensions.cs:11-94

Resource Sharing with WithLLamaSharpDefaults

The WithLLamaSharpDefaults method optimizes resource usage by sharing LLamaWeights between embedding and generation components:

  1. Creates or reuses LLamaWeights from the model file (LLama.KernelMemory/BuilderExtensions.cs:86)
  2. Creates a shared ModelParams configuration (LLama.KernelMemory/BuilderExtensions.cs:72-82)
  3. Instantiates StatelessExecutor with the shared weights (LLama.KernelMemory/BuilderExtensions.cs:89)
  4. Registers LLamaSharpTextEmbeddingGenerator, sharing the weights (LLama.KernelMemory/BuilderExtensions.cs:90)
  5. Registers LlamaSharpTextGenerator, sharing the weights and executor (LLama.KernelMemory/BuilderExtensions.cs:91)

Sources: LLama.KernelMemory/BuilderExtensions.cs:70-93
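From the caller's side, the steps above are triggered by a single builder call. The following is a minimal usage sketch, assuming the Microsoft.KernelMemory and LLamaSharp.kernel-memory packages are referenced and that Kernel Memory's default in-process storage is acceptable. The model path, document text, document ID, and question are hypothetical.

```csharp
using System;
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// Hypothetical model path; the same model serves embeddings and generation here.
var config = new LLamaSharpConfig("models/model.gguf");

// WithLLamaSharpDefaults registers both the embedding generator and the text generator.
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build<MemoryServerless>();

// Index a small document, then ask a question answered from it.
await memory.ImportTextAsync("LLamaSharp runs GGUF models locally.", documentId: "doc-1");
var answer = await memory.AskAsync("What kind of models does LLamaSharp run?");
Console.WriteLine(answer.Result);
```

Sharing one set of weights this way avoids loading the model file twice, at the cost of using the same model for both embedding and generation.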

Resource Management

The integration implements resource ownership tracking to prevent premature disposal when components share underlying native resources.

Ownership Tracking Pattern


Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:90-100, LLama.KernelMemory/LlamaSharpTextGenerator.cs:78-84

Constructor-Based Ownership

The ownership model is determined at construction time:

Constructor Signature | Owns Weights | Owns Embedder | Source
--- | --- | --- | ---
LLamaSharpTextEmbeddingGenerator(LLamaSharpConfig) | Yes | Yes | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:50-51
LLamaSharpTextEmbeddingGenerator(config, LLamaWeights) | No | Yes | LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:77
LLamaSharpTextEmbeddingGenerator(LLamaEmbedder) | No | No |

Sources: LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:28-88
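The ownership idea can be shown in isolation with stand-in types. FakeWeights and WeightsConsumer below are hypothetical and exist only for this sketch. They illustrate the general pattern of recording ownership in the constructor and consulting it in Dispose, not the actual fields or logic of the generator classes.

```csharp
using System;

// Stand-in for a native resource such as model weights (hypothetical type).
public sealed class FakeWeights : IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;
}

// Wrapper that disposes the resource only if it created it.
public sealed class WeightsConsumer : IDisposable
{
    private readonly FakeWeights _weights;
    private readonly bool _ownsWeights;

    // Owning constructor: creates the resource, so it must also dispose it.
    public WeightsConsumer()
    {
        _weights = new FakeWeights();
        _ownsWeights = true;
    }

    // Sharing constructor: the caller created the resource and keeps ownership.
    public WeightsConsumer(FakeWeights shared)
    {
        _weights = shared;
        _ownsWeights = false;
    }

    public FakeWeights Weights => _weights;

    public void Dispose()
    {
        // Shared resources are left alone so other consumers can keep using them.
        if (_ownsWeights)
            _weights.Dispose();
    }
}
```

Disposing a WeightsConsumer built from shared weights leaves those weights usable by other components, while one built with the parameterless constructor releases its own.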

Tokenization Consistency Requirements

Kernel Memory requires consistent tokenization between the embedding generator and text generator to ensure that document chunks and prompts are handled identically by the underlying model's vocabulary.

Tokenizer Implementation Compliance

The integration implements ITextTokenizer (via the generators), and its tokenization behavior is covered by the unit tests in LLama.Unittest/KernelMemory/ITextTokenizerTests.cs.

Sources: LLama.Unittest/KernelMemory/ITextTokenizerTests.cs:9-117, LLama.KernelMemory/LLamaSharpTextEmbeddingGenerator.cs:115-134, LLama.KernelMemory/LlamaSharpTextGenerator.cs:134-153