
URL: https://deepwiki.com/SciSharp/LLamaSharp

SciSharp/LLamaSharp | DeepWiki


Last indexed: 18 May 2026 (ecd184)

Overview

LLamaSharp is a cross-platform .NET library that provides managed bindings to llama.cpp, enabling efficient local execution of large language models (LLMs) in .NET applications. It supports both CPU and GPU acceleration, runs models in GGUF format, and provides high-level APIs for text generation, embeddings, and multimodal inference (LLama/LLamaSharp.csproj:19-23).

This page provides an architectural overview of the LLamaSharp ecosystem. For installation instructions, see Installation and Setup. For getting started with code examples, see Quick Start Guide. For details on specific components, refer to the architecture sections (Core Architecture, Executors and Inference, Sampling and Token Selection, Advanced Features).

Purpose and Capabilities

LLamaSharp serves three primary functions:

  1. Native Library Integration: Wraps llama.cpp's C/C++ APIs with safe, idiomatic .NET interfaces using P/Invoke and SafeHandle patterns (docs/Architecture.md:7).
  2. High-Level Execution APIs: Provides multiple execution patterns (interactive chat, instruction-following, stateless inference, batched conversations) through the ILLamaExecutor abstraction (docs/Architecture.md:10).
  3. Framework Integration: Bridges LLamaSharp to Microsoft AI frameworks (Semantic Kernel, Kernel Memory) and third-party ecosystems (README.md:61-70).
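
The SafeHandle pattern named in item 1 can be illustrated with a minimal, self-contained sketch. `SafeNativeBufferHandle` and `SafeHandleDemo` are hypothetical names invented for this example, and `Marshal.AllocHGlobal` stands in for a native "create" call. LLamaSharp's own handles (such as SafeLlamaModelHandle) wrap pointers returned by llama.cpp, and their implementation is not shown here.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical handle type for illustration only; it is not part of LLamaSharp.
public sealed class SafeNativeBufferHandle : SafeHandle
{
    public SafeNativeBufferHandle(int byteCount)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        // Stand-in for a P/Invoke call that returns a native pointer.
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    // A zero pointer means the native resource was never acquired.
    public override bool IsInvalid => handle == IntPtr.Zero;

    // Invoked by the runtime when the handle is disposed or finalized,
    // so the native memory is released even if the caller forgets to dispose.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

public static class SafeHandleDemo
{
    // Writes a value through the raw pointer and reads it back.
    public static int RoundTrip()
    {
        using (var buffer = new SafeNativeBufferHandle(sizeof(int)))
        {
            IntPtr pointer = buffer.DangerousGetHandle();
            Marshal.WriteInt32(pointer, 42);
            return Marshal.ReadInt32(pointer);
        }
    }
}
```

Calling `SafeHandleDemo.RoundTrip()` returns 42, and the unmanaged buffer is freed when the `using` block exits. The design point is that ownership of the native pointer lives in one disposable object rather than in a bare `IntPtr`.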

The library targets netstandard2.0 and net8.0, enabling compatibility across .NET Framework, .NET Core, and modern .NET applications (LLama/LLamaSharp.csproj:1-3).

Sources: LLama/LLamaSharp.csproj:1-33, README.md:14-23, docs/Architecture.md:3-16

Package Ecosystem

LLamaSharp uses a modular distribution strategy with separate packages for core functionality, framework integrations, and hardware-specific backends.

Package Structure


Diagram: LLamaSharp Package Distribution Model

| Package | Purpose | Target Framework | Dependencies |
|---|---|---|---|
| LLamaSharp | Core library with inference APIs | netstandard2.0, net8.0 | Microsoft.Extensions.AI.Abstractions |
| LLamaSharp.semantic-kernel | Semantic Kernel integration | netstandard2.0, net8.0 | Microsoft.SemanticKernel.Abstractions |
| LLamaSharp.kernel-memory | Kernel Memory integration (RAG) | net8.0 | Microsoft.KernelMemory.Abstractions |
| LLamaSharp.Backend.Cpu | CPU binaries (+ Metal for macOS) | Runtime | Native .dll/.so/.dylib |
| LLamaSharp.Backend.Cuda11 | CUDA 11 GPU acceleration | Runtime | Native .dll/.so |
| LLamaSharp.Backend.Cuda12 | CUDA 12 GPU acceleration | Runtime | Native .dll/.so |
| LLamaSharp.Backend.Vulkan | Vulkan GPU acceleration | Runtime | Native .dll/.so |

Users install the LLamaSharp core package plus exactly one backend package matching their hardware (README.md:89-108). The backend packages contain pre-compiled llama.cpp binaries. During build, binaries are downloaded and extracted to the runtimes/ directory via MSBuild targets (LLama/LLamaSharp.csproj:71-81). For more details, see Package Architecture.

Sources: README.md:89-108, LLama/LLamaSharp.csproj:50-100, LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj:1-51, LLama.KernelMemory/LLamaSharp.KernelMemory.csproj:1-37

High-Level Architecture

LLamaSharp implements a layered architecture that progressively abstracts from native C++ code to high-level .NET APIs.


Diagram: LLamaSharp Layered Architecture

Layer Descriptions

| Layer | Key Types | Responsibility |
|---|---|---|
| Application Layer | User code, ChatSession | High-level conversational API and session state management (docs/Architecture.md:11) |
| Executor Layer | ILLamaExecutor, InteractiveExecutor, StatelessExecutor | Abstraction of execution patterns (chat vs. instruction vs. stateless) (docs/Architecture.md:10) |
| Core Abstraction Layer | LLamaWeights, LLamaContext | Managed wrappers for model weights and inference context state (docs/Architecture.md:8-9) |
| Configuration Layer | ModelParams, InferenceParams | Configuration objects for loading models and controlling inference |
| Native Interop Layer | SafeLlamaModelHandle, NativeApi, NativeLibraryConfig | Memory-safe P/Invoke and resource management (docs/Architecture.md:7) |
| Native Library Layer | llama.cpp | The underlying C++ inference engine (README.md:14) |
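
To make the layering concrete, the following sketch walks from the configuration layer up to the application layer. It follows the shape of the project's README quick start and assumes the LLamaSharp package plus one backend package are installed. `LayeredStackSketch`, the model path, the prompts, and the parameter values are illustrative assumptions, and member names may differ between LLamaSharp versions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class LayeredStackSketch
{
    // modelPath is assumed to point at a local GGUF file.
    public static async Task RunAsync(string modelPath)
    {
        // Configuration layer: how to load the model and size the context.
        var parameters = new ModelParams(modelPath)
        {
            ContextSize = 1024,
            GpuLayerCount = 0 // CPU only; raise this to offload layers to a GPU backend
        };

        // Core abstraction layer: weights first, then a context over them.
        using var weights = LLamaWeights.LoadFromFile(parameters);
        using var context = weights.CreateContext(parameters);

        // Executor layer: choose the execution pattern.
        var executor = new InteractiveExecutor(context);

        // Application layer: a chat session that tracks the conversation history.
        var history = new ChatHistory();
        history.AddMessage(AuthorRole.System, "You are a concise assistant.");
        var session = new ChatSession(executor, history);

        var inferenceParams = new InferenceParams
        {
            MaxTokens = 128,
            AntiPrompts = new List<string> { "User:" }
        };

        // Stream the reply piece by piece as it is generated.
        var userMessage = new ChatHistory.Message(AuthorRole.User, "What is GGUF?");
        await foreach (var piece in session.ChatAsync(userMessage, inferenceParams))
        {
            Console.Write(piece);
        }
    }
}
```

Each object is constructed from the layer beneath it, so swapping the executor or the session does not require reloading the model.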

Sources: docs/Architecture.md:3-16, README.md:14-23, LLama/LLamaSharp.csproj:1-33

Core Components

Model Loading: LLamaWeights

LLamaWeights is the primary holder of model weights (docs/Architecture.md:8). It encapsulates the native llama_model* pointer via a SafeLlamaModelHandle. Multiple LLamaContext instances can share a single LLamaWeights to optimize memory usage when running multiple tasks on the same model (docs/Architecture.md:9).

Inference Sessions: LLamaContext

LLamaContext manages the state for a specific inference session, including the KV cache (docs/Architecture.md:9). It utilizes LLamaWeights and interacts with the native library to perform tokenization and forward passes.
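
The relationship between one LLamaWeights and several LLamaContext instances can be sketched as follows. This is a hedged example, not code from the repository: `SharedWeightsSketch` is a hypothetical name, the context size is arbitrary, and the `Tokenize` call is used as it is commonly documented for LLamaContext, so its exact signature should be checked against the installed version.

```csharp
using System;
using LLama;
using LLama.Common;

public static class SharedWeightsSketch
{
    // modelPath is assumed to point at a local GGUF file.
    public static void Run(string modelPath)
    {
        var parameters = new ModelParams(modelPath) { ContextSize = 512 };

        // The weights are loaded into memory once.
        using var weights = LLamaWeights.LoadFromFile(parameters);

        // Each context keeps its own session state (including its KV cache)
        // while reusing the same underlying weights.
        using var chatContext = weights.CreateContext(parameters);
        using var instructContext = weights.CreateContext(parameters);

        // Two independent tasks running against the same model.
        var chatExecutor = new InteractiveExecutor(chatContext);
        var instructExecutor = new InstructExecutor(instructContext);

        // Tokenization goes through a context (assumed signature: Tokenize(string)).
        var tokens = chatContext.Tokenize("Hello, world!");
        Console.WriteLine($"Token count: {tokens.Length}");
    }
}
```

Disposal order matters in this arrangement: the contexts are declared after the weights, so the `using` declarations release them first.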

Executors: Execution Patterns

The library provides several executors that define how the model is run (docs/Architecture.md:10):

  • InteractiveExecutor: Designed for multi-turn chat interactions where the context is preserved and shifted.
  • InstructExecutor: Optimized for instruction-following tasks.
  • StatelessExecutor: Used for one-shot inference where context is not preserved between calls.
  • BatchedExecutor: Advanced executor for managing multiple concurrent conversation sequences.
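
Because the executors share the ILLamaExecutor abstraction, calling code can be written once against the interface. The sketch below pairs such a helper with a StatelessExecutor for one-shot inference. `ExecutorSketch`, `CompleteAsync`, and `OneShotAsync` are hypothetical names, and the token limit and context size are arbitrary; the constructor and `InferAsync` usage reflect the commonly documented API and may vary by version.

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Abstractions;
using LLama.Common;

public static class ExecutorSketch
{
    // Depends only on ILLamaExecutor, so any executor can be passed in.
    public static async Task<string> CompleteAsync(
        ILLamaExecutor executor, string prompt, int maxTokens)
    {
        var inferenceParams = new InferenceParams { MaxTokens = maxTokens };
        var output = new StringBuilder();

        // InferAsync streams generated text incrementally.
        await foreach (var piece in executor.InferAsync(prompt, inferenceParams))
        {
            output.Append(piece);
        }
        return output.ToString();
    }

    // One-shot inference: no conversation state is carried between calls.
    public static async Task<string> OneShotAsync(string modelPath, string prompt)
    {
        var parameters = new ModelParams(modelPath) { ContextSize = 1024 };
        using var weights = LLamaWeights.LoadFromFile(parameters);

        // StatelessExecutor is built from the weights rather than from an existing context.
        var executor = new StatelessExecutor(weights, parameters);
        return await CompleteAsync(executor, prompt, maxTokens: 64);
    }
}
```

Passing an InteractiveExecutor or InstructExecutor to `CompleteAsync` instead would reuse the same streaming loop while keeping context between calls.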

For more information on starting your first inference, see Quick Start Guide.

Sources: docs/Architecture.md:3-16, README.md:20-22

Framework Integrations

LLamaSharp integrates with major .NET AI ecosystems, including Semantic Kernel and Kernel Memory, to simplify RAG and agent development.

Sources: README.md:61-82, docs/index.md:18-27

Version Compatibility

LLamaSharp version 0.27.0 is pinned to llama.cpp commit 3f7c29d318e317b63f54c558bc69803963d7d88c (LLama/LLamaSharp.csproj:10-25). Native library loading is managed by NativeLibraryConfig and NativeApi, which ensure the correct backend is resolved at runtime (docs/FAQ.md:8).

Sources: LLama/LLamaSharp.csproj:10-26, docs/FAQ.md:1-82