
URL: https://deepwiki.com/SciSharp/LLamaSharp/1.3-quick-start-guide

Quick Start Guide | SciSharp/LLamaSharp | DeepWiki


Last indexed: 18 May 2026 (ecd184)

Quick Start Guide

This page walks through the minimum code required to load a model, create a context, set up a ChatSession, and read streamed output token by token. It assumes packages are already installed and a GGUF model file is available. For installation steps and backend selection, see Installation and Setup. For a conceptual explanation of the layered architecture, see Core Architecture.


Prerequisites

| Requirement | Details |
| --- | --- |
| NuGet packages | LLamaSharp + one backend (e.g. LLamaSharp.Backend.Cpu) (README.md:92-104) |
| Model file | A .gguf file; see Installation and Setup for sourcing guidance (README.md:108-115) |
| Target framework | net8.0 or netstandard2.0 (LLama/LLamaSharp.csproj:3) |

Object Construction Flow

The following diagram maps the conceptual steps (configure → load → wrap → chat) to the specific C# types involved in a typical text generation workflow.

Diagram: From configuration to streaming output


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:17-33, LLama.Examples/Examples/LLama3ChatSession.cs:64-72


Step-by-Step Walkthrough

Step 1 — Global Native Configuration (Optional)

Before any LLM operations, you can configure global settings like logging or preferred hardware backends using NativeLibraryConfig. This must be done before any other native calls are made, as the static constructor of NativeApi locks the configuration upon the first P/Invoke.
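
As a minimal sketch of this step, the snippet below registers a log callback and states a backend preference before anything else touches the native library. It assumes a recent LLamaSharp release that exposes NativeLibraryConfig.All (older releases used NativeLibraryConfig.Instance instead); the log format and the choice of CUDA are illustrative only.

```csharp
using System;
using LLama.Native;

// Assumption: NativeLibraryConfig.All is available in the installed LLamaSharp version.
// Route llama.cpp log messages to the console (the message format here is illustrative).
NativeLibraryConfig
    .All
    .WithLogCallback((level, message) =>
        Console.WriteLine($"[llama {level}] {message.TrimEnd('\n')}"));

// Prefer the CUDA backend if one is installed. This must run before the first
// native call (for example, before LLamaWeights.LoadFromFile).
NativeLibraryConfig
    .All
    .WithCuda();
```

Placing these calls at the very top of the program entry point is the simplest way to guarantee they run before the first native call.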


Sources: LLama.Examples/Examples/SpeechChat.cs:55-65, LLama/LLamaSharp.csproj:71-81

Step 2 — Configure ModelParams

ModelParams (in the LLama.Common namespace) holds the model path and hardware tuning options.

| Property | Purpose |
| --- | --- |
| ContextSize | Maximum token context window (e.g. 1024, 4096) |
| GpuLayerCount | Number of transformer layers offloaded to GPU; 0 = CPU only (LLama.Examples/Examples/LLama3ChatSession.cs:19) |
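
The two properties above could be set as in the following sketch. The model path is a placeholder and the numeric values are illustrative assumptions; tune them to your model and hardware.

```csharp
using LLama.Common;

// "path/to/model.gguf" is a hypothetical placeholder; point it at your own GGUF file.
var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 4096,  // maximum number of tokens kept in the context window
    GpuLayerCount = 10   // layers offloaded to the GPU; set to 0 for CPU-only inference
};
```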

Sources: LLama.Examples/Examples/LLama3ChatSession.cs:17-20


Step 3 — Load LLamaWeights

LLamaWeights.LoadFromFile(parameters) (or the async version LoadFromFileAsync) reads the GGUF file from disk and allocates native memory. The returned object is IDisposable; declare it with using so the native handle is released when it goes out of scope.
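
A minimal sketch of loading the weights is shown here. The model path is a hypothetical placeholder, and the console message is only there to make the snippet observable.

```csharp
using System;
using LLama;
using LLama.Common;

// Hypothetical path; replace with the location of your GGUF file.
var parameters = new ModelParams("path/to/model.gguf");

// The using declaration disposes the native model handle at the end of the scope.
using var model = LLamaWeights.LoadFromFile(parameters);

// In async code, the same load can be awaited instead:
//   using var model = await LLamaWeights.LoadFromFileAsync(parameters);

Console.WriteLine("Model loaded.");
```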


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:22, LLama.Examples/Examples/ChatSessionWithHistory.cs:16


Step 4 — Create LLamaContext

model.CreateContext(parameters) allocates the KV cache and other per-context state. A single LLamaWeights object can back multiple contexts simultaneously.
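
The sketch below creates two contexts from one set of loaded weights to illustrate that sharing. The model path and variable names are assumptions; each context allocates its own KV cache, so memory use grows with every context created.

```csharp
using LLama;
using LLama.Common;

// Hypothetical path; replace with the location of your GGUF file.
var parameters = new ModelParams("path/to/model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);

// Both contexts share the same weights but keep independent KV caches and state.
using var firstContext = model.CreateContext(parameters);
using var secondContext = model.CreateContext(parameters);
```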


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:23


Step 5 — Choose an Executor

For interactive chat (stateful, conversation memory across turns) use InteractiveExecutor. For one-shot tasks where context is not preserved, use StatelessExecutor.

| Executor | Use case |
| --- | --- |
| InteractiveExecutor | Multi-turn chat; maintains KV cache state between calls (LLama.Examples/Examples/LLama3ChatSession.cs:24) |
| StatelessExecutor | Fresh context on every call; no memory between turns (LLama.Examples/Examples/StatelessModeExecute.cs:18-22) |
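
The following sketch constructs both executors side by side so the difference in their inputs is visible: InteractiveExecutor takes an existing context, while StatelessExecutor takes the weights and parameters and manages its own context per call. The model path and variable names are placeholders.

```csharp
using LLama;
using LLama.Common;

// Hypothetical path; replace with the location of your GGUF file.
var parameters = new ModelParams("path/to/model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);

// Stateful: wraps a long-lived context, so earlier turns stay in the KV cache.
using var context = model.CreateContext(parameters);
var chatExecutor = new InteractiveExecutor(context);

// Stateless: built from the weights and parameters; nothing is remembered between calls.
var oneShotExecutor = new StatelessExecutor(model, parameters);
```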

Step 6 — Build a ChatSession

ChatSession wraps an executor together with ChatHistory. ChatHistory seeds the conversation with system and example messages.
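
One way to build a seeded session is sketched below. ChatSessionSetup.Create is a hypothetical helper written for illustration, not part of LLamaSharp, and the message texts are invented; the history transform assumes the model ships a chat template that llama.cpp can apply.

```csharp
using LLama;
using LLama.Abstractions;
using LLama.Common;
using LLama.Transformers;

public static class ChatSessionSetup
{
    // Hypothetical helper: seeds a history and wraps the executor in a ChatSession.
    public static ChatSession Create(ILLamaExecutor executor, LLamaWeights model)
    {
        var chatHistory = new ChatHistory();
        chatHistory.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");
        chatHistory.AddMessage(AuthorRole.User, "Hello");
        chatHistory.AddMessage(AuthorRole.Assistant, "Hi! How can I help?");

        var session = new ChatSession(executor, chatHistory);

        // Format the history with the model's own prompt template.
        session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));
        return session;
    }
}
```

The referenced example loads its seed history from a JSON asset via ChatHistory.FromJson; building it in code, as here, avoids depending on an extra file.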


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:27-29


Step 7 — Configure InferenceParams

InferenceParams controls how the executor generates tokens for a single call, including sampling logic via DefaultSamplingPipeline.
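
A minimal configuration might look like the sketch below. The temperature, token limit, and anti-prompt are illustrative assumptions rather than recommended values.

```csharp
using System.Collections.Generic;
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    // Sampling settings; a lower temperature gives more deterministic output.
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f
    },

    // Upper bound on generated tokens per call; -1 generates until an anti-prompt appears.
    MaxTokens = 256,

    // Generation stops when any of these strings is produced (illustrative value).
    AntiPrompts = new List<string> { "User:" }
};
```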


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:40-49


Step 8 — Stream the Response

ChatSession.ChatAsync returns IAsyncEnumerable<string>. Consume it with await foreach to get real-time token output.
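
The streaming loop could be wrapped as in this sketch. ChatStreaming.StreamReplyAsync is a hypothetical helper introduced for illustration; it assumes a session and inference parameters have already been created, prints each chunk as it arrives, and also returns the full reply.

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class ChatStreaming
{
    // Hypothetical helper: sends one user message and streams the reply to the console.
    public static async Task<string> StreamReplyAsync(
        ChatSession session, string userInput, InferenceParams inferenceParams)
    {
        var reply = new StringBuilder();
        var message = new ChatHistory.Message(AuthorRole.User, userInput);

        // Each item is a piece of text (a whole or partial word) as soon as it is decoded.
        await foreach (var text in session.ChatAsync(message, inferenceParams))
        {
            Console.Write(text);
            reply.Append(text);
        }

        Console.WriteLine();
        return reply.ToString();
    }
}
```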


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:64-72


Data Flow During Inference

Diagram: Runtime call chain from ChatAsync to native decode


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:64-72, LLama.Examples/Examples/StatelessModeExecute.cs:50-53


Complete Minimal Example

Combining all steps above (adapted from LLama3ChatSession.cs):
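
The program below is a condensed sketch of such a combined example, not a verbatim copy of the file. It assumes a .NET console project with LLamaSharp and a backend package installed; the model path fallback, system prompt, and parameter values are placeholders, and it relies on model.Tokens.EndOfTurnToken being available in the installed version.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;
using LLama.Sampling;
using LLama.Transformers;

// Model path from the command line, else a hypothetical placeholder.
var modelPath = args.Length > 0 ? args[0] : "path/to/model.gguf";
var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096, // illustrative value
    GpuLayerCount = 10  // set to 0 for CPU-only inference
};

// Load weights, create a context, and wrap it in a stateful executor.
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Seed the conversation and apply the model's prompt template.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");
var session = new ChatSession(executor, chatHistory);
session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));

// Stop at the model's end-of-turn string, or at "User:" if the model defines none.
var stopText = model.Tokens.EndOfTurnToken ?? "User:";
var inferenceParams = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.6f },
    MaxTokens = -1, // generate until an anti-prompt is encountered
    AntiPrompts = new List<string> { stopText }
};

Console.WriteLine("Chat started. Type 'exit' to quit.");
while (true)
{
    Console.Write("User> ");
    var userInput = Console.ReadLine();
    if (userInput is null || userInput == "exit")
        break;

    Console.Write("Assistant> ");
    var message = new ChatHistory.Message(AuthorRole.User, userInput);

    // Print each piece of text as soon as it is generated.
    await foreach (var text in session.ChatAsync(message, inferenceParams))
        Console.Write(text);
    Console.WriteLine();
}
```

The referenced example additionally installs an output transform to suppress end-of-turn tokens in the printed text; that step is omitted here for brevity.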


Sources: LLama.Examples/Examples/LLama3ChatSession.cs:14-80


Key Type Reference

| Type | Namespace | Role | Source |
| --- | --- | --- | --- |
| ModelParams | LLama.Common | Model path + hardware config | LLama.Examples/Examples/LLama3ChatSession.cs:17 |
| LLamaWeights | LLama | Loaded model weights | LLama.Examples/Examples/LLama3ChatSession.cs:22 |
| LLamaContext | LLama | KV cache + inference state | LLama.Examples/Examples/LLama3ChatSession.cs:23 |
| InteractiveExecutor | LLama | Stateful multi-turn executor | LLama.Examples/Examples/LLama3ChatSession.cs:24 |
| StatelessExecutor | LLama | Fresh context per turn | LLama.Examples/Examples/StatelessModeExecute.cs:18 |
| ChatSession | LLama | High-level chat API | LLama.Examples/Examples/LLama3ChatSession.cs:29 |
| InferenceParams | LLama.Common | Generation settings | LLama.Examples/Examples/LLama3ChatSession.cs:40 |
| DefaultSamplingPipeline | LLama.Sampling | Sampling logic (Temperature, etc.) | LLama.Examples/Examples/LLama3ChatSession.cs:42 |
| PromptTemplateTransformer | LLama.Transformers | Formats history into model-specific prompt templates | LLama.Examples/Examples/LLama3ChatSession.cs:33 |

Sources: LLama.Examples/Examples/LLama3ChatSession.cs:1-81, LLama.Examples/Examples/StatelessModeExecute.cs:1-58