Last indexed: 18 May 2026 (ecd184)

Quick Start Guide

This page walks through the minimum code required to load a model, create a context, set up a `ChatSession`, and read streamed output token by token. It assumes the packages are already installed and a GGUF model file is available. For installation steps and backend selection, see Installation and Setup. For a conceptual explanation of the layered architecture, see Core Architecture.

Prerequisites

Requirement	Details
NuGet packages	`LLamaSharp` + one backend (e.g. `LLamaSharp.Backend.Cpu`) README.md92-104
Model file	A `.gguf` file; see Installation and Setup for sourcing guidance README.md108-115
Target framework	`net8.0` or `netstandard2.0` LLama/LLamaSharp.csproj3

Object Construction Flow

The following diagram maps the conceptual steps (configure → load → wrap → chat) to the specific C# types involved in a typical text generation workflow.

Diagram: From configuration to streaming output

Sources: LLama.Examples/Examples/LLama3ChatSession.cs17-33 LLama.Examples/Examples/LLama3ChatSession.cs64-72

Step-by-Step Walkthrough

Step 1 — Global Native Configuration (Optional)

Before running any LLM operations, you can configure global settings such as logging or the preferred hardware backend through `NativeLibraryConfig`. Do this before any other native call is made: the static constructor of `NativeApi` locks the configuration on the first P/Invoke.
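A minimal sketch of such a configuration, assuming a recent LLamaSharp release in which `NativeLibraryConfig.All`, `WithLogCallback`, and `WithCuda` are available (older releases expose a different entry point, so check the version you have installed). The log prefix format is an illustrative choice, not something the library requires.

```csharp
using System;
using LLama.Native;

// Run this before any other LLamaSharp call touches the native library;
// later changes are rejected once the native library has been loaded.
NativeLibraryConfig
    .All
    .WithLogCallback((level, message) =>
        // llama.cpp messages usually carry their own trailing newline.
        Console.Write($"[llama {level}] {message}"));

// Assumption: express a preference for the CUDA backend.
// Only useful if a CUDA backend package is installed.
NativeLibraryConfig
    .All
    .WithCuda();
```

If you do not need custom logging or backend selection, this step can be skipped and the defaults apply.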

Sources: LLama.Examples/Examples/SpeechChat.cs55-65 LLama/LLamaSharp.csproj71-81

Step 2 — Configure `ModelParams`

`ModelParams` (in the `LLama.Common` namespace) holds the model path and hardware tuning options.

Property	Purpose
`ContextSize`	Maximum token context window (e.g. `1024`, `4096`)
`GpuLayerCount`	Number of transformer layers offloaded to GPU; `0` = CPU only LLama.Examples/Examples/LLama3ChatSession.cs19
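The two properties above might be set as follows. This is a sketch: the model path is a hypothetical placeholder, and the values are examples rather than recommendations.

```csharp
using LLama.Common;

// Hypothetical path; point this at your own GGUF file.
var modelPath = "models/example.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096,  // maximum token context window
    GpuLayerCount = 0    // 0 = CPU only; raise to offload layers to the GPU
};
```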

Sources: LLama.Examples/Examples/LLama3ChatSession.cs17-20

Step 3 — Load `LLamaWeights`

`LLamaWeights.LoadFromFile(parameters)` (or its async counterpart, `LoadFromFileAsync`) reads the GGUF file from disk and allocates native memory. The returned object is `IDisposable`; declare it with `using` so that the native handle is released.
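A short sketch of the async variant, assuming a hypothetical model path. The `using` declaration disposes the weights when the enclosing scope ends.

```csharp
using LLama;
using LLama.Common;

// Hypothetical path; point this at your own GGUF file.
var parameters = new ModelParams("models/example.gguf");

// Loads the GGUF file without blocking the calling thread.
// `using` releases the native handle when `model` goes out of scope.
using var model = await LLamaWeights.LoadFromFileAsync(parameters);
```

The synchronous `LLamaWeights.LoadFromFile(parameters)` can be substituted where blocking is acceptable, for example in a small console tool.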

Sources: LLama.Examples/Examples/LLama3ChatSession.cs22 LLama.Examples/Examples/ChatSessionWithHistory.cs16

Step 4 — Create `LLamaContext`

`model.CreateContext(parameters)` allocates the KV cache and other per-context state. A single `LLamaWeights` object can back multiple contexts simultaneously.
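To illustrate the last point, the sketch below creates two independent contexts from one set of weights. The model path and the variable names are hypothetical.

```csharp
using LLama;
using LLama.Common;

// Hypothetical path; point this at your own GGUF file.
var parameters = new ModelParams("models/example.gguf") { ContextSize = 1024 };

using var model = LLamaWeights.LoadFromFile(parameters);

// Each context has its own KV cache but shares the loaded weights.
using var contextA = model.CreateContext(parameters);
using var contextB = model.CreateContext(parameters);
```

Each context allocates its own KV cache, so memory use grows with the number of contexts and with `ContextSize`.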

Sources: LLama.Examples/Examples/LLama3ChatSession.cs23

Step 5 — Choose an Executor

For interactive chat (stateful, with conversation memory across turns), use `InteractiveExecutor`. For one-shot tasks where context is not preserved, use `StatelessExecutor`.

Executor	Use case
`InteractiveExecutor`	Multi-turn chat; maintains KV cache state between calls LLama.Examples/Examples/LLama3ChatSession.cs24
`StatelessExecutor`	Fresh context on every call; no memory between turns LLama.Examples/Examples/StatelessModeExecute.cs18-22
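The sketch below constructs both executors side by side to show how their constructors differ; in practice you would normally pick one. It assumes a hypothetical model path. Note that `InteractiveExecutor` wraps an existing context, while `StatelessExecutor` takes the weights and parameters and manages its own context per call.

```csharp
using LLama;
using LLama.Common;

// Hypothetical path; point this at your own GGUF file.
var parameters = new ModelParams("models/example.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);

// Stateful: keeps KV cache state in the supplied context between calls.
using var context = model.CreateContext(parameters);
var interactive = new InteractiveExecutor(context);

// Stateless: built from weights + parameters, no memory between calls.
var stateless = new StatelessExecutor(model, parameters);
```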

Step 6 — Build a `ChatSession`

`ChatSession` wraps an executor together with a `ChatHistory`. The `ChatHistory` seeds the conversation with system and example messages.
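One way to build the session is sketched below. The helper class `ChatSessionSetup`, its `Create` method, and the system prompt text are hypothetical names introduced for illustration. The `PromptTemplateTransformer` call follows the pattern in the referenced example and assumes the model ships a chat template that llama.cpp supports.

```csharp
using LLama;
using LLama.Abstractions;
using LLama.Common;
using LLama.Transformers;

// Hypothetical helper, not part of LLamaSharp.
public static class ChatSessionSetup
{
    public static ChatSession Create(ILLamaExecutor executor, LLamaWeights model)
    {
        // Seed the conversation with a system message.
        var history = new ChatHistory();
        history.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");

        var session = new ChatSession(executor, history);

        // Format the history with the model's own prompt template.
        session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));
        return session;
    }
}
```

A call such as `ChatSessionSetup.Create(executor, model)` would then return a session ready for `ChatAsync`.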

Sources: LLama.Examples/Examples/LLama3ChatSession.cs27-29

Step 7 — Configure `InferenceParams`

`InferenceParams` controls how the executor generates tokens for a single call, including sampling logic via `DefaultSamplingPipeline`.
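A sketch of a typical configuration, modeled on the referenced example. The temperature and the `"User:"` anti-prompt are illustrative values; the right stop string depends on the model's chat format.

```csharp
using System.Collections.Generic;
using LLama.Common;
using LLama.Sampling;

var inferenceParams = new InferenceParams
{
    // Sampling settings live on the pipeline object.
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f
    },

    // -1 = keep generating until an anti-prompt is encountered.
    MaxTokens = -1,

    // Generation stops when any of these strings appears in the output.
    AntiPrompts = new List<string> { "User:" }
};
```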

Sources: LLama.Examples/Examples/LLama3ChatSession.cs40-49

Step 8 — Stream the Response

`ChatSession.ChatAsync` returns `IAsyncEnumerable<string>`. Consume it with `await foreach` to get real-time token output.
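The consumption loop can be sketched as a small helper. The class `ChatStreaming` and the method `StreamReplyAsync` are hypothetical names; the session and inference parameters are assumed to have been created in the earlier steps.

```csharp
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Hypothetical helper, not part of LLamaSharp.
public static class ChatStreaming
{
    public static async Task StreamReplyAsync(
        ChatSession session, string userInput, InferenceParams inferenceParams)
    {
        var message = new ChatHistory.Message(AuthorRole.User, userInput);

        // Each item is a piece of text (a partial or whole word),
        // yielded as soon as the executor produces it.
        await foreach (var text in session.ChatAsync(message, inferenceParams))
        {
            Console.Write(text);
        }
        Console.WriteLine();
    }
}
```

Writing to the console is only one option; the same loop could forward each piece to a web client or append it to a buffer.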

Sources: LLama.Examples/Examples/LLama3ChatSession.cs64-72

Data Flow During Inference

Diagram: Runtime call chain from ChatAsync to native decode

Sources: LLama.Examples/Examples/LLama3ChatSession.cs64-72 LLama.Examples/Examples/StatelessModeExecute.cs50-53

Complete Minimal Example

Combining all steps above (adapted from LLama3ChatSession.cs):
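The listing below is a condensed sketch rather than a verbatim copy of the example file. It assumes a hypothetical model path, builds the chat history inline instead of loading it from a JSON asset, and omits the console coloring and output transform used in the original. It also assumes a LLamaSharp version that exposes `model.Tokens.EndOfTurnToken`.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;
using LLama.Sampling;
using LLama.Transformers;

// Hypothetical path; point this at your own GGUF file.
var parameters = new ModelParams("models/example.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 0 // CPU only
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise, helpful assistant.");

var session = new ChatSession(executor, history);
session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));

// Model-specific end-of-turn string, with a generic fallback.
var endOfTurn = model.Tokens.EndOfTurnToken ?? "User:";

var inferenceParams = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.6f },
    MaxTokens = -1, // generate until an anti-prompt is encountered
    AntiPrompts = new List<string> { endOfTurn }
};

Console.Write("User> ");
var userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    Console.Write("Assistant> ");
    var message = new ChatHistory.Message(AuthorRole.User, userInput);

    // Print each streamed piece of text as it arrives.
    await foreach (var text in session.ChatAsync(message, inferenceParams))
    {
        Console.Write(text);
    }
    Console.WriteLine();

    Console.Write("User> ");
    userInput = Console.ReadLine() ?? "";
}
```

Type `exit` at the prompt to leave the loop; the `using` declarations then dispose the context and the weights.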

Sources: LLama.Examples/Examples/LLama3ChatSession.cs14-80

Key Type Reference

Type	Namespace	Role
`ModelParams`	`LLama.Common`	Model path + hardware config LLama.Examples/Examples/LLama3ChatSession.cs17
`LLamaWeights`	`LLama`	Loaded model weights LLama.Examples/Examples/LLama3ChatSession.cs22
`LLamaContext`	`LLama`	KV cache + inference state LLama.Examples/Examples/LLama3ChatSession.cs23
`InteractiveExecutor`	`LLama`	Stateful multi-turn executor LLama.Examples/Examples/LLama3ChatSession.cs24
`StatelessExecutor`	`LLama`	Fresh context per turn LLama.Examples/Examples/StatelessModeExecute.cs18
`ChatSession`	`LLama`	High-level chat API LLama.Examples/Examples/LLama3ChatSession.cs29
`InferenceParams`	`LLama.Common`	Generation settings LLama.Examples/Examples/LLama3ChatSession.cs40
`DefaultSamplingPipeline`	`LLama.Sampling`	Sampling logic (Temperature, etc.) LLama.Examples/Examples/LLama3ChatSession.cs42
`PromptTemplateTransformer`	`LLama.Transformers`	Formats history into model-specific prompt templates LLama.Examples/Examples/LLama3ChatSession.cs33

Sources: LLama.Examples/Examples/LLama3ChatSession.cs1-81 LLama.Examples/Examples/StatelessModeExecute.cs1-58


URL: https://deepwiki.com/SciSharp/LLamaSharp/1.3-quick-start-guide