Last indexed: 18 May 2026 (ecd184)

Model Loading and LLamaWeights

This document explains the LLamaWeights class, which represents a large language model loaded into memory. LLamaWeights is the entry point for working with model files in GGUF format and provides methods for loading models, accessing model metadata, creating inference contexts, and tokenizing text.

Scope: This page covers model loading and the LLamaWeights API. For context creation and inference state management, see page 2.3. For low-level native interop details, see page 2.1. For safe handle resource management, see page 2.4. For tokenization details, see page 2.5.

Overview

LLamaWeights is the high-level managed wrapper around a loaded GGUF model file. It encapsulates a SafeLlamaModelHandle, which wraps the native llama_model* pointer from llama.cpp. The class provides:

Loading GGUF model files from disk (synchronously or asynchronously) LLama/LLamaWeights.cs67-138
Exposing model properties (size, parameter count, embedding dimensions) LLama/LLamaWeights.cs29-44
Providing access to model metadata and vocabulary LLama/LLamaWeights.cs49-54
Creating LLamaContext instances for inference LLama/LLamaWeights.cs152-155
Tokenizing text using the model's vocabulary LLama/LLamaWeights.cs165-168
Managing native resources via IDisposable LLama/LLamaWeights.cs141-144

LLamaWeights Architecture and Flow

Sources: LLama/LLamaWeights.cs17-60 LLama/Native/SafeLlamaModelHandle.cs15-121 LLama/Native/NativeApi.cs186-187

Loading Models

Synchronous Loading

The primary method for loading a model is LoadFromFile, which accepts an IModelParams object specifying the model path and configuration.
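
A minimal sketch of a synchronous load, assuming the LLamaSharp package and a native backend package are referenced by the project. The model path is a placeholder, and ModelParams (from LLama.Common) is assumed here as the IModelParams implementation; GpuLayerCount = 0 is an illustrative choice, not a recommendation.

```csharp
using System;
using LLama;
using LLama.Common;

// Placeholder path: point this at a real GGUF file.
var parameters = new ModelParams("models/example.gguf")
{
    // Assumption: keep every layer on the CPU; adjust for your hardware.
    GpuLayerCount = 0
};

// LoadFromFile blocks the calling thread until the model has been loaded.
using var weights = LLamaWeights.LoadFromFile(parameters);
Console.WriteLine($"Loaded model with {weights.ParameterCount} parameters");
```

Because the call blocks, this form suits console tools and startup code; interactive applications may prefer the asynchronous variant.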

The loading process:

Convert IModelParams to native LLamaModelParams structure via ToLlamaModelParams() LLama/LLamaWeights.cs69
Call SafeLlamaModelHandle.LoadFromFile() with model path and parameters LLama/Native/SafeLlamaModelHandle.cs136-151
Native llama_model_load_from_file() loads the GGUF file and allocates memory LLama/Native/SafeLlamaModelHandle.cs186
Wrap the returned SafeLlamaModelHandle in a LLamaWeights instance LLama/LLamaWeights.cs71
Private constructor reads and caches model metadata via ReadMetadata() LLama/LLamaWeights.cs59

Sources: LLama/LLamaWeights.cs56-72 LLama/Native/SafeLlamaModelHandle.cs136-151

Asynchronous Loading with Progress Reporting

For long-running model loads, LoadFromFileAsync provides cancellation support and progress reporting.
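
The following sketch shows one way to call LoadFromFileAsync with both a cancellation token and a progress reporter. It assumes the parameter order (model parameters, token, progress reporter); the model path and the two-minute timeout are placeholders chosen for illustration.

```csharp
using System;
using System.Threading;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/example.gguf"); // placeholder path

// Illustrative timeout: request cancellation if loading exceeds two minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

// The reporter receives values between 0.0 and 1.0 as loading advances.
var progress = new Progress<float>(p => Console.WriteLine($"Loading: {p:P0}"));

// Assumed argument order: parameters, cancellation token, progress reporter.
using var weights = await LLamaWeights.LoadFromFileAsync(parameters, cts.Token, progress);
Console.WriteLine("Model loaded");
```

Note that Progress<T> dispatches callbacks through the captured synchronization context, so in a console application the progress lines may arrive slightly out of step with the load itself.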

This method:

Runs the loading operation on a background thread via Task.Run LLama/LLamaWeights.cs114
Reports progress from 0.0 to 1.0 via IProgress<float> LLama/LLamaWeights.cs99
Respects cancellation tokens and rethrows a LoadWeightsFailedException as an OperationCanceledException when the failure was caused by a cancellation request LLama/LLamaWeights.cs128-129
Uses a progress_callback in native LLamaModelParams to poll for cancellation and update the progress object LLama/LLamaWeights.cs96-110
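
The callback-based cancellation described in the list above can be illustrated without the native library. The sketch below is not LLamaSharp's implementation: SimulatedLoad, LoadWithCancellation, and InlineProgress are hypothetical names, and the loader is a plain loop standing in for native code that aborts when its progress callback returns false.

```csharp
using System;
using System.Threading;

using var cts = new CancellationTokenSource();
var reporter = new InlineProgress(p =>
{
    Console.WriteLine($"progress {p}");
    if (p >= 0.5f) cts.Cancel(); // simulate the user cancelling halfway through
});

try
{
    LoadWithCancellation(cts.Token, reporter);
    Console.WriteLine("loaded");
}
catch (OperationCanceledException)
{
    Console.WriteLine("cancelled");
}

// Hypothetical stand-in for the native loader: it invokes the callback as
// loading advances and aborts as soon as the callback returns false.
static bool SimulatedLoad(Func<float, bool> progressCallback)
{
    for (var step = 0; step <= 4; step++)
    {
        if (!progressCallback(step / 4f))
            return false; // aborted by the callback
    }
    return true;
}

// Reports progress, polls the token inside the callback, and surfaces an
// abort caused by cancellation as OperationCanceledException.
static void LoadWithCancellation(CancellationToken token, IProgress<float> progress)
{
    var completed = SimulatedLoad(p =>
    {
        progress.Report(p);
        return !token.IsCancellationRequested; // false tells the loader to stop
    });

    if (!completed)
    {
        token.ThrowIfCancellationRequested();
        throw new InvalidOperationException("Load failed for a reason other than cancellation");
    }
}

// Synchronous IProgress<float> so the callback runs inline on the loading thread.
sealed class InlineProgress : IProgress<float>
{
    private readonly Action<float> _handler;
    public InlineProgress(Action<float> handler) => _handler = handler;
    public void Report(float value) => _handler(value);
}
```

Polling the token from inside the callback means cancellation is only observed when the loader next reports progress, which is the trade-off of this design: no extra thread is needed, but the response is not immediate.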

Asynchronous Loading Flow

Sources: LLama/LLamaWeights.cs83-138 LLama/Native/SafeLlamaModelHandle.cs136-151

Error Handling

Model loading can fail for several reasons:

Exception Type	Cause
`FileNotFoundException`	Model file does not exist (handled by `FileStream` check) LLama/Native/SafeLlamaModelHandle.cs142
`InvalidOperationException`	File is not readable LLama/Native/SafeLlamaModelHandle.cs144
`LoadWeightsFailedException`	Native loading returned an invalid handle LLama/Native/SafeLlamaModelHandle.cs148
`OperationCanceledException`	Loading cancelled via `CancellationToken` LLama/LLamaWeights.cs129
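
A sketch of handling the failure modes in the table above around a synchronous load. The model path is a placeholder, and LoadWeightsFailedException is assumed to live in the LLama.Exceptions namespace; the messages are illustrative only.

```csharp
using System;
using System.IO;
using LLama;
using LLama.Common;
using LLama.Exceptions;

try
{
    // Placeholder path: replace with a real GGUF file.
    using var weights = LLamaWeights.LoadFromFile(new ModelParams("models/example.gguf"));
    Console.WriteLine("Model loaded");
}
catch (FileNotFoundException ex)
{
    Console.Error.WriteLine($"Model file does not exist: {ex.Message}");
}
catch (InvalidOperationException ex)
{
    Console.Error.WriteLine($"Model file is not readable: {ex.Message}");
}
catch (LoadWeightsFailedException ex)
{
    // The file was opened, but the native loader returned an invalid handle.
    Console.Error.WriteLine($"Native loading failed: {ex.Message}");
}
```

When using the asynchronous loader, an additional catch block for OperationCanceledException would cover the cancellation row of the table.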

Sources: LLama/Native/SafeLlamaModelHandle.cs136-151 LLama/LLamaWeights.cs127-133

Model Properties

LLamaWeights exposes read-only properties that describe the loaded model:

Property	Type	Native API Called	Description
`NativeHandle`	`SafeLlamaModelHandle`	N/A	The underlying safe handle wrapping the native `llama_model*` LLama/LLamaWeights.cs24
`ContextSize`	`int`	`llama_model_n_ctx_train`	Training context size LLama/Native/SafeLlamaModelHandle.cs26
`SizeInBytes`	`ulong`	`llama_model_size`	Total size of model weights in bytes LLama/Native/SafeLlamaModelHandle.cs41
`ParameterCount`	`ulong`	`llama_model_n_params`	Number of model parameters LLama/Native/SafeLlamaModelHandle.cs46
`EmbeddingSize`	`int`	`llama_model_n_embd`	Dimension of embedding vectors LLama/Native/SafeLlamaModelHandle.cs36
`Vocab`	`Vocabulary`	`_vocab` field	Vocabulary access LLama/Native/SafeLlamaModelHandle.cs120
`Metadata`	`IReadOnlyDictionary`	`ReadMetadata()`	Cached key-value metadata LLama/LLamaWeights.cs54
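
As an illustration, the properties in the table above can be read directly after loading. The model path is a placeholder and the unit conversions are only for display.

```csharp
using System;
using LLama;
using LLama.Common;

using var weights = LLamaWeights.LoadFromFile(new ModelParams("models/example.gguf")); // placeholder path

Console.WriteLine($"Training context size: {weights.ContextSize} tokens");
Console.WriteLine($"Weights size:          {weights.SizeInBytes / (1024.0 * 1024 * 1024):F2} GiB");
Console.WriteLine($"Parameter count:       {weights.ParameterCount / 1e9:F2} billion");
Console.WriteLine($"Embedding size:        {weights.EmbeddingSize}");
```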

Property Delegation Chain

Sources: LLama/LLamaWeights.cs24-54 LLama/Native/SafeLlamaModelHandle.cs18-120

Model Metadata and Templates

The Metadata property provides access to all key-value pairs embedded in the GGUF file. Metadata is read once during construction via weights.ReadMetadata() (LLama/LLamaWeights.cs59).
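
A short sketch of reading the cached metadata, assuming the dictionary maps string keys to string values. The key general.architecture is a standard GGUF metadata key, but whether it is present depends on the model file; the path is a placeholder.

```csharp
using System;
using LLama;
using LLama.Common;

using var weights = LLamaWeights.LoadFromFile(new ModelParams("models/example.gguf")); // placeholder path

// Look up a single key without throwing if the file does not define it.
if (weights.Metadata.TryGetValue("general.architecture", out var architecture))
    Console.WriteLine($"Architecture: {architecture}");

// Enumerate everything that was read at construction time.
foreach (var pair in weights.Metadata)
    Console.WriteLine($"{pair.Key} = {pair.Value}");
```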

Chat Templates

LLamaWeights also provides access to model chat templates. These can be retrieved via the underlying SafeLlamaModelHandle using llama_model_chat_template (LLama/Native/SafeLlamaModelHandle.cs175).

Metadata and Template Flow

Sources: LLama/LLamaWeights.cs54-60 LLama/Native/SafeLlamaModelHandle.cs113 LLama/Native/SafeLlamaModelHandle.cs175

Creating Contexts

LLamaWeights creates LLamaContext instances, which manage inference state, through its CreateContext method.

When a context is created, it maintains a reference to the model weights. The SafeLLamaContextHandle increments the reference count of the SafeLlamaModelHandle to ensure the model isn't freed while contexts are active (LLama/Native/SafeLLamaContextHandle.cs117-119).
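
A minimal sketch of creating a context from loaded weights. It assumes ModelParams also serves as the context parameters object; the path and the context size of 2048 are placeholders.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/example.gguf") // placeholder path
{
    ContextSize = 2048 // illustrative value; must suit the model and available memory
};

using var weights = LLamaWeights.LoadFromFile(parameters);

// Several contexts can share one set of weights; each holds its own inference state.
using var context = weights.CreateContext(parameters);
Console.WriteLine($"Context created with {context.ContextSize} token slots");
```

Declared this way, the using statements dispose the context before the weights, which matches the ownership relationship described above.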

Sources: LLama/LLamaWeights.cs152-155 LLama/LLamaContext.cs84-98 LLama/Native/SafeLLamaContextHandle.cs109-122

Tokenization

LLamaWeights provides a high-level tokenization method, Tokenize. It delegates to the underlying SafeLlamaModelHandle.Tokenize, which interfaces with the native vocabulary (LLama/LLamaWeights.cs167).
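
A sketch of tokenizing a string with the loaded model's vocabulary. It assumes the method takes the text, a flag for prepending the beginning-of-sequence token, a flag for parsing special tokens, and a text encoding, in that order; the path and sample text are placeholders.

```csharp
using System;
using System.Text;
using LLama;
using LLama.Common;

using var weights = LLamaWeights.LoadFromFile(new ModelParams("models/example.gguf")); // placeholder path

// Assumed argument order: text, add BOS token, parse special tokens, encoding.
var tokens = weights.Tokenize("Hello, world!", true, false, Encoding.UTF8);

Console.WriteLine($"Token count: {tokens.Length}");
foreach (var token in tokens)
    Console.WriteLine(token);
```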

Sources: LLama/LLamaWeights.cs165-168 LLama/LLamaContext.cs107-110

Model Quantization

LLamaSharp supports quantizing model files to different formats (e.g., Q4_K_M, Q8_0) via the LLamaQuantizer class (LLama/LLamaQuantizer.cs10-11). This process uses LLamaModelQuantizeParams to configure the quantization operation (LLama/LLamaQuantizer.cs32-36).

Feature	Method	Source
Quantize File	`LLamaQuantizer.Quantize(...)`	LLama/LLamaQuantizer.cs23-43
Supported Types	`LLamaFtype` enum	LLama/Native/LLamaFtype.cs7-214
Params Struct	`LLamaModelQuantizeParams`	LLama/Native/LLamaModelQuantizeParams.cs10-113
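
A hedged sketch of quantizing a file. It assumes an overload of LLamaQuantizer.Quantize that takes the source path, the destination path, and the target type as a string, and that returns a bool indicating success; both file paths and the accepted type name "q4_k_m" are assumptions to verify against the LLamaFtype enum in your installed version.

```csharp
using System;
using LLama;

// Placeholder paths: the source is assumed to be an unquantized (e.g. F16) GGUF file.
const string sourcePath = "models/example-f16.gguf";
const string targetPath = "models/example-q4_k_m.gguf";

// Assumption: the string overload maps "q4_k_m" to the matching LLamaFtype value.
var succeeded = LLamaQuantizer.Quantize(sourcePath, targetPath, "q4_k_m");

Console.WriteLine(succeeded
    ? $"Wrote quantized model to {targetPath}"
    : "Quantization failed");
```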

Quantization Process

Sources: LLama/LLamaQuantizer.cs10-43 LLama/Native/LLamaModelQuantizeParams.cs9-113 LLama/Native/LLamaFtype.cs7-214 LLama/Native/NativeApi.Quantize.cs12-13

LoRA Adapters

LLamaSharp supports applying LoRA (Low-Rank Adaptation) adapters to models via the LoraAdapter class (LLama/Native/LoraAdapter.cs8-9). Adapters are loaded for a specific SafeLlamaModelHandle and can be manually freed or automatically cleaned up when the model is unloaded (LLama/Native/LoraAdapter.cs41-62).

Sources: LLama/Native/LoraAdapter.cs8-63

Resource Management

LLamaWeights implements IDisposable to release native resources.

The disposal chain:

LLamaWeights.Dispose() is called LLama/LLamaWeights.cs141
NativeHandle.Dispose() triggers the safe handle cleanup LLama/LLamaWeights.cs143
SafeLlamaModelHandle.ReleaseHandle() calls native llama_model_free(handle) LLama/Native/SafeLlamaModelHandle.cs123-127
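
The disposal chain above is normally triggered with a using statement. The sketch below assumes a placeholder model path and shows the scoped form, which calls Dispose when the block exits even if an exception is thrown.

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/example.gguf"); // placeholder path

using (var weights = LLamaWeights.LoadFromFile(parameters))
{
    Console.WriteLine($"Model occupies {weights.SizeInBytes} bytes");
} // Dispose runs here and releases the wrapper's reference to the native model.

Console.WriteLine("Weights disposed");
```

Because contexts hold their own reference to the model handle, the native model is only freed once the weights and every context created from them have been disposed.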

Sources: LLama/LLamaWeights.cs141-144 LLama/Native/SafeLlamaModelHandle.cs123-127

URL: https://deepwiki.com/SciSharp/LLamaSharp/2.2-model-loading-and-llamaweights