Last indexed: 18 May 2026 (ecd184)

Multimodal Support

Purpose: This document explains LLamaSharp's multimodal (MTMD) capabilities for processing images and audio alongside text prompts. Multimodal support enables vision-language models and audio-language models to accept non-text inputs through projection weights that map media encodings into the text model's embedding space.

For basic text generation, see Executors and Inference. For text embeddings, see Text Embeddings.

Overview

LLamaSharp's multimodal support wraps llama.cpp's mtmd-helper layer, enabling language models to process images and audio in addition to text. Multimodal inference requires:

A base text model (LLaMA or compatible GGUF model)
A multimodal projection model (a .gguf mmproj file) containing vision/audio encoders
Media files or buffers (images, audio) to embed
Text prompts with special markers indicating where media should be inserted

Architecture Layers:

Layer	Components	Purpose
High-Level API	`MtmdWeights`	Simplified interface for loading projection models and media
Executor Integration	`InteractiveExecutor`, `BatchedExecutor`	Multimodal prompt support in stateful and batched executors
Low-Level API	`SafeMtmdModelHandle`, `SafeMtmdEmbed`, `SafeMtmdInputChunks`	Direct control over tokenization and evaluation
Native Layer	`mtmd_context`, `mtmd_bitmap`	llama.cpp multimodal helpers via P/Invoke

Sources: LLama/MtmdWeights.cs12-146 LLama/Native/SafeMtmdModelHandle.cs13-66

High-Level API: MtmdWeights

The MtmdWeights class provides the primary interface for multimodal functionality. It wraps SafeMtmdModelHandle and simplifies common operations. LLama/MtmdWeights.cs12-24

Loading a Multimodal Model

MtmdWeights.LoadFromFile() initializes a projection model bound to a text model. LLama/MtmdWeights.cs33-40

Key Properties and Methods:

Member	Description
`LoadFromFile`	Synchronously load projection model LLama/MtmdWeights.cs33-40
`LoadFromFileAsync`	Asynchronously load with cancellation support LLama/MtmdWeights.cs50-77
`LoadMedia(string)`	Load image/audio from file path LLama/MtmdWeights.cs82
`LoadMedia(ReadOnlySpan<byte>)`	Load image/audio from memory buffer LLama/MtmdWeights.cs87
`ClearMedia()`	Clear pending media queue LLama/MtmdWeights.cs92
`SupportsVision`	Returns `true` if model accepts images LLama/MtmdWeights.cs122
`SupportsAudio`	Returns `true` if model accepts audio LLama/MtmdWeights.cs127
`UsesNonCausalAttention`	Indicates whether media embeddings must be decoded with non-causal attention LLama/MtmdWeights.cs132
`UsesMRope`	Indicates whether the model uses M-RoPE (multimodal rotary position embedding) LLama/MtmdWeights.cs137
`SampleRate`	Audio sample rate (Hz) expected by model LLama/MtmdWeights.cs142

Sources: LLama/MtmdWeights.cs33-142
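
The capability flags above are most useful as a gate before any media is queued. The following is a minimal Python sketch of that check, not LLamaSharp code: `Capabilities` and `check_media` are hypothetical names that only mirror the `SupportsVision`, `SupportsAudio`, and `SampleRate` members, and the rule that raw PCM must already be at the expected rate is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in for the capability flags of a loaded projector."""
    supports_vision: bool
    supports_audio: bool
    sample_rate: int  # Hz; only meaningful when supports_audio is True


def check_media(caps: Capabilities, kind: str,
                sample_rate: Optional[int] = None) -> None:
    """Raise ValueError if the projector cannot accept this kind of media."""
    if kind == "image":
        if not caps.supports_vision:
            raise ValueError("projector has no vision encoder")
    elif kind == "audio":
        if not caps.supports_audio:
            raise ValueError("projector has no audio encoder")
        # Assumption: raw PCM samples must already match the expected rate.
        if sample_rate is not None and sample_rate != caps.sample_rate:
            raise ValueError(
                f"resample audio from {sample_rate} Hz to {caps.sample_rate} Hz"
            )
    else:
        raise ValueError(f"unknown media kind: {kind!r}")
```

Failing early this way keeps an unsupported input from reaching the native encoder, where the error would be harder to interpret.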

Using Multimodal with Executors

Stateful executors and BatchedExecutor support multimodal prompts when initialized with an MtmdWeights instance.

Stateful Executor Integration

The ILLamaExecutor interface provides access to ClipModel and a list of Embeds. LLama/Abstractions/ILLamaExecutor.cs10-31

InteractiveExecutor: Can be constructed with an MtmdWeights instance. It manages the prompt lifecycle, including media markers. LLama.Examples/Examples/MtmdInteractiveModeExecute.cs41
Media Handling: Media is typically loaded via LoadMedia and stored in the executor's Embeds list. LLama.Examples/Examples/MtmdInteractiveModeExecute.cs168-176

BatchedExecutor Integration

BatchedExecutor accepts MtmdWeights as an optional clipModel in its constructor. LLama.Examples/Examples/BatchedExecutorMtmd.cs35

Conversations within the batched executor use the Prompt method with explicit SafeMtmdEmbed arrays to manage text and media chunks. LLama.Examples/Examples/BatchedExecutorMtmd.cs62-67

Multimodal Entity Mapping

The following diagram maps high-level multimodal concepts to specific code entities within LLamaSharp.

[Diagram: Multimodal System Mapping]

Sources: LLama/MtmdWeights.cs12-24 LLama/Native/SafeMtmdModelHandle.cs13 LLama/Native/SafeMtmdEmbed.cs11 LLama/Native/SafeMtmdInputChunks.cs9 LLama.Examples/Examples/BatchedExecutorMtmd.cs17-35

Media Loading and Embedding

SafeMtmdEmbed Handle

SafeMtmdEmbed wraps a native mtmd_bitmap* resource. It supports loading from various sources: LLama/Native/SafeMtmdEmbed.cs11-30

RGB Bytes: FromRgbBytes(uint nx, uint ny, ReadOnlySpan<byte> rgbData) LLama/Native/SafeMtmdEmbed.cs41-54
Audio Samples: FromAudioSamples(ReadOnlySpan<float> samples) LLama/Native/SafeMtmdEmbed.cs62-72
Media File: FromMediaFile(SafeMtmdModelHandle mtmdContext, string path) LLama/Native/SafeMtmdEmbed.cs83-96
Media Buffer: FromMediaBuffer(SafeMtmdModelHandle mtmdContext, ReadOnlySpan<byte> data) LLama/Native/SafeMtmdEmbed.cs106-119
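
The two raw-data entry points take buffers with a fixed layout, so it helps to see what a well-formed buffer looks like. The sketch below is plain Python, not LLamaSharp code. It assumes the conventions of llama.cpp's bitmap helpers: tightly packed RGB at 3 bytes per pixel (so `nx * ny * 3` bytes in total) and audio as 32-bit float PCM. The helper names and the normalization to [-1.0, 1.0) are illustrative assumptions.

```python
import array


def make_rgb_buffer(nx: int, ny: int, rgb=(0, 0, 0)) -> bytes:
    """Build a solid-colour image: 3 bytes per pixel, row-major, no padding."""
    return bytes(rgb) * (nx * ny)


def validate_rgb(nx: int, ny: int, data: bytes) -> None:
    """Check that a buffer has exactly nx * ny * 3 bytes (assumed layout)."""
    expected = nx * ny * 3
    if len(data) != expected:
        raise ValueError(
            f"expected {expected} bytes for {nx}x{ny} RGB, got {len(data)}"
        )


def pcm16_to_float(samples) -> array.array:
    """Convert signed 16-bit PCM samples to 32-bit floats in [-1.0, 1.0)."""
    return array.array("f", (s / 32768.0 for s in samples))
```

A size check like `validate_rgb` is cheap to run on the managed side before the span is handed to native code.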

Media Metadata

You can query an embedding for its properties:

Property	Description
`Nx`	Width in pixels or audio sample count LLama/Native/SafeMtmdEmbed.cs125
`Ny`	Height in pixels (usually 1 for audio) LLama/Native/SafeMtmdEmbed.cs130
`IsAudio`	Whether the embedding contains audio data LLama/Native/SafeMtmdEmbed.cs135
`ByteCount`	Total size of the raw data in bytes LLama/Native/SafeMtmdEmbed.cs140
`Id`	Optional identifier assigned to the embedding LLama/Native/SafeMtmdEmbed.cs145-149

Sources: LLama/Native/SafeMtmdEmbed.cs125-149

Multimodal Inference Flow

The multimodal pipeline involves tokenizing a prompt that contains media markers, generating chunks, and evaluating those chunks against the context.

[Diagram: Multimodal Data Flow]

Sources: LLama/MtmdWeights.cs82-111 LLama/Native/SafeMtmdModelHandle.cs121-152

Tokenization and Chunks

The Tokenize method converts text and pending media into a collection of SafeMtmdInputChunks. LLama/Native/SafeMtmdModelHandle.cs121-152
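
Conceptually, tokenization splits the prompt at each media marker and interleaves the pending media in order. The Python sketch below illustrates only that idea and is not the native implementation: the marker text `<__media__>` is assumed (the real marker comes from the context parameters), `split_prompt` is a hypothetical helper, and the rule that the marker count must equal the media count is stated here as an assumption.

```python
MARKER = "<__media__>"  # assumed marker text; configurable in the real API


def split_prompt(prompt: str, media: list, marker: str = MARKER) -> list:
    """Interleave text segments and media items in prompt order."""
    parts = prompt.split(marker)
    n_markers = len(parts) - 1
    if n_markers != len(media):
        raise ValueError(
            f"prompt has {n_markers} marker(s) but {len(media)} media item(s)"
        )
    chunks = []
    for i, text in enumerate(parts):
        if text:  # skip empty text between adjacent markers
            chunks.append(("text", text))
        if i < len(media):
            chunks.append(("media", media[i]))
    return chunks
```

For example, `split_prompt("Describe <__media__> briefly.", ["cat.png"])` yields a text chunk, a media chunk, and a trailing text chunk, in that order.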

A SafeMtmdInputChunk represents a specific segment of the input:

Type: Text, Image, or Audio. LLama/Native/SafeMtmdInputChunk.cs16-32
NTokens: Number of tokens in the chunk. LLama/Native/SafeMtmdInputChunk.cs78
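NPos: Number of positions the chunk advances in the context; this can differ from NTokens for media chunks when the model uses M-RoPE. LLama/Native/SafeMtmdInputChunk.cs88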

Sources: LLama/Native/SafeMtmdInputChunks.cs9-100 LLama/Native/SafeMtmdInputChunk.cs10-95
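
Because a chunk reports both a token count and a position count, the two totals have to be tracked separately when evaluating a chunk list. The Python sketch below shows that bookkeeping under stated assumptions: `Chunk` is a hypothetical record that mirrors only the `Type`, `NTokens`, and `NPos` members, and the numbers in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical mirror of one input chunk's size properties."""
    kind: str       # "text", "image", or "audio"
    n_tokens: int   # embeddings or tokens fed to the model
    n_pos: int      # positions the chunk advances in the sequence


def total_tokens(chunks) -> int:
    """Sum of tokens across all chunks."""
    return sum(c.n_tokens for c in chunks)


def next_position(n_past: int, chunks) -> int:
    """Position at which generation continues after evaluating all chunks."""
    return n_past + sum(c.n_pos for c in chunks)
```

With `[Chunk("text", 5, 5), Chunk("image", 256, 16), Chunk("text", 3, 3)]`, the token total is 264 while the next position from an empty context is 24, which is why advancing the position counter by the token count would be wrong in this case.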

Configuration: mtmd_context_params

Multimodal contexts are configured using the native mtmd_context_params structure, typically wrapped by higher-level parameter classes. LLama/Native/NativeApi.Mtmd.cs16-30

Parameter	Description
`use_gpu`	Whether to request GPU acceleration for encoders. LLama/Native/NativeApi.Mtmd.cs18
`n_threads`	Threads for preprocessing and tokenization. LLama/Native/NativeApi.Mtmd.cs20
`image_marker`	Pointer to the string token used to represent images. LLama/Native/NativeApi.Mtmd.cs21
`media_marker`	Pointer to the string token used to represent general media. LLama/Native/NativeApi.Mtmd.cs22
`image_min_tokens`	Minimum tokens for dynamic resolution images. LLama/Native/NativeApi.Mtmd.cs25
`image_max_tokens`	Maximum tokens for dynamic resolution images. LLama/Native/NativeApi.Mtmd.cs26

Sources: LLama/Native/NativeApi.Mtmd.cs16-30 LLama/Native/NativeApi.Mtmd.cs32-39

Legacy LLavaWeights API

Before the MTMD (multimodal) helper system was introduced, LLamaSharp provided LLavaWeights. MtmdWeights is now the preferred way to handle multimodal models because it also supports audio and has a more robust tokenization pipeline, but LLavaWeights remains available for legacy codebases.

[Diagram: Legacy Entity Mapping]

Sources: LLama/Native/NativeApi.Mtmd.cs1-100 (P/Invoke context for multimodal transitions).

URL: https://deepwiki.com/SciSharp/LLamaSharp/5.2-multimodal-support