Purpose: This document explains LLamaSharp's multimodal (MTMD) capabilities for processing images and audio alongside text prompts. Multimodal support enables vision-language models and audio-language models to accept non-text inputs through projection weights that map media encodings into the text model's embedding space.
For basic text generation, see Executors and Inference. For text embeddings, see Text Embeddings.
LLamaSharp's multimodal support wraps llama.cpp's mtmd-helper layer, enabling language models to process images and audio in addition to text. Multimodal inference requires a projection model (.gguf MMP file) containing vision/audio encoders, loaded alongside the text model.
Architecture Layers:
| Layer | Components | Purpose |
|---|---|---|
| High-Level API | MtmdWeights | Simplified interface for loading projection models and media |
| Executor Integration | InteractiveExecutor, BatchedExecutor | Multimodal prompt support in stateful and batched executors |
| Low-Level API | SafeMtmdModelHandle, SafeMtmdEmbed, SafeMtmdInputChunks | Direct control over tokenization and evaluation |
| Native Layer | mtmd_context*, mtmd_bitmap* | llama.cpp multimodal helpers via P/Invoke |
Sources: LLama/MtmdWeights.cs12-146 LLama/Native/SafeMtmdModelHandle.cs13-66
The MtmdWeights class provides the primary interface for multimodal functionality. It wraps SafeMtmdModelHandle and simplifies common operations. LLama/MtmdWeights.cs12-24
MtmdWeights.LoadFromFile() initializes a projection model bound to a text model. LLama/MtmdWeights.cs33-40
Key Properties and Methods:
| Member | Description |
|---|---|
| LoadFromFile | Synchronously loads the projection model LLama/MtmdWeights.cs33-40 |
| LoadFromFileAsync | Asynchronously loads the projection model, with cancellation support LLama/MtmdWeights.cs50-77 |
| LoadMedia(string) | Loads an image or audio clip from a file path LLama/MtmdWeights.cs82 |
| LoadMedia(ReadOnlySpan<byte>) | Loads an image or audio clip from a memory buffer LLama/MtmdWeights.cs87 |
| ClearMedia() | Clears the pending media queue LLama/MtmdWeights.cs92 |
| SupportsVision | Returns true if the model accepts images LLama/MtmdWeights.cs122 |
| SupportsAudio | Returns true if the model accepts audio LLama/MtmdWeights.cs127 |
| UsesNonCausalAttention | Indicates whether decoding uses non-causal attention LLama/MtmdWeights.cs132 |
| UsesMRope | Indicates whether the model uses M-RoPE (multimodal rotary position embedding) LLama/MtmdWeights.cs137 |
| SampleRate | Audio sample rate (Hz) expected by the model LLama/MtmdWeights.cs142 |
Sources: LLama/MtmdWeights.cs33-142
Stateful executors and BatchedExecutor support multimodal prompts when initialized with a MtmdWeights instance.
The ILLamaExecutor interface provides access to ClipModel and a list of Embeds. LLama/Abstractions/ILLamaExecutor.cs10-31
The InteractiveExecutor is constructed with a MtmdWeights instance. It manages the prompt lifecycle including media markers. LLama.Examples/Examples/MtmdInteractiveModeExecute.cs41
Media is loaded through LoadMedia and stored in the executor's Embeds list. LLama.Examples/Examples/MtmdInteractiveModeExecute.cs168-176
BatchedExecutor accepts MtmdWeights as an optional clipModel in its constructor. LLama.Examples/Examples/BatchedExecutorMtmd.cs35
Conversations within the batched executor use the Prompt method with explicit SafeMtmdEmbed arrays to manage text and media chunks. LLama.Examples/Examples/BatchedExecutorMtmd.cs62-67
The following diagram maps high-level multimodal concepts to specific code entities within LLamaSharp.
Multimodal System Mapping
Sources: LLama/MtmdWeights.cs12-24 LLama/Native/SafeMtmdModelHandle.cs13 LLama/Native/SafeMtmdEmbed.cs11 LLama/Native/SafeMtmdInputChunks.cs9 LLama.Examples/Examples/BatchedExecutorMtmd.cs17-35
SafeMtmdEmbed wraps a native mtmd_bitmap* resource. It supports loading from various sources: LLama/Native/SafeMtmdEmbed.cs11-30
- FromRgbBytes(uint nx, uint ny, ReadOnlySpan<byte> rgbData) LLama/Native/SafeMtmdEmbed.cs41-54
- FromAudioSamples(ReadOnlySpan<float> samples) LLama/Native/SafeMtmdEmbed.cs62-72
- FromMediaFile(SafeMtmdModelHandle mtmdContext, string path) LLama/Native/SafeMtmdEmbed.cs83-96
- FromMediaBuffer(SafeMtmdModelHandle mtmdContext, ReadOnlySpan<byte> data) LLama/Native/SafeMtmdEmbed.cs106-119

You can query an embedding for its properties:
| Property | Description |
|---|---|
| Nx | Width in pixels or audio sample count LLama/Native/SafeMtmdEmbed.cs125 |
| Ny | Height in pixels (usually 1 for audio) LLama/Native/SafeMtmdEmbed.cs130 |
| IsAudio | Whether the embedding contains audio data LLama/Native/SafeMtmdEmbed.cs135 |
| ByteCount | Total size of raw data LLama/Native/SafeMtmdEmbed.cs140 |
| Id | Optional identifier assigned to the embedding LLama/Native/SafeMtmdEmbed.cs145-149 |
Sources: LLama/Native/SafeMtmdEmbed.cs125-149
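FromRgbBytes and FromAudioSamples take raw buffers, so the caller is responsible for preparing them. The following is a minimal sketch of that preparation, assuming tightly packed 8-bit RGB (3 bytes per pixel) and 16-bit PCM input that must be converted to floating-point samples. MediaBuffers and its methods are hypothetical helpers for illustration and are not part of LLamaSharp.

```csharp
using System;

// Hypothetical helpers (not part of LLamaSharp) for preparing raw media buffers.
public static class MediaBuffers
{
    // Bytes occupied by a tightly packed 8-bit RGB image of nx * ny pixels.
    public static long ExpectedRgbByteCount(uint nx, uint ny) => (long)nx * ny * 3;

    // Throws if the buffer length does not match the declared dimensions.
    public static void EnsureRgbBuffer(uint nx, uint ny, ReadOnlySpan<byte> rgbData)
    {
        long expected = ExpectedRgbByteCount(nx, ny);
        if (rgbData.Length != expected)
        {
            throw new ArgumentException(
                $"Expected {expected} bytes for a {nx}x{ny} RGB image, got {rgbData.Length}.");
        }
    }

    // Converts signed 16-bit PCM samples to floats in the range [-1, 1).
    public static float[] Pcm16ToFloat(ReadOnlySpan<short> pcm)
    {
        var samples = new float[pcm.Length];
        for (int i = 0; i < pcm.Length; i++)
        {
            samples[i] = pcm[i] / 32768f;
        }
        return samples;
    }
}
```

Buffers checked or converted this way could then be passed to FromRgbBytes(nx, ny, rgbData) or FromAudioSamples(samples). Resampling audio to the rate reported by SampleRate is a separate step that this sketch does not cover.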
The multimodal pipeline involves tokenizing a prompt that contains media markers, generating chunks, and evaluating those chunks against the context.
Multimodal Data Flow
Sources: LLama/MtmdWeights.cs82-111 LLama/Native/SafeMtmdModelHandle.cs121-152
The Tokenize method converts text and pending media into a collection of SafeMtmdInputChunks. LLama/Native/SafeMtmdModelHandle.cs121-152
A SafeMtmdInputChunk represents a specific segment of the input. Each chunk has a type: Text, Image, or Audio. LLama/Native/SafeMtmdInputChunk.cs16-32
Sources: LLama/Native/SafeMtmdInputChunks.cs9-100 LLama/Native/SafeMtmdInputChunk.cs10-95
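To make the chunking step concrete, here is a self-contained sketch of how a prompt containing media markers could be divided into ordered text and media segments. It does not call LLamaSharp or llama.cpp: ChunkKind, PromptChunk, and MarkerSplitter are hypothetical stand-ins, and the marker string is whatever the caller passes in. The sketch also assumes that each marker consumes exactly one pending media item, so it rejects prompts whose marker count differs from the number of loaded items. Real tokenization additionally produces tokens and encoder inputs for each chunk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; these are not LLamaSharp types.
public enum ChunkKind { Text, Media }

public sealed record PromptChunk(ChunkKind Kind, string Text, int MediaIndex);

public static class MarkerSplitter
{
    // Splits the prompt at each marker, preserving order. Media chunks carry the
    // index of the pending media item they refer to; text chunks use -1.
    public static List<PromptChunk> Split(string prompt, string marker, int pendingMediaCount)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("Marker must be non-empty.", nameof(marker));

        var chunks = new List<PromptChunk>();
        int position = 0;
        int mediaIndex = 0;

        while (true)
        {
            int hit = prompt.IndexOf(marker, position, StringComparison.Ordinal);
            if (hit < 0) break;

            if (hit > position)
                chunks.Add(new PromptChunk(ChunkKind.Text, prompt.Substring(position, hit - position), -1));

            chunks.Add(new PromptChunk(ChunkKind.Media, marker, mediaIndex++));
            position = hit + marker.Length;
        }

        if (position < prompt.Length)
            chunks.Add(new PromptChunk(ChunkKind.Text, prompt.Substring(position), -1));

        // Assumption: one marker per pending media item.
        if (mediaIndex != pendingMediaCount)
            throw new InvalidOperationException(
                $"Prompt contains {mediaIndex} marker(s) but {pendingMediaCount} media item(s) are pending.");

        return chunks;
    }
}
```

For example, splitting "Describe <__media__> briefly" with the marker "<__media__>" (an assumed marker value) and one pending item yields a text chunk, a media chunk with index 0, and a trailing text chunk.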
Multimodal contexts are configured using the native mtmd_context_params structure, typically wrapped by higher-level parameter classes. LLama/Native/NativeApi.Mtmd.cs16-30
| Parameter | Description |
|---|---|
| use_gpu | Whether to request GPU acceleration for encoders. LLama/Native/NativeApi.Mtmd.cs18 |
| n_threads | Threads for preprocessing and tokenization. LLama/Native/NativeApi.Mtmd.cs20 |
| image_marker | Pointer to the string token used to represent images. LLama/Native/NativeApi.Mtmd.cs21 |
| media_marker | Pointer to the string token used to represent general media. LLama/Native/NativeApi.Mtmd.cs22 |
| image_min_tokens | Minimum tokens for dynamic resolution images. LLama/Native/NativeApi.Mtmd.cs25 |
| image_max_tokens | Maximum tokens for dynamic resolution images. LLama/Native/NativeApi.Mtmd.cs26 |
Sources: LLama/Native/NativeApi.Mtmd.cs16-30 LLama/Native/NativeApi.Mtmd.cs32-39
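Because image_marker and media_marker are pointers rather than managed strings, a managed wrapper has to copy the marker text into native memory before handing it over and release it afterwards. Below is a minimal sketch of that round trip using standard .NET interop calls, assuming the native side expects a null-terminated UTF-8 string. NativeMarker is a hypothetical helper, and LLamaSharp's own marshaling may differ.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of LLamaSharp): owns a native,
// null-terminated UTF-8 copy of a marker string.
public sealed class NativeMarker : IDisposable
{
    // Pointer suitable for a native struct field that holds a marker string.
    public IntPtr Pointer { get; private set; }

    public NativeMarker(string marker)
    {
        Pointer = Marshal.StringToCoTaskMemUTF8(marker);
    }

    // Reads the native copy back into a managed string (empty once disposed).
    public string Read() => Marshal.PtrToStringUTF8(Pointer) ?? string.Empty;

    // Frees the native allocation; safe to call more than once.
    public void Dispose()
    {
        if (Pointer != IntPtr.Zero)
        {
            Marshal.FreeCoTaskMem(Pointer);
            Pointer = IntPtr.Zero;
        }
    }
}
```

A wrapper built this way would typically keep the allocation alive at least until the native call that consumes the parameter struct has returned, then dispose it.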
Before the introduction of the MTMD (Multimodal) helper system, LLamaSharp provided LlavaWeights. While MtmdWeights is now the preferred way to handle multimodal models due to its broader support for audio and more robust tokenization pipeline, LlavaWeights remains available for legacy codebases.
Legacy Entity Mapping
Sources: LLama/Native/NativeApi.Mtmd.cs1-100 (P/Invoke context for multimodal transitions).