URL: https://deepwiki.com/SciSharp/LLamaSharp/5.2-multimodal-support

Last indexed: 18 May 2026 (ecd184)

Multimodal Support

Purpose: This document explains LLamaSharp's multimodal (MTMD) capabilities for processing images and audio alongside text prompts. Multimodal support enables vision-language models and audio-language models to accept non-text inputs through projection weights that map media encodings into the text model's embedding space.

For basic text generation, see Executors and Inference. For text embeddings, see Text Embeddings.


Overview

LLamaSharp's multimodal support wraps llama.cpp's mtmd-helper layer, enabling language models to process images and audio in addition to text. Multimodal inference requires:

  1. A base text model (LLaMA or compatible GGUF model)
  2. A multimodal projection model (.gguf MMP file) containing vision/audio encoders
  3. Media files or buffers (images, audio) to embed
  4. Text prompts with special markers indicating where media should be inserted

Architecture Layers:

| Layer | Components | Purpose |
| --- | --- | --- |
| High-Level API | MtmdWeights | Simplified interface for loading projection models and media |
| Executor Integration | InteractiveExecutor, BatchedExecutor | Multimodal prompt support in stateful and batched executors |
| Low-Level API | SafeMtmdModelHandle, SafeMtmdEmbed, SafeMtmdInputChunks | Direct control over tokenization and evaluation |
| Native Layer | mtmd_context*, mtmd_bitmap* | llama.cpp multimodal helpers via P/Invoke |

Sources: LLama/MtmdWeights.cs:12-146, LLama/Native/SafeMtmdModelHandle.cs:13-66


High-Level API: MtmdWeights

The MtmdWeights class provides the primary interface for multimodal functionality. It wraps SafeMtmdModelHandle and simplifies common operations. (LLama/MtmdWeights.cs:12-24)

Loading a Multimodal Model

MtmdWeights.LoadFromFile() initializes a projection model bound to a text model. (LLama/MtmdWeights.cs:33-40)
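As a rough illustration of that call, the sketch below loads a text model and then a projection model bound to it. The file paths are placeholders. The `MtmdContextParams.Default()` factory, its namespace, and the three-argument shape of `LoadFromFile` are assumptions that this page does not confirm, so check them against LLama/MtmdWeights.cs before relying on them.

```csharp
using System;
using LLama;
using LLama.Common;
using LLama.Native; // ASSUMPTION: namespace of MtmdContextParams

// Placeholder paths: substitute your own GGUF files.
const string modelPath = "models/text-model.gguf";
const string mmprojPath = "models/mmproj.gguf";

// Load the base text model first; the projection model is bound to it.
var modelParams = new ModelParams(modelPath);
using var textModel = LLamaWeights.LoadFromFile(modelParams);

// ASSUMPTION: MtmdContextParams.Default() and this LoadFromFile overload
// are not shown on this page; verify them against LLama/MtmdWeights.cs.
var mtmdParams = MtmdContextParams.Default();
using var projector = MtmdWeights.LoadFromFile(mmprojPath, textModel, mtmdParams);

// Capability properties documented for MtmdWeights.
Console.WriteLine($"Vision: {projector.SupportsVision}, Audio: {projector.SupportsAudio}");
```

Checking SupportsVision and SupportsAudio right after loading is a cheap way to reject a media type the projection model cannot handle before any file is read.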


Key Properties and Methods:

| Member | Description | Source |
| --- | --- | --- |
| LoadFromFile | Synchronously loads the projection model | LLama/MtmdWeights.cs:33-40 |
| LoadFromFileAsync | Asynchronously loads the projection model, with cancellation support | LLama/MtmdWeights.cs:50-77 |
| LoadMedia(string) | Loads an image or audio clip from a file path | LLama/MtmdWeights.cs:82 |
| LoadMedia(ReadOnlySpan<byte>) | Loads an image or audio clip from a memory buffer | LLama/MtmdWeights.cs:87 |
| ClearMedia() | Clears the pending media queue | LLama/MtmdWeights.cs:92 |
| SupportsVision | True if the model accepts images | LLama/MtmdWeights.cs:122 |
| SupportsAudio | True if the model accepts audio | LLama/MtmdWeights.cs:127 |
| UsesNonCausalAttention | Indicates whether decoding uses non-causal attention | LLama/MtmdWeights.cs:132 |
| UsesMRope | Indicates whether the model uses M-RoPE (multimodal rotary position embedding) | LLama/MtmdWeights.cs:137 |
| SampleRate | Audio sample rate (Hz) expected by the model | LLama/MtmdWeights.cs:142 |

Sources: LLama/MtmdWeights.cs:33-142


Using Multimodal with Executors

Stateful executors and BatchedExecutor support multimodal prompts when initialized with an MtmdWeights instance.

Stateful Executor Integration

The ILLamaExecutor interface provides access to ClipModel and a list of Embeds. (LLama/Abstractions/ILLamaExecutor.cs:10-31)

BatchedExecutor Integration

BatchedExecutor accepts an MtmdWeights instance as the optional clipModel argument of its constructor. (LLama.Examples/Examples/BatchedExecutorMtmd.cs:35)

Conversations within the batched executor use the Prompt method with explicit SafeMtmdEmbed arrays to manage text and media chunks. (LLama.Examples/Examples/BatchedExecutorMtmd.cs:62-67)

Multimodal Entity Mapping

Diagram (not reproduced): Multimodal System Mapping, which maps high-level multimodal concepts to specific code entities within LLamaSharp.


Sources: LLama/MtmdWeights.cs:12-24, LLama/Native/SafeMtmdModelHandle.cs:13, LLama/Native/SafeMtmdEmbed.cs:11, LLama/Native/SafeMtmdInputChunks.cs:9, LLama.Examples/Examples/BatchedExecutorMtmd.cs:17-35


Media Loading and Embedding

SafeMtmdEmbed Handle

SafeMtmdEmbed wraps a native mtmd_bitmap* resource. It can be created from a file path or from an in-memory buffer, matching the two LoadMedia overloads on MtmdWeights. (LLama/Native/SafeMtmdEmbed.cs:11-30)

Media Metadata

You can query an embedding for its properties:

| Property | Description | Source |
| --- | --- | --- |
| Nx | Width in pixels, or sample count for audio | LLama/Native/SafeMtmdEmbed.cs:125 |
| Ny | Height in pixels (usually 1 for audio) | LLama/Native/SafeMtmdEmbed.cs:130 |
| IsAudio | Whether the embedding contains audio data | LLama/Native/SafeMtmdEmbed.cs:135 |
| ByteCount | Total size of the raw data | LLama/Native/SafeMtmdEmbed.cs:140 |
| Id | Optional identifier assigned to the embedding | LLama/Native/SafeMtmdEmbed.cs:145-149 |
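For audio, Nx (a sample count) can be combined with the SampleRate reported by MtmdWeights to estimate clip length. The helper below is a small worked example of that arithmetic; `MediaMath` and `AudioSeconds` are hypothetical names, and it assumes Nx counts samples of a single channel.

```csharp
using System;

public static class MediaMath
{
    // Duration in seconds of a clip holding `sampleCount` samples
    // recorded at `sampleRateHz` samples per second.
    public static double AudioSeconds(long sampleCount, int sampleRateHz)
    {
        // Reject a missing or invalid rate instead of dividing by it.
        if (sampleRateHz <= 0)
            throw new ArgumentOutOfRangeException(nameof(sampleRateHz));
        return (double)sampleCount / sampleRateHz;
    }
}
```

For instance, 48,000 samples at 16,000 Hz corresponds to 3 seconds of audio.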

Sources: LLama/Native/SafeMtmdEmbed.cs:125-149


Multimodal Inference Flow

The multimodal pipeline involves tokenizing a prompt that contains media markers, generating chunks, and evaluating those chunks against the context.

Diagram (not reproduced): Multimodal Data Flow.


Sources: LLama/MtmdWeights.cs:82-111, LLama/Native/SafeMtmdModelHandle.cs:121-152

Tokenization and Chunks

The Tokenize method converts text and pending media into a SafeMtmdInputChunks collection. (LLama/Native/SafeMtmdModelHandle.cs:121-152)

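A SafeMtmdInputChunk represents one segment of the input: either a run of text or a single media item (image or audio) placed where its marker appeared in the prompt.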
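To make the idea of marker-delimited segments concrete, here is a self-contained sketch that splits a prompt at each media marker and pairs markers with media items in order. It is a conceptual model only, not LLamaSharp's implementation: the real tokenizer also converts text into token IDs and preprocesses the media. `ChunkKind`, `PromptChunk`, and `PromptChunker` are hypothetical names, and the marker string is whatever the loaded configuration uses (the usage note assumes `<__media__>`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration types; these are not LLamaSharp classes.
public enum ChunkKind { Text, Media }

public readonly record struct PromptChunk(ChunkKind Kind, string Text, int MediaIndex);

public static class PromptChunker
{
    // Splits `prompt` at each `marker`; the n-th marker maps to the n-th media item.
    public static List<PromptChunk> Split(string prompt, string marker, int mediaCount)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("Marker must be non-empty.", nameof(marker));

        var chunks = new List<PromptChunk>();
        int mediaIndex = 0;
        int start = 0;
        while (true)
        {
            int hit = prompt.IndexOf(marker, start, StringComparison.Ordinal);
            int end = hit < 0 ? prompt.Length : hit;

            // Emit any text that precedes the marker (or the tail of the prompt).
            if (end > start)
                chunks.Add(new PromptChunk(ChunkKind.Text, prompt.Substring(start, end - start), -1));
            if (hit < 0)
                break;

            chunks.Add(new PromptChunk(ChunkKind.Media, marker, mediaIndex++));
            start = hit + marker.Length;
        }

        // A mismatch between markers and supplied media is treated as an error.
        if (mediaIndex != mediaCount)
            throw new InvalidOperationException(
                $"Prompt has {mediaIndex} marker(s) but {mediaCount} media item(s) were supplied.");
        return chunks;
    }
}
```

Calling `PromptChunker.Split("Describe <__media__> briefly", "<__media__>", 1)` yields three chunks: the text "Describe ", one media chunk with index 0, and the text " briefly".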

Sources: LLama/Native/SafeMtmdInputChunks.cs:9-100, LLama/Native/SafeMtmdInputChunk.cs:10-95


Configuration: mtmd_context_params

Multimodal contexts are configured using the native mtmd_context_params structure, typically wrapped by higher-level parameter classes. (LLama/Native/NativeApi.Mtmd.cs:16-30)

| Parameter | Description | Source |
| --- | --- | --- |
| use_gpu | Whether to request GPU acceleration for encoders | LLama/Native/NativeApi.Mtmd.cs:18 |
| n_threads | Threads for preprocessing and tokenization | LLama/Native/NativeApi.Mtmd.cs:20 |
| image_marker | Pointer to the string token used to represent images | LLama/Native/NativeApi.Mtmd.cs:21 |
| media_marker | Pointer to the string token used to represent general media | LLama/Native/NativeApi.Mtmd.cs:22 |
| image_min_tokens | Minimum tokens for dynamic resolution images | LLama/Native/NativeApi.Mtmd.cs:25 |
| image_max_tokens | Maximum tokens for dynamic resolution images | LLama/Native/NativeApi.Mtmd.cs:26 |

Sources: LLama/Native/NativeApi.Mtmd.cs:16-30, LLama/Native/NativeApi.Mtmd.cs:32-39


Legacy LLavaWeights API

Before the MTMD (multimodal) helper system was introduced, LLamaSharp provided LLavaWeights. MtmdWeights is now the preferred way to handle multimodal models because it also supports audio and has a more robust tokenization pipeline, but LLavaWeights remains available for legacy codebases.

Diagram (not reproduced): Legacy Entity Mapping.


Sources: LLama/Native/NativeApi.Mtmd.cs:1-100 (P/Invoke context for multimodal transitions)