vdr-2b-multi-v1

vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations. This lets you search and query visually rich multilingual documents without any OCR, data extraction pipelines, chunking, etc.

Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German: together these form a new, large, open-source multilingual training dataset of 500k high-quality samples.
Cross-lingual retrieval: substantially better in real-world scenarios. For example, you can search German documents with Italian queries.
Matryoshka Representation Learning: you can reduce the vector size 3x and still keep 98% of the embedding quality.
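
As a rough illustration of the Matryoshka point, here is a minimal sketch of shrinking an embedding by 3x (1536 to 512 dimensions). It assumes embeddings are plain lists of floats and that the leading dimensions are the ones to keep, which is the usual Matryoshka convention. The vectors are random stand-ins, not model output, and the helper names are hypothetical.

```python
import math
import random

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Random stand-ins for a 1536-dimensional query and page embedding.
rng = random.Random(0)
query = [rng.gauss(0.0, 1.0) for _ in range(1536)]
page = [rng.gauss(0.0, 1.0) for _ in range(1536)]

# 3x smaller vectors: 1536 -> 512 dimensions, renormalized after truncation.
small_query = truncate_and_normalize(query, 512)
small_page = truncate_and_normalize(page, 512)
score = cosine(small_query, small_page)
```

Renormalizing after truncation keeps cosine and dot-product scores comparable across vector sizes.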

Usage

The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings using 768 image patches and a batch size of 16, even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) for different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.

Batch Size	GPU Memory (GB)
4	6.9
8	8.8
16	11.5
32	19.7
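
The table can be turned into a quick sizing check. This is a minimal sketch, assuming a T4 with 16 GB of memory (a figure not stated above) and an arbitrary 1 GB of headroom. The helper name is hypothetical, and only the measured batch sizes are considered.

```python
# Memory footprint (GB) per batch size, copied from the table above
# (HuggingFace Transformers, at most 768 image patches).
MEMORY_GB_BY_BATCH_SIZE = {4: 6.9, 8: 8.8, 16: 11.5, 32: 19.7}

def largest_batch_that_fits(gpu_memory_gb, headroom_gb=1.0):
    """Return the largest measured batch size that fits, or None if none do."""
    budget = gpu_memory_gb - headroom_gb
    fitting = [
        batch for batch, used_gb in MEMORY_GB_BY_BATCH_SIZE.items()
        if used_gb <= budget
    ]
    return max(fitting) if fitting else None

# Assumption: an NVIDIA T4 exposes 16 GB of memory.
t4_batch = largest_batch_that_fits(16.0)
print(t4_batch)
```

With these numbers, a batch size of 16 fits on the assumed 16 GB card, while a batch size of 32 needs a larger GPU.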

You can generate embeddings with this model in many different ways:

Training

The model is based on MrLight/dse-qwen2-2b-mrl-v1 and was trained on the new vdr-multilingual-train dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the DSE approach, with a batch size of 128 and hard-mined negatives.
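
The exact training loss is not given here. To show why hard-mined negatives matter, the sketch below uses a generic InfoNCE-style contrastive loss for a single query, a common choice for this kind of bi-encoder training. The temperature value, the similarity scores and the function name are illustrative assumptions, not the training configuration.

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.02):
    """Contrastive loss for one query: -log softmax of the positive score.

    pos_score: similarity between the query and its matching page.
    neg_scores: similarities between the query and non-matching pages.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]

# Easy negatives score far below the positive; hard negatives score close to it.
easy_loss = info_nce_loss(0.9, [0.1, 0.2])
hard_loss = info_nce_loss(0.9, [0.85, 0.8])
print(easy_loss, hard_loss)
```

Hard negatives produce a much larger loss than easy ones, so they give the model a stronger training signal per sample.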

Results

The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only and mixed page screenshots. The evaluation dataset is publicly available on HuggingFace.

All evaluations are performed by calculating NDCG@5 scores using 1536-dimensional vectors and an image resolution that can be represented with at most 768 tokens.
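
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@5 for a single query. It assumes linear gain and a log2 rank discount; the benchmark's own implementation may differ in details. The function names are hypothetical. The scores in the tables below would correspond to this value averaged over queries and multiplied by 100.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant page per query, placed at different ranks in the result list.
at_rank_1 = ndcg_at_k([1, 0, 0, 0, 0])
at_rank_3 = ndcg_at_k([0, 0, 1, 0, 0])
at_rank_6 = ndcg_at_k([0, 0, 0, 0, 0, 1])  # outside the top 5
print(at_rank_1, at_rank_3, at_rank_6)
```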

	Avg	Italian (text)	Italian (visual)	Italian (mix)
dse-qwen2-2b-mrl-v1	95.1	95.1	94	96.2
vdr-2b-multi-v1	97.0	96.4	96.3	98.4
Relative gain	+2%

	Avg	French (text)	French (visual)	French (mix)
dse-qwen2-2b-mrl-v1	93.5	94.7	90.8	95.1
vdr-2b-multi-v1	95.6	95.6	93.3	97.9
Relative gain	+2.2%

	Avg	Spanish (text)	Spanish (visual)	Spanish (mix)
dse-qwen2-2b-mrl-v1	96.7	97.2	94.7	98.2
vdr-2b-multi-v1	98.1	98.3	96.9	99.1
Relative gain	+1.4%

	Avg	German (text)	German (visual)	German (mix)
dse-qwen2-2b-mrl-v1	93.0	93.4	90	95.5
vdr-2b-multi-v1	96.2	94.8	95.7	98.1
Relative gain	+3.4%

	Avg	English (text)	English (visual)	English (mix)
dse-qwen2-2b-mrl-v1	98.0	98.3	98.5	97.1
vdr-2b-multi-v1	98.1	97.9	99.1	97.3
Relative gain	+0.1%
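
The percentage reported with each language table is consistent with the relative improvement of the average score over the base model. A quick check, using the Avg values copied from the tables above (the helper name is hypothetical):

```python
# (base model Avg, vdr-2b-multi-v1 Avg) per language, copied from the tables above.
AVG_SCORES = {
    "Italian": (95.1, 97.0),
    "French": (93.5, 95.6),
    "Spanish": (96.7, 98.1),
    "German": (93.0, 96.2),
    "English": (98.0, 98.1),
}

def relative_gain_percent(base, new):
    """Relative improvement of `new` over `base`, in percent, to one decimal."""
    return round((new - base) / base * 100, 1)

gains = {
    lang: relative_gain_percent(base, new)
    for lang, (base, new) in AVG_SCORES.items()
}
print(gains)
```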

	Avg	shiftproject	government	healthcare	energy	ai	docvqa	arxivqa	tatdqa	infovqa	tabfquad
dse-qwen2-2b-mrl-v1	83.6	79.8	95.7	96.9	92	98.2	56.3	85.2	53.9	87.5	90.3
vdr-2b-multi-v1	84.0	82.4	95.5	96.5	91.2	98.5	58.5	84.7	53.6	87.1	92.2