vdr-2b-multi-v1
vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations, which makes it possible to search and query visually rich multilingual documents without OCR, data extraction pipelines, chunking, or similar preprocessing.
Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German: together these form a new large, open-source, multilingual training dataset of 500k high-quality samples.
Cross-lingual Retrieval: substantially better in real-world scenarios. For example, you can search German documents with Italian queries.
Matryoshka Representation Learning: you can reduce the vector size 3x and still keep 98% of the embedding quality.
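As a rough illustration of the last two points, the sketch below ranks pages by cosine similarity and then repeats the ranking on truncated vectors. It is a minimal, standard-library-only example built on assumptions: the helper names (`l2_normalize`, `truncate_embedding`, `rank_pages`) are hypothetical, and the 6-dimensional toy vectors merely stand in for real embeddings; it does not call the model itself.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then re-normalize (Matryoshka-style)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])

def rank_pages(query_vec, page_vecs):
    """Return page indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(p))) for p in page_vecs]
    return sorted(range(len(page_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors (assumption): stand-ins for a query embedding and three page embeddings.
query = [0.9, 0.1, 0.0, 0.2, 0.1, 0.0]
pages = [
    [0.1, 0.9, 0.1, 0.0, 0.2, 0.1],  # unrelated page
    [0.8, 0.2, 0.1, 0.1, 0.0, 0.1],  # relevant page
    [0.4, 0.4, 0.4, 0.1, 0.1, 0.1],  # partially related page
]

full_ranking = rank_pages(query, pages)

# Shrink every vector 3x (6 -> 2 dimensions here) and rank again.
short_ranking = rank_pages(
    truncate_embedding(query, 2),
    [truncate_embedding(p, 2) for p in pages],
)
print(full_ranking, short_ranking)
```

In this toy case both rankings agree. With real embeddings, the same slice-and-renormalize step is what a 3x reduction amounts to, and the quality retained should be checked on your own data.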
Usage
The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. With 768 image patches and a batch size of 16, you can run inference and generate embeddings even on an inexpensive NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.
| Batch Size | GPU Memory (GB) |
|---|---|
| 4 | 6.9 |
| 8 | 8.8 |
| 16 | 11.5 |
| 32 | 19.7 |
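
The table can be turned into a small lookup for choosing a batch size. The following is only a sketch: `largest_batch_size` and its `headroom_gb` safety margin are hypothetical names and values, and real memory use will vary with image sizes and library versions.

```python
# Memory footprint (GB) per batch size, copied from the table above
# (HuggingFace Transformers, at most 768 image patches).
MEMORY_GB_BY_BATCH_SIZE = {4: 6.9, 8: 8.8, 16: 11.5, 32: 19.7}

def largest_batch_size(vram_gb, headroom_gb=1.0):
    """Return the largest tabulated batch size that fits, or None if none does.

    `headroom_gb` is an assumed safety margin, not a measured value.
    """
    budget = vram_gb - headroom_gb
    fitting = [b for b, gb in MEMORY_GB_BY_BATCH_SIZE.items() if gb <= budget]
    return max(fitting) if fitting else None

t4_batch = largest_batch_size(16.0)     # NVIDIA T4 has 16 GB of VRAM
large_batch = largest_batch_size(24.0)  # a hypothetical 24 GB card
tiny_batch = largest_batch_size(6.0)    # too small for any tabulated size
print(t4_batch, large_batch, tiny_batch)
```

For the 16 GB case this selects a batch size of 16, consistent with the T4 claim above.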
You can generate embeddings with this model in many different ways:
Training
The model is based on MrLight/dse-qwen2-2b-mrl-v1 and was trained on the new vdr-multilingual-train dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the DSE approach, with a batch size of 128 and hard-mined negatives.
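To give an intuition for why hard-mined negatives matter, here is a minimal sketch of an InfoNCE-style contrastive loss for a single query. This is an assumption about the general shape of the objective, not the actual training code: the function name, the temperature value, and the toy 2-dimensional vectors are all illustrative.

```python
import math

def info_nce_loss(query, positive, negatives, temperature=0.02):
    """Contrastive loss for one query: negative log-softmax score of the positive.

    Vectors are assumed to be L2-normalized, so dot products are cosine
    similarities. The temperature is illustrative, not the value used in training.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]

query = [1.0, 0.0]
positive = [0.8, 0.6]        # the matching page
easy_negative = [0.0, 1.0]   # clearly unrelated page
hard_negative = [0.96, 0.28] # unrelated page that looks very similar to the query

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
print(loss_easy, loss_hard)
```

The easy negative contributes almost nothing to the loss, while the hard negative produces a large one, which is the signal that pushes the model to separate near-duplicate pages.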
Results
The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only and mixed page screenshots. The evaluation dataset is publicly available on HuggingFace.
All evaluations compute NDCG@5 scores using 1536-dimensional vectors and an image resolution that can be represented with at most 768 tokens.
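For readers unfamiliar with the metric, NDCG@5 can be sketched as follows. This is a generic, minimal implementation with linear gain, not the evaluation script used for the numbers below; the function names are hypothetical, and each input list holds the relevance labels of the retrieved pages in ranked order.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(ranked_relevances, k) / ideal

# One relevant page per query, retrieved at different ranks.
hit_at_rank_1 = ndcg_at_k([1, 0, 0, 0, 0])
hit_at_rank_3 = ndcg_at_k([0, 0, 1, 0, 0])
hit_at_rank_6 = ndcg_at_k([0, 0, 0, 0, 0, 1])  # outside the top 5
print(hit_at_rank_1, hit_at_rank_3, hit_at_rank_6)
```

With a single relevant page per query, the score is 1.0 when that page is ranked first, drops to 0.5 at rank 3, and is 0 once the page falls outside the top 5. A benchmark score is the average of these per-query values.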
| | Avg | Italian (text) | Italian (visual) | Italian (mix) |
|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 95.1 | 95.1 | 94 | 96.2 |
| vdr-2b-multi-v1 | 97.0 | 96.4 | 96.3 | 98.4 |
| | +2% | | | |

| | Avg | French (text) | French (visual) | French (mix) |
|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 93.5 | 94.7 | 90.8 | 95.1 |
| vdr-2b-multi-v1 | 95.6 | 95.6 | 93.3 | 97.9 |
| | +2.2% | | | |

| | Avg | Spanish (text) | Spanish (visual) | Spanish (mix) |
|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 96.7 | 97.2 | 94.7 | 98.2 |
| vdr-2b-multi-v1 | 98.1 | 98.3 | 96.9 | 99.1 |
| | +1.4% | | | |

| | Avg | German (text) | German (visual) | German (mix) |
|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 93.0 | 93.4 | 90 | 95.5 |
| vdr-2b-multi-v1 | 96.2 | 94.8 | 95.7 | 98.1 |
| | +3.4% | | | |

| | Avg | English (text) | English (visual) | English (mix) |
|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 98.0 | 98.3 | 98.5 | 97.1 |
| vdr-2b-multi-v1 | 98.1 | 97.9 | 99.1 | 97.3 |
| | +0.1% | | | |

| | Avg | shiftproject | government | healthcare | energy | ai | docvqa | arxivqa | tatdqa | infovqa | tabfquad |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dse-qwen2-2b-mrl-v1 | 83.6 | 79.8 | 95.7 | 96.9 | 92 | 98.2 | 56.3 | 85.2 | 53.9 | 87.5 | 90.3 |
| vdr-2b-multi-v1 | 84.0 | 82.4 | 95.5 | 96.5 | 91.2 | 98.5 | 58.5 | 84.7 | 53.6 | 87.1 | 92.2 |

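
The percentage rows in the per-language tables appear to be relative improvements in the average score, rather than absolute point differences. The short check below recomputes them from the averages above; the helper name is hypothetical, and the ViDoRe figure is derived here for comparison, not quoted from the tables.

```python
# Average NDCG@5 scores copied from the tables above:
# (dse-qwen2-2b-mrl-v1 baseline, vdr-2b-multi-v1).
AVERAGES = {
    "Italian": (95.1, 97.0),
    "French": (93.5, 95.6),
    "Spanish": (96.7, 98.1),
    "German": (93.0, 96.2),
    "English": (98.0, 98.1),
    "ViDoRe": (83.6, 84.0),
}

def relative_gain_pct(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

gains = {
    name: round(relative_gain_pct(base, new), 1)
    for name, (base, new) in AVERAGES.items()
}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The German row shows the difference most clearly: the absolute gap is 3.2 points, but relative to the baseline of 93.0 it rounds to the +3.4% shown in the table.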
Model tree for llamaindex/vdr-2b-multi-v1
Base model
Qwen/Qwen2-VL-2B