
URL: https://huggingface.co/llamaindex/vdr-2b-multi-v1

llamaindex/vdr-2b-multi-v1 · Hugging Face


vdr-2b-multi-v1

๐Ÿ‘ Image

vdr-2b-multi-v1 is a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations, which makes it possible to search and query visually rich multilingual documents without OCR, data extraction pipelines, or chunking.

  • Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German: together they form a new large, open-source, multilingual training dataset of 500k high-quality samples.

  • Cross-lingual Retrieval: substantially better in real-world scenarios. For example, you can search German documents with Italian queries.

  • Matryoshka Representation Learning: you can reduce the vector size 3x and still keep 98% of the embedding quality.
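With Matryoshka-style embeddings, a shorter vector is normally obtained by keeping the leading dimensions of the full vector and re-normalizing. The sketch below assumes 1536-dimensional vectors reduced 3x to 512. The helper names `truncate_embedding` and `cosine` and the random stand-in vectors are illustrative only and are not part of the model's API.

```python
import math
import random


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Stand-in vectors (assumption); real embeddings would come from the model.
rng = random.Random(0)
query = [rng.gauss(0.0, 1.0) for _ in range(1536)]
page = [rng.gauss(0.0, 1.0) for _ in range(1536)]

# 1536 / 3 = 512 dimensions.
short_query = truncate_embedding(query, 512)
short_page = truncate_embedding(page, 512)

print(len(short_query))  # 512
print(cosine(short_query, short_page))
```

Query and page vectors must be truncated to the same length before they are compared.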

Usage

The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings using 768 image patches and a batch size of 16 even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) for different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.

Batch Size   GPU Memory (GB)
4            6.9
8            8.8
16           11.5
32           19.7
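The measurements above can be used to pick a batch size for a given card. This is a minimal sketch: the helper `largest_batch_size` is hypothetical, it only considers the four measured points, and it leaves no headroom for other processes on the GPU.

```python
# Memory footprint from the table: batch size -> GPU memory in GB.
MEMORY_GB = {4: 6.9, 8: 8.8, 16: 11.5, 32: 19.7}


def largest_batch_size(vram_gb, table=MEMORY_GB):
    """Return the largest measured batch size that fits, or None."""
    fitting = [bs for bs, gb in table.items() if gb <= vram_gb]
    return max(fitting) if fitting else None


print(largest_batch_size(16))  # 16 (a 16 GB card such as the T4)
print(largest_batch_size(24))  # 32
print(largest_batch_size(4))   # None
```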

You can generate embeddings with this model in several different ways.
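However the embeddings are produced, retrieval then comes down to a nearest-neighbour search over the page vectors. The sketch below uses tiny hand-written vectors in place of real model output. The page identifiers and the `top_k` helper are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query, pages, k=3):
    """Return (page_id, score) pairs for the k pages most similar to the query."""
    q = normalize(query)
    scored = [
        (page_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for page_id, vec in pages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 3-D page embeddings; real ones would be much longer.
pages = {
    "report_p1": [0.9, 0.1, 0.0],
    "report_p2": [0.1, 0.9, 0.1],
    "invoice_p1": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, pages, k=2)
print([page_id for page_id, _ in results])  # ['report_p1', 'report_p2']
```

For large collections a vector index would replace the linear scan, but the scoring rule stays the same.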

Training

The model is based on MrLight/dse-qwen2-2b-mrl-v1 and was trained on the new vdr-multilingual-train dataset, which consists of 500k high-quality, multilingual query-image pairs. It was trained for 1 epoch using the DSE approach, with a batch size of 128 and hard-mined negatives.
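The exact training objective is not stated here. Assuming a contrastive InfoNCE-style loss, which is common for this kind of retrieval training, the sketch below illustrates why hard-mined negatives carry more training signal than easy ones. The temperature value and the toy 2-D vectors are assumptions chosen for illustration.

```python
import math


def info_nce(query, positive, negatives, temperature=0.02):
    """InfoNCE loss for one query, given unit-normalized embeddings.

    `negatives` may hold both in-batch and hard-mined negative page vectors.
    The temperature is an assumed value, not taken from the model card.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy 2-D unit vectors standing in for query and page embeddings.
q = [1.0, 0.0]
pos = [0.8, 0.6]
hard_neg = [0.6, 0.8]   # close to the query, so hard to separate
easy_neg = [-1.0, 0.0]  # clearly unrelated

loss_easy = info_nce(q, pos, [easy_neg])
loss_hard = info_nce(q, pos, [hard_neg])
print(loss_easy < loss_hard)  # True
```

An easy negative contributes almost nothing to the loss, while a hard negative keeps it above zero and therefore keeps producing gradient.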

Results

๐Ÿ‘ Image

The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only and mixed page screenshots. The evaluation dataset is publicly available on HuggingFace.

All evaluations compute NDCG@5 scores using 1536-dimensional vectors and an image resolution that can be represented with at most 768 tokens.
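For reference, NDCG@5 can be computed as below. This is a minimal sketch using the linear-gain form of DCG, which matches the exponential-gain form when relevance is binary. It assumes the ranked list contains every relevant page for the query. The function returns a value between 0 and 1, and the scores reported here appear to be scaled to 0-100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Each list holds relevance grades in ranked order for one query
# (assumption: a single relevant page per query).
print(ndcg_at_k([1, 0, 0, 0, 0]))     # 1.0
print(ndcg_at_k([0, 1, 0, 0, 0]))     # about 0.631, i.e. 1 / log2(3)
print(ndcg_at_k([0, 0, 0, 0, 0, 1]))  # 0.0, relevant page ranked below 5
```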

Model                Avg   Italian (text)  Italian (visual)  Italian (mix)
dse-qwen2-2b-mrl-v1  95.1  95.1            94                96.2
vdr-2b-multi-v1      97.0  96.4            96.3              98.4
Avg gain             +2%

Model                Avg   French (text)   French (visual)   French (mix)
dse-qwen2-2b-mrl-v1  93.5  94.7            90.8              95.1
vdr-2b-multi-v1      95.6  95.6            93.3              97.9
Avg gain             +2.2%

Model                Avg   Spanish (text)  Spanish (visual)  Spanish (mix)
dse-qwen2-2b-mrl-v1  96.7  97.2            94.7              98.2
vdr-2b-multi-v1      98.1  98.3            96.9              99.1
Avg gain             +1.4%

Model                Avg   German (text)   German (visual)   German (mix)
dse-qwen2-2b-mrl-v1  93.0  93.4            90                95.5
vdr-2b-multi-v1      96.2  94.8            95.7              98.1
Avg gain             +3.4%

Model                Avg   English (text)  English (visual)  English (mix)
dse-qwen2-2b-mrl-v1  98.0  98.3            98.5              97.1
vdr-2b-multi-v1      98.1  97.9            99.1              97.3
Avg gain             +0.1%

Model                Avg   shiftproject  government  healthcare  energy  ai    docvqa  arxivqa  tatdqa  infovqa  tabfquad
dse-qwen2-2b-mrl-v1  83.6  79.8          95.7        96.9        92      98.2  56.3    85.2     53.9    87.5     90.3
vdr-2b-multi-v1      84.0  82.4          95.5        96.5        91.2    98.5  58.5    84.7     53.6    87.1     92.2
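The percentage shown with each language table is consistent with the relative improvement of the average score over the base model. That reading is inferred from the numbers rather than stated in the text, and the short check below reproduces it from the reported averages.

```python
# Average NDCG@5 per language, copied from the tables:
# (dse-qwen2-2b-mrl-v1, vdr-2b-multi-v1)
AVERAGES = {
    "Italian": (95.1, 97.0),
    "French": (93.5, 95.6),
    "Spanish": (96.7, 98.1),
    "German": (93.0, 96.2),
    "English": (98.0, 98.1),
}


def relative_gain(baseline, new):
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


for language, (baseline, new) in AVERAGES.items():
    print(f"{language}: +{relative_gain(baseline, new):.1f}%")
```

Rounded to one decimal place, this gives 2.0, 2.2, 1.4, 3.4 and 0.1 percent, which matches the reported gains.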
Model size: 2B params (Safetensors)
Tensor type: BF16
Base model: Qwen/Qwen2-VL-2B
