Qwen3-VL-Embedding-2B model trained on
This is a sentence-transformers model finetuned from tomaarsen/Qwen3-VL-Embedding-2B on the vdr-multilingual-train dataset. It maps sentences & paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: tomaarsen/Qwen3-VL-Embedding-2B
- Maximum Sequence Length: 262144 tokens
- Output Dimensionality: 2048 dimensions
- Similarity Function: Cosine Similarity
- Supported Modalities: Text, Image, Video, Message
- Training Dataset: vdr-multilingual-train
- Language: en
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'image': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'video': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'message': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'message_format': 'structured', 'processing_kwargs': {'chat_template': {'add_generation_prompt': True}}, 'unpad_inputs': False, 'architecture': 'Qwen3VLModel'})
  (1): Pooling({'embedding_dimension': 2048, 'pooling_mode': 'lasttoken', 'include_prompt': True})
  (2): Normalize({})
)
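To make the last two modules concrete, here is a minimal pure-Python sketch of last-token pooling followed by L2 normalization. It is an illustration only, not the library's implementation: the helper names `last_token_pool` and `l2_normalize` and the toy 3-dimensional embeddings are assumptions made for this example.

```python
import math

def last_token_pool(token_embeddings, attention_mask):
    """Return the embedding of the last non-padding token of one sequence."""
    # Last position where the mask is 1; works for left- or right-padded input.
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_embeddings[last_index]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy sequence: 4 positions, 3-dimensional token embeddings, last position is padding.
token_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 4.0, 0.0],
    [9.0, 9.0, 9.0],  # padding, ignored by the mask
]
attention_mask = [1, 1, 1, 0]

pooled = last_token_pool(token_embeddings, attention_mask)
embedding = l2_normalize(pooled)
print(embedding)  # [0.6, 0.8, 0.0]
```

Because the output is unit-length, cosine similarity between two such embeddings reduces to a dot product.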
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/qwen3-vl-2b-vdr")
# Run inference
queries = [
    'What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022?',
]
documents = [
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_0.jpg',
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_1.jpg',
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_2.jpg',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.5789, 0.0973, 0.0304]])
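The similarity scores above are cosine similarities, which can then be sorted to rank documents for a query. The following sketch reproduces that step with toy 2-dimensional vectors and no model download; the function name `cosine_similarity` and the vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for one query embedding and three document embeddings.
query = [0.6, 0.8]
documents = [[0.6, 0.8], [0.8, 0.6], [-0.8, 0.6]]

scores = [cosine_similarity(query, d) for d in documents]
# Indices of the documents, best match first.
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```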
Evaluation
Metrics
Information Retrieval
- Dataset: vdr-eval
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.9533 |
| cosine_accuracy@3 | 0.99 |
| cosine_accuracy@5 | 0.9933 |
| cosine_accuracy@10 | 0.9933 |
| cosine_precision@1 | 0.9533 |
| cosine_precision@3 | 0.33 |
| cosine_precision@5 | 0.1987 |
| cosine_precision@10 | 0.0993 |
| cosine_recall@1 | 0.9533 |
| cosine_recall@3 | 0.99 |
| cosine_recall@5 | 0.9933 |
| cosine_recall@10 | 0.9933 |
| cosine_ndcg@10 | 0.9764 |
| cosine_mrr@10 | 0.9707 |
| cosine_map@100 | 0.9709 |
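The table is consistent with each query having exactly one relevant image: accuracy@k equals recall@k, and precision@k equals accuracy@k divided by k (for example 0.99 / 3 = 0.33). The sketch below shows that relationship on made-up ranks. The function `retrieval_metrics` and the `ranks` list are hypothetical, and this is not the evaluator's own code.

```python
def retrieval_metrics(ranks, k):
    """Accuracy, precision, recall and MRR at k for single-positive retrieval.

    ranks: 1-based rank of the one relevant document for each query,
    or None when it was not retrieved at all.
    """
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / len(ranks)
    reciprocal = [1.0 / r if hit else 0.0 for r, hit in zip(ranks, hits)]
    return {
        "accuracy": accuracy,        # share of queries with a hit in the top k
        "precision": accuracy / k,   # at most one relevant item among k results
        "recall": accuracy,          # only one relevant item exists per query
        "mrr": sum(reciprocal) / len(ranks),
    }

# Made-up ranks for five queries; the last one misses the top 3.
ranks = [1, 1, 2, 1, 4]
metrics = retrieval_metrics(ranks, k=3)
print(metrics)
```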
Training Details
Training Dataset
vdr-multilingual-train
- Dataset: vdr-multilingual-train at 6b92b5c
- Size: 10,000 training samples
- Columns: query, image, and negative_0
- Approximate statistics based on the first 1000 samples:

| | query | image | negative_0 |
|---|---|---|---|
| type | string | image | image |
| details | min: 26 tokens, mean: 36.31 tokens, max: 62 tokens | min: 700x709 px, mean: 1416x1648 px, max: 2100x2064 px | min: 827x709 px, mean: 1438x1633 px, max: 2583x1897 px |
- Samples:

| query | image | negative_0 |
|---|---|---|
| What are the new anthropological perspectives on development as discussed by Quarles Van Ufford and Giri in 2003? | Image | Image |
| What are the three main positions anthropologists have taken in relation to development, as discussed by David Lewis? | Image | Image |
| Who are the three sisters known as the Fates in Greek mythology? | Image | Image |

- Loss: MatryoshkaLoss with these parameters: { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [ 2048, 1024, 512, 256, 128, 64 ], "matryoshka_weights": [ 1, 1, 1, 1, 1, 1 ], "n_dims_per_step": -1 }
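Training with these matryoshka_dims is meant to keep the leading 1024, 512, 256, 128, or 64 dimensions of an embedding usable on their own. A minimal sketch of truncating and re-normalizing an embedding follows; the helper `truncate_and_normalize` and the synthetic vector are assumptions for illustration, not output from the model.

```python
import math

MATRYOSHKA_DIMS = [2048, 1024, 512, 256, 128, 64]

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Deterministic stand-in for a 2048-dimensional embedding.
full_embedding = [math.sin(i + 1) for i in range(2048)]

truncated = {dim: truncate_and_normalize(full_embedding, dim) for dim in MATRYOSHKA_DIMS}
for dim, vector in truncated.items():
    print(dim, len(vector))
```

In Sentence Transformers the same effect is usually obtained by passing `truncate_dim` when constructing `SentenceTransformer`. Smaller dimensions trade some retrieval quality for lower storage and faster search, and this card only reports metrics at the full 2048 dimensions.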
Evaluation Dataset
vdr-multilingual-test
- Dataset: vdr-multilingual-test at 9e26ae1
- Size: 300 evaluation samples
- Columns: query and image
- Approximate statistics based on the first 300 samples:

| | query | image |
|---|---|---|
| type | string | image |
| details | min: 27 tokens, mean: 34.26 tokens, max: 65 tokens | min: 827x1125 px, mean: 1371x1709 px, max: 2045x2045 px |
- Samples:

| query | image |
|---|---|
| What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022? | Image |
| When should spinach be planted and harvested? | Image |
| How does the discharge of sewage into a river affect the concentration of dissolved oxygen? | Image |

- Loss: MatryoshkaLoss with these parameters: { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [ 2048, 1024, 512, 256, 128, 64 ], "matryoshka_weights": [ 1, 1, 1, 1, 1, 1 ], "n_dims_per_step": -1 }
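As a rough illustration of what this loss configuration computes, the sketch below sums an in-batch ranking loss over several truncated dimensions. It is a simplification under stated assumptions: it ignores the gradient caching that gives CachedMultipleNegativesRankingLoss its name, it omits the extra negative_0 column, the similarity scale of 20.0 is assumed, and all function names and toy vectors are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def in_batch_ranking_loss(queries, documents, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q, d)) for d in documents]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def matryoshka_loss(queries, documents, dims, weights):
    """Weighted sum of the ranking loss over truncated, re-normalized embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        q = [normalize(v[:dim]) for v in queries]
        d = [normalize(v[:dim]) for v in documents]
        loss += weight * in_batch_ranking_loss(q, d)
    return loss

# Toy 4-dimensional batch of two (query, document) pairs.
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
documents = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

aligned = matryoshka_loss(queries, documents, dims=[4, 2], weights=[1, 1])
shuffled = matryoshka_loss(queries, documents[::-1], dims=[4, 2], weights=[1, 1])
print(aligned < shuffled)  # True
```

Correctly paired batches give a lower loss than mismatched ones, which is the signal the training objective relies on at every listed dimension.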
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 64
- num_train_epochs: 1
- learning_rate: 2e-05
- warmup_steps: 0.1
- bf16: True
- eval_strategy: steps
- per_device_eval_batch_size: 64
- batch_sampler: no_duplicates
Training Logs
| Epoch | Step | Training Loss | Validation Loss | vdr-eval_cosine_ndcg@10 |
|---|---|---|---|---|
| -1 | -1 | - | - | 0.9790 |
| 0.0510 | 8 | 7.9663 | - | - |
| 0.1019 | 16 | 5.9054 | 4.6686 | 0.9826 |
| 0.1529 | 24 | 5.6008 | - | - |
| 0.2038 | 32 | 5.6521 | 4.5979 | 0.9810 |
| 0.2548 | 40 | 5.7503 | - | - |
| 0.3057 | 48 | 5.5388 | 4.6358 | 0.9802 |
| 0.3567 | 56 | 5.5883 | - | - |
| 0.4076 | 64 | 5.4430 | 4.6014 | 0.9812 |
| 0.4586 | 72 | 5.4762 | - | - |
| 0.5096 | 80 | 5.4937 | 4.6229 | 0.9785 |
| 0.5605 | 88 | 5.4991 | - | - |
| 0.6115 | 96 | 5.2465 | 4.5517 | 0.9781 |
| 0.6624 | 104 | 5.1596 | - | - |
| 0.7134 | 112 | 5.2998 | 4.6642 | 0.9777 |
| 0.7643 | 120 | 5.4130 | - | - |
| 0.8153 | 128 | 5.2071 | 4.5448 | 0.9781 |
| 0.8662 | 136 | 5.1424 | - | - |
| 0.9172 | 144 | 5.1973 | 4.6617 | 0.9764 |
| 0.9682 | 152 | 5.3651 | - | - |
| -1 | -1 | - | - | 0.9764 |
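The Epoch column in the table above can be cross-checked against the dataset size and batch size reported earlier. The short calculation below assumes a single device, no gradient accumulation, and that the final partial batch is kept; these are assumptions not stated in this card, although the resulting numbers match the log.

```python
import math

num_samples = 10_000   # training samples
batch_size = 64        # per_device_train_batch_size
num_epochs = 1

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

def epoch_at(step):
    """Fraction of the epoch completed at a given optimizer step."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)  # 157
print(epoch_at(16))     # 0.1019
print(epoch_at(152))    # 0.9682
```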
Environmental Impact
Carbon emissions were measured using CodeCarbon.
- Energy Consumed: 2.882 kWh
- Carbon Emitted: 0.771 kg of CO2
- Hours Used: 9.675 hours
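Two derived quantities follow directly from the figures above: the implied carbon intensity of the electricity and the average power draw over the tracked period. The snippet is plain arithmetic on the reported values; the variable names are illustrative.

```python
energy_kwh = 2.882   # Energy Consumed
carbon_kg = 0.771    # Carbon Emitted
hours = 9.675        # Hours Used

carbon_intensity = carbon_kg / energy_kwh    # kg of CO2 per kWh
average_power_w = energy_kwh / hours * 1000  # watts

print(round(carbon_intensity, 3))  # 0.268
print(round(average_power_w))      # 298
```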
Training Hardware
- On Cloud: No
- GPU Model: 1 x NVIDIA GeForce RTX 3090
- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM Size: 31.78 GB
Framework Versions
- Python: 3.11.6
- Sentence Transformers: 5.4.0.dev0
- Transformers: 5.3.0.dev0
- PyTorch: 2.10.0+cu128
- Accelerate: 1.13.0.dev0
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
CachedMultipleNegativesRankingLoss
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
