cho-embedding-0.8b
A multimodal embedding model distilled from Qwen3-VL-Embedding-8B into the Qwen3.5-0.8B architecture. Supports text, images, and multimodal inputs.
Model Overview
- Model Type: Multimodal Embedding
- Base Architecture: Qwen3.5-0.8B (GatedDeltaNet)
- Teacher Model: Qwen3-VL-Embedding-8B
- Number of Parameters: 853M
- Embedding Dimension: 1024 (trained MRL at 1024, 256, 64)
- Context Length: 4096 tokens
- Training: Full fine-tune distillation (KL-divergence + contrastive) on 3.4M multimodal samples
Performance (MMEB-Image, 36 tasks)
| Model | Size | CLS | VQA | RET | GND | Overall |
|---|---|---|---|---|---|---|
| cho-embedding-0.8b | 853M | 54.5 | 59.2 | 60.1 | 81.9 | 60.7 |
| CAFe-0.5B | 894M | 59.1 | 49.1 | 61.0 | 83.0 | 59.6 |
| LLaVE-0.5B | 894M | 57.4 | 50.3 | 59.8 | 82.9 | 59.1 |
| VLM2Vec-V2.0-2B | 2.2B | 62.9 | 56.4 | 69.6 | 77.1 | 64.9 |
| VLM2Vec-V1-2B | 2.2B | 58.6 | 49.2 | 65.0 | 73.1 | 59.7 |
| VLM2Vec-Phi3.5V | 4.2B | 54.8 | 54.9 | 62.3 | 79.5 | 60.1 |
Usage
from transformers import AutoModel, AutoProcessor
import torch
import torch.nn.functional as F
model_path = "radi-cho/cho-embedding-0.8b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, dtype=torch.bfloat16).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
processor.tokenizer.padding_side = "right"
def embed(texts, instruction="Represent the user's input."):
conversations = []
for text in texts:
conversations.append([
{"role": "system", "content": [{"type": "text", "text": instruction}]},
{"role": "user", "content": [{"type": "text", "text": text}]}
])
formatted = processor.tokenizer.apply_chat_template(
conversations, add_generation_prompt=False, tokenize=False)
formatted = [t.rstrip() + "<|endoftext|>" for t in formatted]
inputs = processor.tokenizer(formatted, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model(**inputs)
last_hidden = outputs.last_hidden_state
attn = inputs["attention_mask"]
last_pos = attn.shape[1] - attn.flip(dims=[1]).argmax(dim=1) - 1
row_idx = torch.arange(last_hidden.shape[0], device="cuda")
embeddings = last_hidden[row_idx, last_pos, :1024]
embeddings = F.normalize(embeddings.float(), p=2, dim=-1)
return embeddings
# Example
queries = embed(["A dog playing on the beach"], instruction="Find a matching image caption.")
docs = embed(["A golden retriever runs along the shoreline at sunset"])
similarity = (queries @ docs.T).item()
print(f"Similarity: {similarity:.4f}")
With Images
from qwen_vl_utils.vision_process import process_vision_info
conversations = [[
{"role": "system", "content": [{"type": "text", "text": "Represent the image for retrieval."}]},
{"role": "user", "content": [{"type": "image", "image": "file:///path/to/image.jpg"}]}
]]
texts = processor.tokenizer.apply_chat_template(conversations, add_generation_prompt=False, tokenize=False)
texts = [t.rstrip() + "<|endoftext|>" for t in texts]
images, _, video_kwargs = process_vision_info(conversations, return_video_metadata=True, return_video_kwargs=True)
inputs = processor(text=texts, images=images, padding=True, return_tensors="pt", **video_kwargs).to("cuda")
with torch.no_grad():
outputs = model(**inputs)
# ... extract last token embedding as above
Training Details
- Method: Contrastive training on mined hard-negatives and knowledge distillation from Qwen3-VL-Embedding-8B
- Data: 3.4M samples (MMEB train diverse + original splits, MSMarco, AllNLI, Quora, VisRAG, private mined samples)
- Batch Size: 1024 effective (128/GPU x 8 GPUs)
- Hardware: 8x NVIDIA H100 80GB (6720 GPU hours spent on mining, training, and ablations)
- Training Epochs: 3
Citation
Model is released under Apache 2.0. Please cite this work if used in academic publications, preprints, etc.
@misc{choembedding,
title={cho-embedding-0.8b: Vision-Language Embeddings via Contrastive Hard-negatives Objective},
author={Cholakov, Radostin},
year={2026}
}
- Downloads last month
- 188
Safetensors
Model size
0.9B params
Tensor type
BF16
·
Datasets used to train radi-cho/cho-embedding-0.8b
Evaluation results
- Overall (36 tasks) on MMEB-Imageself-reported60.700
- CLS on MMEB-Imageself-reported54.500
- VQA on MMEB-Imageself-reported59.200
- RET on MMEB-Imageself-reported60.100
- GND on MMEB-Imageself-reported81.900
