👁 Jina AI: Your Search Foundation, Supercharged!

The embedding set trained by Jina AI.

Jina CLIP v2: Multilingual Multimodal Embeddings for Texts and Images

This model is based on the paper jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images.

Intended Usage & Model Info

jina-clip-v2 is a general-purpose multilingual multimodal embedding model for text & images.

Multimodal embeddings enable searching and understanding data across different modalities through a coherent representation. They serve as the backbone of neural information retrieval and multimodal GenAI applications.

Built upon jina-clip-v1 and our recently released jina-embeddings-v3, jina-clip-v2 features several significant improvements:

Improved Performance: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model jina-embeddings-v3 (currently the best multilingual embeddings under 1B parameters on MTEB).
Multilingual Support: Using the same backbone as jina-embeddings-v3 for the text tower, jina-clip-v2 supports 89 languages for multilingual-image retrieval, showing up to 4% improvement compared to nllb-clip-large-siglip on multilingual image retrieval tasks.
Higher Image Resolution: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
Matryoshka Representations: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
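
As a rough illustration of the truncation described above, the sketch below keeps the leading components of a vector, rescales the result to unit length, and compares raw storage at 1024 and 64 dimensions. The helper names `truncate_and_normalize` and `storage_bytes` are hypothetical, the input is a toy vector rather than a real model output, and re-normalizing after truncation is an assumption about typical Matryoshka usage, not something this card specifies.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 1 <= dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]

def storage_bytes(num_vectors, dim, bytes_per_value=2):
    """Raw storage for `num_vectors` embeddings, assuming 16-bit values."""
    return num_vectors * dim * bytes_per_value

# Toy 1024-dimensional vector standing in for a real model output.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_and_normalize(full, 64)

print(len(small))                      # 64
print(storage_bytes(1_000_000, 1024))  # 2048000000
print(storage_bytes(1_000_000, 64))    # 128000000
```

Under these assumptions, going from 1024 to 64 dimensions cuts raw storage by a factor of 16; how much retrieval quality is retained at each size is covered by the technical report, not by this sketch.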

Measuring 0.9B parameters, jina-clip-v2 combines two powerful encoders:

the text encoder Jina-XLM-RoBERTa (the backbone of jina-embeddings-v3) and
the vision encoder EVA02-L14 (an efficient vision Transformer developed by BAAI).

FEATURE	TEXT ENCODER	IMAGE ENCODER
Base Model	Jina-XLM-RoBERTa	EVA02-L
Parameters	561M	304M
Input Specification	8,192 tokens (max)	512×512 pixels
Min Output Dimensions	64	64
Max Output Dimensions	1,024	1,024
Layers	24	24
Attention Mechanism	FlashAttention2	xFormers
Pooling Strategy	Mean pooling	CLS pooling
Additional Features	89 languages supported	Patch size 14x14

These encoders are jointly trained to create aligned representations of images and text.
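
Because both encoders map into one shared space, a text query can be compared directly against image vectors. The following is a minimal sketch of that idea using cosine similarity over hand-written 4-dimensional vectors; the file names, the vectors, and the `rank_images` helper are all hypothetical stand-ins, not outputs or APIs of jina-clip-v2.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings):
    """Return image ids sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, vec), image_id)
        for image_id, vec in image_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored]

# Hypothetical toy embeddings; real ones would have 64 to 1024 dimensions.
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0, 0.1],
    "forest.jpg": [0.1, 0.9, 0.1, 0.0],
    "city.jpg": [0.0, 0.1, 0.9, 0.2],
}
text_query = [0.8, 0.2, 0.1, 0.0]  # stands in for an embedded beach caption

ranking = rank_images(text_query, image_embeddings)
print(ranking[0])  # beach.jpg
```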

CLIP-like models have established themselves as the backbone for general-purpose multimodal applications. With jina-clip-v2, we're taking these capabilities to the next level, breaking down language barriers to deliver more accurate cross-modal understanding and retrieval. We're confident this release delivers on the promise of making multimodal search and retrieval both more powerful and more accessible to developers worldwide.

Training, Data, Parameters

Please refer to our technical report of jina-clip-v2 for the model and training details. The technical report of jina-clip-v1 describes the previous version.

Faster Inference: FA2, XFormers and bf16

In a CUDA-enabled PyTorch environment, the model loads in torch.bfloat16 precision by default. Installing FlashAttention and xFormers is highly recommended, so that the model can use their efficient attention implementations.
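
One way to check whether the two optional packages are present before loading the model is sketched below. It assumes the packages are importable as `flash_attn` and `xformers`; the `available_accelerators` helper is hypothetical and only reports whether each module can be found, not whether the model will actually use it.

```python
import importlib.util

def available_accelerators(module_names=("flash_attn", "xformers")):
    """Map each optional module name to True if it can be imported."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }

status = available_accelerators()
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```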

Usage
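
A minimal sketch of loading the model through the transformers library follows. It assumes the repository's remote code exposes `encode_text` and `encode_image` methods that accept a `truncate_dim` argument and take image URLs or PIL images; the wrapper `embed_with_jina_clip` is a hypothetical helper, and calling it downloads the model weights.

```python
def embed_with_jina_clip(texts, images, truncate_dim=512):
    """Embed texts and images with jina-clip-v2 (downloads weights on first call)."""
    # Imported lazily so defining this function does not require transformers.
    from transformers import AutoModel

    # trust_remote_code is required because the encode helpers live in the model repo.
    model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

    # Assumption: both helpers accept truncate_dim (64 to 1024 per the card).
    text_embeddings = model.encode_text(texts, truncate_dim=truncate_dim)
    image_embeddings = model.encode_image(images, truncate_dim=truncate_dim)
    return text_embeddings, image_embeddings

# Example call (not executed here; needs network access and the model's dependencies):
# text_emb, image_emb = embed_with_jina_clip(
#     ["A beautiful sunset over the beach"],
#     ["https://example.com/beach.jpg"],  # hypothetical image URL
# )
```

Passing `truncate_dim=None` is expected to return the full 1024-dimensional vectors, though that default behavior is likewise an assumption here.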

License

This model is licensed to download and run under CC BY-NC 4.0. It is available for commercial use via the Jina Embeddings API, AWS, Azure, and GCP. To download for commercial use, please contact us.

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v2 useful in your research, please cite the following paper:

@misc{koukounas2024jinaclipv2multilingualmultimodalembeddings,
 title={jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images}, 
 author={Andreas Koukounas and Georgios Mastrapas and Bo Wang and Mohammad Kalim Akram and Sedigheh Eslami and Michael Günther and Isabelle Mohr and Saba Sturua and Scott Martens and Nan Wang and Han Xiao},
 year={2024},
 eprint={2412.08802},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2412.08802}, 
}

URL: https://huggingface.co/jinaai/jina-clip-v2
