URL: https://huggingface.co/mixedbread-ai/deepset-mxbai-embed-de-large-v1

The crispy sentence embedding family from Mixedbread.
Mixedbread x deepset

mixedbread-ai/deepset-mxbai-embed-de-large-v1

This model is a powerful open-source German/English embedding model developed by Mixedbread in collaboration with deepset. It's built upon intfloat/multilingual-e5-large and trained using the AnglE loss. Read more details in our blog post: https://www.mixedbread.ai/blog/deepset-mxbai-embed-de-large-v1

In a bread loaf:

  • State-of-the-art performance
  • Supports both binary quantization and Matryoshka Representation Learning (MRL)
  • Fine-tuned on 30+ million pairs of high-quality German data
  • Optimized for retrieval tasks
  • Supported languages: German and English
  • Requires a prompt prefix: "query: {query}" for queries and "passage: {doc}" for documents

Performance

On the NDCG@10 metric, our model achieves an average performance of 51.7, setting a new standard for open-source German embedding models:

Model                              Avg. Performance (NDCG@10)   Binary Support   MRL Support
deepset-mxbai-embed-de-large-v1    51.7
multilingual-e5-large              50.5
jina-embeddings-v2-base-de         50.0
Closed Source Models
Cohere Multilingual v3             52.4                                          -
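
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It is not the evaluation code behind the numbers above. It assumes the linear-gain variant of DCG and that `relevances` holds the graded relevance of every judged document in ranked order; the function names are hypothetical. The function returns a value between 0 and 1, whereas the scores in the table appear to be reported on a 0 to 100 scale.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked results (linear gain)."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# The only relevant document is ranked first, so the ranking is already ideal.
print(ndcg_at_k([1.0, 0.0, 0.0]))  # 1.0
```

A benchmark score is then typically the mean of this value over all queries in the dataset.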

In a case study with a legal data client, our model outperformed domain-specific alternatives:

Model                              Avg. Performance (MAP@10)
deepset-mxbai-embed-de-large-v1    90.25
voyage-law-2                       84.80

Binary Quantization and Matryoshka

Our model supports both binary quantization and Matryoshka Representation Learning (MRL), allowing for significant efficiency gains:

  • Binary quantization: retains 91.8% of performance while increasing storage efficiency by a factor of 32
  • MRL: a 25% reduction in vector size still retains 97.5% of model performance
  • MRL at 512 dimensions: over 93% of model performance remains while embedding size is cut in half
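
The arithmetic behind these figures can be sketched in plain Python. This is an illustration under stated assumptions, not the model's own tooling: the full embedding size of 1024 is inferred from the statement that 512 dimensions is half the size, the factor of 32 is taken to be the ratio between 32-bit floats and 1 bit per dimension, and the helper names `truncate_mrl`, `binarize`, and `hamming_distance` are hypothetical.

```python
import math


def truncate_mrl(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and re-normalize to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # avoid division by zero
    return [x / norm for x in head]


def binarize(embedding: list[float]) -> bytes:
    """Pack one bit per dimension (1 if the value is positive) into bytes."""
    packed = bytearray((len(embedding) + 7) // 8)
    for i, value in enumerate(embedding):
        if value > 0:
            packed[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(packed)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits; fewer differing bits means more similar vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


FULL_DIMS = 1024                 # assumption: inferred from "512 dimensions = half"
float32_bytes = FULL_DIMS * 4    # 4096 bytes per vector as 32-bit floats
binary_bytes = FULL_DIMS // 8    # 128 bytes per vector at 1 bit per dimension
print(float32_bytes // binary_bytes)  # 32
```

Under the same assumption, a 25% reduction in vector size corresponds to keeping the first 768 dimensions, and the two techniques can be combined by truncating first and binarizing afterwards.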

These optimizations can lead to substantial reductions in infrastructure costs for cloud computing and vector databases. Read more here.

Quickstart

To produce German sentence embeddings with our model, prefix every input with the required prompt: "query: {query}" for queries and "passage: {doc}" for documents.
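
A minimal sketch of one possible approach, assuming the sentence-transformers library is installed and that the model loads through it. The helper names `as_query`, `as_passage`, and `rank_passages` are hypothetical and only illustrate where the prompt prefixes go.

```python
MODEL_ID = "mixedbread-ai/deepset-mxbai-embed-de-large-v1"


def as_query(text: str) -> str:
    """Prefix a search query with the prompt the model expects."""
    return f"query: {text}"


def as_passage(text: str) -> str:
    """Prefix a document with the prompt the model expects."""
    return f"passage: {text}"


def rank_passages(query: str, docs: list[str]) -> list[tuple[float, str]]:
    """Embed the query and documents, then sort documents by cosine similarity."""
    # Imported lazily so the prompt helpers work without the library installed.
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    model = SentenceTransformer(MODEL_ID)  # downloads the weights on first use
    query_emb = model.encode([as_query(query)])
    doc_embs = model.encode([as_passage(doc) for doc in docs])
    scores = cos_sim(query_emb, doc_embs)[0].tolist()
    return sorted(zip(scores, docs), reverse=True)


# Example call (requires network access and the model weights):
# rank_passages(
#     "Was ist die Hauptstadt von Deutschland?",
#     ["Berlin ist die Hauptstadt von Deutschland.",
#      "Brot wird aus Mehl, Wasser und Hefe gebacken."],
# )
```

Keeping the prefixes in small helpers makes it harder to forget them, since the same text embedded without its prompt may produce noticeably worse retrieval results.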

Community

Join our Discord community or the Haystack community Discord to share your feedback and thoughts. We're here to help and always happy to discuss the exciting field of machine learning!

License

Apache 2.0

Citation

@online{germanemb2024mxbai,
 title={Open Source Gets DE-licious: Mixedbread x deepset German/English Embeddings},
 author={Sean Lee and Aamir Shakir and Darius Koenig and Julius Lipp},
 year={2024},
 url={https://www.mixedbread.ai/blog/deepset-mxbai-embed-de-large-v1},
}