![]() |
VOOZH | about |
An embedding is a list of numbers (a vector) that represents the meaning of a piece of data. Most commonly, this is text, but embeddings can represent images, audio, and other data types depending on the model.
// Text -> embedding
"What is an embedding?" -> [0.00072939653, ..., -0.033257525]
The main idea is that semantically similar inputs produce vectors that are "close" to each other in vector space. That's what makes embeddings useful: instead of matching exact keywords, you can retrieve results that are related in meaning.
In practice, embeddings show up in semantic search ("find things like this"), recommendations ("users who liked X might like Y"), and retrieval-augmented generation (RAG), where you retrieve relevant context from a database and feed it into an LLM.
A vector database is a system that stores vectors and can efficiently retrieve the nearest vectors to a query vector. "Nearest" is defined using a similarity measure such as dot product, cosine similarity, or Euclidean distance.
MongoDB can act as your vector database without splitting your stack. You store embeddings inside the same documents as your operational fields, and then query them using MongoDB Vector Search. That means the "thing you search" (like a support ticket body) and the embedding for that thing can live together in one document, which simplifies your architecture.
Here's an example of a MongoDB document storing operational fields for a movie, plus an embedding stored as binary (BinData):
{
"_id": { "...": "..." },
"plot": "A young aristocrat must masquerade as a fop in order to maintain his secret identity of Zorro as he restores justice to early California.",
"title": "The Mark of Zorro",
"year": 1940,
"imdb": { "rating": 7.6, "votes": 7260, "id": 32762 },
"plot_embedding": {
"$binary": {
"base64": "JwD5iDi8...vLhGxbw=",
"subType": "09"
}
}
}
In this tutorial we'll start with the simplest storage format (an array), and later talk about compression options like BinData vectors for scale.
Semantic search retrieves results based on meaning, not just keyword overlap.
If someone searches for "ocean tragedy", a keyword search might miss "Titanic: the story of the 1912 sinkingβ¦" because the query words don't appear in the text. Semantic search should still retrieve Titanic, because the meaning is closely related.
Most semantic search pipelines follow the same pattern:
The important MongoDB-specific detail is that step 3 requires a vector search index on your embedding field.
If you want a structured way to keep learning, the MongoDB Vector Search Fundamentals skills badge is a good next step. It walks you through the core concepts with hands-on exercises, and you'll earn a credential you can share on LinkedIn.
To create embeddings, you use an embedding model. In this tutorial, we'll use Voyage AI's voyage-3-large model via LangChain4j. The model takes in text and returns a vector.
An embedding model is the component that converts input (like a string of text) into a numeric vector. Different models produce different vector sizes, called dimensions. That matters because MongoDB Vector Search needs you to specify the exact dimensions in your index definition.
If your model outputs 1024-dimensional vectors, then:
Choosing a model is mostly about tradeoffs: storage cost versus quality. A smaller embedding can be more storage-efficient. A larger embedding can capture more nuance, which can improve retrieval quality depending on your data.
The model also determines the maximum input size (tokens) and the operational cost of generating embeddings at scale. For RAG-style retrieval, you'll often care about how well the model ranks relevant results near the top of the list.
Here's a quick reference for the models mentioned in the MongoDB docs:
This tutorial expects two environment variables at runtime:
In a terminal session, you can set them like this:
export VOYAGE_AI_KEY="<api-key>"
export MONGODB_URI="<connection-string>"
In an IDE, you'd typically set these in the run configuration. In production, you'd manage them through your deployment environment, CI/CD, or a secrets manager.
In this section we'll build a small Java demo project with four classes that mirror the real workflow:
You can treat these classes as "scripts," but the separation of concerns is intentional. In business systems, ingestion, indexing, and query-time retrieval are usually separate flows that may run at different times.
Create a new maven project. In here, you'll include two core dependencies.
First is the MongoDB Java Sync Driver. This gives you MongoClient, collections, insertMany(), and the aggregation pipeline APIs you'll use for vector search queries.
Second is LangChain4j's Voyage AI module. This gives you an embedding model client that can call Voyage AI and return vectors for your text.
Add this to your pom.xml:
If you want to run each class from the command line using Maven, add the exec-maven-plugin. This isn't required if you only plan to run in your IDE, but it makes tutorial instructions and testing much smoother:
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.5.0</version>
</plugin>
</plugins>
</build>
You don't run this file directly. It's the shared piece that the ingestion and query code call into. The goal is to keep embedding logic in one place: the model choice, authentication, and conversion into BSON-friendly values.
It exposes two methods. One generates embeddings in batch for a list of strings, which is what you want for ingestion/backfills. The other generates a single embedding for query-time search.
Create EmbeddingProvider.java:
The conversion to BsonArray and BsonDouble is the key integration detail here. The embedding model gives you a list of numeric values; you need to store those in MongoDB as part of a BSON document. This format is easy to insert and works cleanly with the indexing and query steps in this tutorial.
Now we'll build the ingestion script. This is the part that mimics a real "embedding backfill": take some source text, generate embeddings for it, and insert documents that include both the raw text and its embedding.
Create CreateEmbeddings.java:
This script deliberately keeps the "business field" (text) and the embedding field together. While it is very convenient, it's also the pattern you want when you build real systems. It makes debugging and iteration easier, and it keeps your database model aligned with the actual meaning you're storing.
If you're running from the command line, make sure your environment variables are set, then run:
mvn -q compile exec:java -Dexec.mainClass="com.mongodb.CreateEmbeddings"
When it runs successfully, you should see output similar to:
If you run exec:java and Maven can't find your class, double-check your package name matches the folder structure under src/main/java.
If you're using Atlas, you can open sample_db.embeddings in the Atlas UI and confirm the documents contain both text and embedding.
At this point you've stored vectors, but you can't search them yet. MongoDB Vector Search requires a vector index on the embedding field. Once the index exists, query-time search becomes: embed the query string, then use $vectorSearch to find the closest vectors.
This script creates a vector search index named vector_index on the embedding field. The key configuration is numDimensions, which must exactly match the model output dimensions. For voyage-3-large, that's 1024.
Create CreateIndex.java:
This uses dotProduct similarity, matching the sample approach in the MongoDB docs. The polling loop is there because index creation is asynchronous. If you try to query immediately, you may hit a "not queryable" state. In practice, index build time is usually around a minute for small demos, but can vary depending on your environment.
mvn -q compile exec:java -Dexec.mainClass="com.mongodb.CreateIndex"
You should see output like:
Now we'll do the end-to-end search flow.
We'll embed a search phrase (for example, "ocean tragedy"), then run a $vectorSearch stage against our indexed embedding field. The query returns the closest matches, along with a search score.
Create VectorQuery.java:
A small but important detail here is conversion. EmbeddingProvider returns a BSON array. The $vectorSearch helper expects a plain Java List<Double>, so we walk the BSON values and extract the doubles. Once you have that list, the vector search stage can run.
mvn -q compile exec:java -Dexec.mainClass="com.mongodb.VectorQuery"
You should see results similar to:
The exact scores will vary by model and environment, but the semantic ordering is what you care about.
Our demo works on three documents. Real embedding pipelines work on hundreds of thousands or millions, and the operational constraints become very real.
Embedding generation costs time and compute, and you can hit memory bottlenecks if you try to do too much at once. The simplest safe pattern is batching: generate embeddings in chunks, insert/update in chunks, and measure performance as you scale.
If you're backfilling an existing collection, it's common to page through documents, generate embeddings for a batch, and write them back using bulk writes or batched updates.
Vector Search is strict about dimensions. If your index says numDimensions: 1024 but the stored vectors are 1536-dimensional, you'll get errors at index or query time.
A practical rule is: model dimensions = stored vector dimensions = index dimensions = query vector dimensions.
If you store a large number of vectors, the default array format can become expensive in disk and memory footprint. MongoDB supports compressing embeddings by converting float vectors to BSON BinData vectors (float32 subtype). Binary storage is more space efficient and can improve performance because less data needs to be loaded into the working set during queries, especially when returning larger result sets.
Driver support for BinData vectors includes Java Driver v5.3.1+ (as well as multiple other drivers). If you decide to adopt this, the natural place to implement it is in EmbeddingProvider, because that's where vectors are currently converted into BSON.
Not necessarily. If your app data already lives in MongoDB, storing embeddings alongside your existing documents can simplify architecture and keep operational + vector workloads in one place.
MongoDB needs to know the expected vector size to index and query correctly. numDimensions must exactly match what your embedding model produces.
MongoDB supports all three, and you choose it as part of the vector field's index definition (similarity). The "best" choice is the one your embedding model is intended to be compared with:
- Cosine similarity compares direction (angle) and is a common default for text embeddings.
- Dot product is closely related: if your vectors are normalized to unit length, dot product and cosine produce the same ranking (because cosine becomes dot product when norms are 1).
- Euclidean distance compares straight-line distance in the vector space; it can work well when the model/space is trained with L2 distance in mind.
The most defensible guidance for a tutorial is: use the metric recommended by the embedding provider or model docs, and if they don't specify, start with cosine or dotProduct and validate with a small evaluation set (because relevance quality is ultimately empirical).
For a demo and for learning, arrays are the simplest: they're readable in the UI, easy to debug, and work cleanly with the driver APIs.
For production scale, you should strongly consider storing vectors as binary float vectors (BinData) instead of "array of doubles". The MongoDB docs explicitly support indexing vector fields stored as binary data (and discuss quantization options), and the Java driver provides a BinaryVector helper specifically because it's more storage-efficient than a List<Double>. Smaller vectors mean a smaller working set and less I/O pressure, which can translate into better performance and lower cost as your dataset grows.
A reasonable "business" rule of thumb:
- Arrays for prototypes, small corpora, or when you're still iterating on your pipeline.
- BinData float vectors once you care about footprint, throughput, and predictable query latency.
Start small. Embed a handful of known texts, run a few test queries, and check whether the returned results make semantic sense. Once the workflow is correct, iterate on model choice, chunking strategy, and retrieval settings based on relevance and performance.