URL: https://www.geeksforgeeks.org/mongodb/scaling-vector-workloads-in-mongodb-atlas/

Scaling Vector Workloads in MongoDB Atlas

Last Updated : 2 Apr, 2026

The success of an AI application in production is determined not by the quality of a single embedding, but by the stability of latency under load, especially at high percentiles. In vector systems, the challenge is rarely "finding the nearest neighbor"; it is almost always about doing so thousands of times per second, predictably, concurrently, and cost-effectively.

In MongoDB Atlas, scaling vector workloads requires mastering three fundamental axes:

  • Memory Efficiency
  • Execution Isolation
  • Explicit Control of Parallelism and I/O

This text explores these axes through the real architecture of Atlas Vector Search, going beyond basic usage and treating the system as production infrastructure.

Internal Architecture: mongod, mongot, and the Real Cost of a Query

Atlas Vector Search is built on a clear separation of responsibilities between two processes.

  • mongod is the transactional process, based on WiredTiger, responsible for persistence, BSON document caching, and consistency guarantees.
  • mongot is the search process, based on Lucene, responsible for inverted indexes, HNSW graphs, and execution of search and vector search operators.

A typical vector query starts in mongot, which traverses the HNSW graph to identify semantically similar candidates. Next comes the "hydration" step, where mongot requests the full documents corresponding to the returned IDs from mongod.

This second step is often underestimated. Under high concurrency, the cost of random I/O to fetch large documents from the transactional layer becomes dominant, even when the vector search itself is fast. The separation of processes allows search and storage to scale independently, but it also makes explicit that every hop between mongot and mongod has a measurable cost.

At this point, indexing and data-return decisions become decisive.

I/O Optimization: storedSource and returnStoredSource

  • By default, search indexes do not store the original document. Instead, mongot returns IDs and scores, and mongod is queried to fetch the final fields. This model preserves strong consistency, but penalizes workloads where documents are large and response time matters.
  • The storedSource feature allows specific fields to be stored directly in the search index. When combined with returnStoredSource: true, mongot can answer the query without consulting mongod, completely eliminating I/O on the transactional layer.
  • In practice, this changes the latency profile of the query. Small and stable fields such as titles, slugs, identifiers, or display metadata can be returned directly from the index, which removes the hydration round trip and its random I/O from the critical path. The gain is largest when the full documents are large and concurrency is high.
  • This mechanism, however, requires discipline. With vector indexes, storedSource can still be used, but it has important constraints: you can store selected document fields, while vector fields themselves (the embedding array) aren’t eligible to be stored via storedSource, and broad "store everything" configurations are not supported. Correct usage of storedSource is surgical: few fields, high return value, low storage cost.
  • The result is a search layer that can respond independently, without crossing processes, which makes a significant difference under concurrent load.

Index Definition with Selective storedSource:

Json

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "embedding": {
        "type": "vector",
        "numDimensions": 1024,
        "similarity": "cosine"
      },
      "title": { "type": "string" }
    }
  },
  "storedSource": {
    "include": ["title"]
  }
}

Query using returnStoredSource:

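JavaScript

db.docs.aggregate([
  {
    $vectorSearch: {
      index: "vector_index", // placeholder index name
      path: "embedding",
      queryVector: [...],
      numCandidates: 100, // illustrative value; required for ANN queries
      limit: 10,
      returnStoredSource: true
    }
  }
])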

In this case, mongod does not participate in execution.

Index Design and Field Typing for Real Scale

  • Index design is where most vector systems silently fail.
  • Dynamic mappings appear convenient, but in production, they are almost always prohibitive. They index fields that never participate in search, inflating the index and increasing the cost of every query. In vector systems, where the index is already large by definition, this quickly translates into memory pressure and unstable latency.
  • Static mappings, on the other hand, force explicit decisions. Every indexed field must justify its existence. This control is what enables predictability at scale.
  • Within this context, correct field typing is critical. Fields used for operational filters inside $vectorSearch must be of type token, not string. The token type indexes the literal value without analyzers, allowing MongoDB Atlas to apply filters directly through the inverted index before any traversal of the HNSW graph.
  • String fields go through lexical analysis. This is desirable for text search, but introduces unnecessary overhead and degrades performance when used as vector filters. At scale, this difference completely changes the CPU profile of the query.
  • When filters are well typed and selective, MongoDB Atlas can restrict the set of HNSW nodes before the vector search even begins. This transforms the search from a global problem into a local one, semantically more coherent and computationally cheaper.

Correct Index for Vector Filters:

Json

{
  "fields": {
    "embedding": {
      "type": "vector",
      "numDimensions": 1024,
      "similarity": "cosine"
    },
    "tenantId": { "type": "token" },
    "status": { "type": "token" }
  }
}

Filter Applied before HNSW Traversal:

Json

filter: {
  tenantId: "org_123",
  status: "active"
}
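
To show where such a filter sits in a complete query, here is a minimal sketch of a helper that assembles the pipeline. The helper name `buildFilteredVectorSearch`, the index name `vector_index`, and the projected `title` field are assumptions made for illustration, not values taken from a real cluster. It only builds the pipeline object, so it runs without a database connection.

```javascript
// Hypothetical helper: assembles a $vectorSearch pipeline with a pre-filter.
// "vector_index" and the field names are placeholders for illustration only.
function buildFilteredVectorSearch({ queryVector, tenantId, status, limit = 10, numCandidates = 200 }) {
  if (!Array.isArray(queryVector) || queryVector.length === 0) {
    throw new Error("queryVector must be a non-empty array of numbers");
  }
  // ANN queries cannot request fewer candidates than results to return.
  if (numCandidates < limit) {
    throw new Error("numCandidates must be greater than or equal to limit");
  }
  return [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector,
        filter: { tenantId, status }, // equality filters on both fields
        numCandidates,
        limit,
      },
    },
    // Return only small display fields plus the similarity score.
    { $project: { _id: 1, title: 1, score: { $meta: "vectorSearchScore" } } },
  ];
}

const pipeline = buildFilteredVectorSearch({
  queryVector: [0.1, 0.2, 0.3], // toy vector; real ones match the index dimensions
  tenantId: "org_123",
  status: "active",
});
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could be passed to `db.docs.aggregate(pipeline)` in mongosh, assuming an index with matching field definitions exists.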

Nested Structures and Semantic Integrity

In more complex domains, documents often contain arrays of objects or nested structures. Treating this data as flat fields breaks semantics and produces incorrect results.

Using document and embeddedDocuments in the mapping preserves the relationship between correlated fields within the same object. This is especially important when embeddings are generated from document fragments or when filters must respect subdocument cohesion.

This decision is not only semantic; it directly impacts result quality and prevents incorrect filters from artificially expanding the search space.

Json

{
  "fields": {
    "chunks": {
      "type": "embeddedDocuments",
      "fields": {
        "embedding": {
          "type": "vector",
          "numDimensions": 768
        },
        "section": { "type": "token" }
      }
    }
  }
}

Memory Engineering: Scalar Quantization as a Requirement, not an Optimization

  • RAM is the primary limiting resource for vector workloads at scale.
  • A float32 vector consumes 4 bytes per dimension. A 1536-dimension embedding therefore occupies approximately 6 KB per document. At tens or hundreds of millions of vectors, this footprint quickly exceeds the available memory of search nodes, forcing parts of the HNSW graph out of RAM and into disk-backed storage.
  • Once the vector index no longer fits fully in memory, latency becomes unstable. Page faults and disk thrashing dominate query execution time, especially under concurrency, making p95 and p99 latencies unpredictable.
  • Scalar quantization (int8) addresses this constraint directly by reducing each dimension from 32 bits to 8 bits. This lowers memory consumption by approximately 75%, allowing substantially larger vector indexes to remain fully resident in RAM.
  • From a systems perspective, the key property is not the reduced precision of individual values, but the preservation of the relative geometry of the vector space. MongoDB Atlas maintains the topology of the HNSW graph after quantization, so nearest-neighbor relationships are largely preserved even though the underlying representation is compressed.
  • In practice, this means recall degradation is minimal for most workloads, while memory density improves dramatically. More vectors fit in cache, graph traversal remains memory-bound instead of I/O-bound, and latency becomes far more stable as the dataset grows.
  • At production scale, scalar quantization is not a micro-optimization. It is the mechanism that makes sustained low-latency vector search possible beyond a few million embeddings.

Json

{
  "embedding": {
    "type": "vector",
    "numDimensions": 1024,
    "quantization": "scalar"
  }
}
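
The memory arithmetic above can be checked with a short back-of-the-envelope calculation. This is a rough sketch: the corpus size of 100 million vectors is an assumed figure, and the estimate covers raw vector values only, ignoring HNSW graph links and any per-vector metadata, so real index sizes will be larger.

```javascript
const BYTES_PER_GIB = 1024 ** 3;

// Raw storage for the vector values alone (no graph or metadata overhead).
function vectorIndexBytes(numVectors, numDimensions, bytesPerDimension) {
  return numVectors * numDimensions * bytesPerDimension;
}

const numVectors = 100_000_000; // assumed corpus size for illustration
const numDimensions = 1536;

const float32Bytes = vectorIndexBytes(numVectors, numDimensions, 4); // 32 bits per dimension
const int8Bytes = vectorIndexBytes(numVectors, numDimensions, 1);    // 8 bits per dimension
const savings = 1 - int8Bytes / float32Bytes;

console.log(`per vector (float32): ${vectorIndexBytes(1, numDimensions, 4)} bytes`);
console.log(`float32 total: ${(float32Bytes / BYTES_PER_GIB).toFixed(1)} GiB`);
console.log(`int8 total:    ${(int8Bytes / BYTES_PER_GIB).toFixed(1)} GiB`);
console.log(`savings:       ${(savings * 100).toFixed(0)}%`);
```

Comparing the two totals against the RAM of a candidate search node gives a first indication of whether the index can stay resident in memory.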

Automatic Embeddings and Voyage AI: Consistency and Semantic Density

  • Atlas Vector Search supports automatic embedding generation as part of the data ingestion and update workflow. In this model, embeddings are created and refreshed natively by Atlas when documents are inserted or modified, based on configured text fields.
  • This approach removes the need for external embedding pipelines and, more importantly, guarantees model consistency. The same embedding model, configuration, and version are used for both stored vectors and query vectors, eliminating an entire class of silent failures caused by model drift or mismatched preprocessing.
  • Automatic embeddings are configured directly in the search index definition through the embeddingDefinition field. The index therefore declares both how vectors are stored and searched and which embedding model and configuration produce them, while Atlas creates and refreshes the embeddings automatically.
  • When combined with Voyage AI models, this workflow provides an additional advantage. Voyage AI models are designed for high semantic density and support techniques such as multi-representation learning, allowing embeddings with fewer dimensions to preserve meaningful semantic relationships.
  • The practical impact is systemic rather than local. Fewer dimensions lead to smaller vector indexes, lower memory consumption, faster graph traversal, and reduced operational cost, with little to no observable loss in recall for most semantic search and RAG workloads.
  • This separation of concerns is intentional in Atlas Vector Search. Embedding generation and model selection govern how vectors are produced at ingestion, while quantization, similarity metrics, and graph execution govern how they are stored and searched. Keeping these concerns independent allows the system to evolve without forcing disruptive reindexing or schema changes.

Parallelism: concurrent and numPartitions as Distinct Mechanisms

Atlas provides parallelism at different levels, and confusing these mechanisms leads to incorrect tuning.

The concurrent parameter allows a single query to execute in a multi-threaded manner, scanning index segments in parallel. It is useful when the bottleneck of a query is CPU, not global concurrency.

numPartitions, on the other hand, operates at the structural level of the index. It divides the index into independent partitions, allowing multiple queries to be processed in parallel more efficiently. Partition sizing is driven by the memory capacity of search nodes rather than arbitrary object counts, ensuring that each partition’s working set fits comfortably in RAM. During query execution, each partition is searched independently and partial results are merged at the end.

While concurrent reduces the latency of an individual query, numPartitions increases throughput and stability under concurrent load. They solve different problems and should be used together.

Json

{
  "numPartitions": 4
}

JavaScript

{
  $vectorSearch: {
    concurrent: true
  }
}
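
The "search each partition independently, then merge" step can be pictured with a small sketch. This is an illustration of the general top-k merge idea only, not Atlas's internal implementation; the function name `mergeTopK` and the sample hits are invented for the example.

```javascript
// Illustrative top-k merge: each partition returns its own ranked hits,
// and the coordinator keeps the k best across all partitions.
function mergeTopK(partials, k) {
  return partials
    .flat()                            // combine hits from every partition
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, k);
}

// Made-up partial results from three partitions.
const partitionResults = [
  [{ id: "a", score: 0.93 }, { id: "b", score: 0.71 }],
  [{ id: "c", score: 0.88 }],
  [{ id: "d", score: 0.95 }, { id: "e", score: 0.42 }],
];

const top3 = mergeTopK(partitionResults, 3);
console.log(top3.map((hit) => hit.id)); // [ 'd', 'a', 'c' ]
```

Because every partition contributes candidates to the merge, the work per query grows with the partition count, which is one reason partitions are sized by memory capacity rather than multiplied freely.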

Concurrency and Real Execution Limits

  • In production, the most important metric is not average latency, but behavior under concurrency.
  • MongoDB Atlas enforces limits on concurrent queries per search node. When these limits are reached, new queries are queued or experience degradation. Vector queries, because they consume more CPU and memory, reach these limits faster.
  • Designing for concurrency requires keeping queries lean, avoiding excessively complex pipelines, properly sizing search nodes, and monitoring mongot saturation metrics. Ignoring this aspect leads to systems that perform well in isolated tests but collapse under real traffic.
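
The gap between average latency and tail latency is easy to see on a small sample. The sketch below uses the nearest-rank percentile method on made-up latency values; in a real load test the samples would come from client-side timing of concurrent queries.

```javascript
// Nearest-rank percentile: smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up latencies in milliseconds: nine fast queries and one that queued.
const latenciesMs = [12, 14, 13, 15, 11, 16, 14, 13, 250, 12];

const mean = latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length;
console.log(`mean: ${mean} ms`);
console.log(`p50:  ${percentile(latenciesMs, 50)} ms`);
console.log(`p95:  ${percentile(latenciesMs, 95)} ms`);
console.log(`p99:  ${percentile(latenciesMs, 99)} ms`);
```

A single queued query barely moves the median but dominates p95 and p99, which is why tail percentiles measured under concurrent load are the numbers to track.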

Physical Scalability: Dedicated Search Nodes and Sharding

  • Dedicated search nodes fully isolate the search layer from the transactional layer. This eliminates RAM contention between WiredTiger and Lucene and allows asymmetric scaling of search capacity.
  • When the vector index, even quantized, exceeds the RAM or storage capacity of the largest available node, sharding becomes necessary. In this model, the HNSW graph is fragmented across shards, each with its own search nodes, ensuring high availability.
  • By default, Atlas Search executes queries in a scatter-gather fashion. Vector queries are evaluated across all shards, and partial results are collected and merged to produce the final ranked result set. This behavior ensures correctness and consistency when searching a globally distributed vector index.
  • Sharding by keys such as tenantId becomes a scalability strategy for managing data distribution and operational isolation, while the search execution model remains scatter-gather.
  • This approach maintains millisecond-level latency as long as each shard’s local index fits within memory constraints, even as the total cluster grows to billions or trillions of vectors.

Shard key:

{ tenantId: 1 }

Precision: ANN, ENN, and Fine-Tuning numCandidates

  • ANN based on HNSW is the production standard. Its search cost grows roughly logarithmically with the number of vectors, which is what keeps latency at the millisecond level.
  • ENN performs exhaustive search and scales linearly with the dataset. Its use is limited to calibration, validation, or very small datasets.
  • The numCandidates parameter controls the computational effort of ANN. Higher values increase recall but also increase cost. In systems with strong filtering and correct typing, this value can remain relatively low without perceptible quality loss.
  • Precision, in this context, is not a property of the algorithm alone, but of the system as a whole.

JavaScript

{
  $vectorSearch: {
    limit: 10,
    numCandidates: 200
  }
}
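
One way to use ENN for calibration is to treat its results as ground truth and measure how many of them the ANN query also returns. The sketch below assumes the two ID lists were collected by running the same query once with numCandidates (ANN) and once with `exact: true` (ENN); the index name, the toy query vector, and the document IDs are placeholders.

```javascript
// Fraction of the exact (ENN) results that the approximate (ANN) query also found.
function recallAtK(annIds, ennIds) {
  if (ennIds.length === 0) {
    throw new Error("ENN result list must not be empty");
  }
  const truth = new Set(ennIds);
  const hits = annIds.filter((id) => truth.has(id)).length;
  return hits / ennIds.length;
}

// Stage shapes for the two runs; "vector_index" is a placeholder name.
const annStage = {
  $vectorSearch: { index: "vector_index", path: "embedding", queryVector: [0.1, 0.2], numCandidates: 200, limit: 4 },
};
const ennStage = {
  $vectorSearch: { index: "vector_index", path: "embedding", queryVector: [0.1, 0.2], exact: true, limit: 4 },
};

// Made-up result IDs standing in for the output of each stage.
const annIds = ["d1", "d2", "d3", "d5"];
const ennIds = ["d1", "d2", "d3", "d4"];

console.log(`recall@4: ${recallAtK(annIds, ennIds)}`); // recall@4: 0.75
```

Repeating this over a representative set of queries while lowering numCandidates shows how far the value can drop before recall becomes unacceptable for the workload.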

Scaling vector workloads in MongoDB Atlas is a systems engineering problem, not merely a semantic search problem. It requires conscious decisions about memory, I/O, parallelism, concurrency, embedding automation, and continuous index evolution.

When these decisions are made holistically, Atlas becomes a highly efficient platform for production vector workloads, capable of sustaining high concurrency with predictable latency and controlled cost.

This is the point at which vector search stops being an experiment and becomes infrastructure.
