URL: https://www.geeksforgeeks.org/mongodb/how-to-implement-data-masking-in-mongodb/

How to Implement Data Masking in MongoDB?

Last Updated : 2 Apr, 2026

Data masking lets you share, display, or process data without revealing the original values. Below are real-world scenarios where masking can be helpful:

  • A healthcare portal displays patient records to support staff without exposing full social security numbers or diagnosis codes.
  • An e-commerce platform logs order history for QA testing, excluding real customer addresses and payment details.
  • A ride-sharing app shares trip data with third-party auditors without revealing driver or passenger contact information.

Masking also helps you meet regulatory requirements, such as the General Data Protection Regulation (GDPR), Payment Card Industry Data Security Standard (PCI DSS), and Health Insurance Portability and Accountability Act (HIPAA). These regulations require organizations to limit the exposure of sensitive data.

MongoDB supports multiple masking approaches.

This tutorial shows you how to implement four of them using Python:

  • View-based masking
  • Aggregation pipeline masking
  • Static data masking
  • Tokenization

You can find all the code samples for this tutorial in the GitHub repository.

Prerequisites

Before you start, make sure you have the following ready:

  • Python 3.8 or later installed.
  • A MongoDB Atlas account with an M0 (Free tier) cluster set up.
  • pymongo and faker Python packages installed.
  • Basic familiarity with Python and MongoDB collections.
  • Your Atlas connection string, available in the Atlas UI under Database > Connect > Drivers.

Make sure your IP address is whitelisted in Atlas before you run any scripts. Go to Security > Database & Network Access > IP Access List in the Atlas UI and add your current IP address. Your scripts won't connect to your cluster without this step.

Set Up Your Project

Step 1: Create a folder for your project and navigate into it in your terminal:

mkdir mongodb-masking
cd mongodb-masking

Step 2: Create and activate a virtual environment to keep your project dependencies isolated:

python -m venv venv
source venv/bin/activate

Step 3: On Windows, activate the virtual environment with this command instead of `source venv/bin/activate`:

venv\Scripts\activate

Step 4: Now install the required packages:

python -m pip install pymongo faker

You'll create a separate Python file for each masking technique as you work through this tutorial. All files go in the `mongodb-masking` folder.

View-Based Masking

A MongoDB view is a read-only virtual collection. It runs an aggregation pipeline on a base collection and returns transformed results. The original data stays untouched in the base collection. You create the view once, then point your application or tools to it instead of the base collection.

View-based masking works best in scenarios where the same masked data gets queried repeatedly.

For example:

  • Dashboards and reporting tools that query the same collection repeatedly
  • Scenarios where you want a permanent, reusable masked access layer for sensitive data

Advantages of View-Based Masking

View-based masking has the following advantages:

  • One-time setup
  • No code changes needed per query
  • Consumers can be granted access to the view only, so they never need access to the base collection

Disadvantages of View-Based Masking

View-based masking has the following disadvantages:

  • The masking logic is fixed at view creation time
  • It's not suited for dynamic or role-based masking without creating multiple views

Steps to Apply View-Based Masking

In your `mongodb-masking` folder, create a new file named `view_masking.py`. Start by importing `MongoClient` from `pymongo` and connecting to your Atlas cluster:

Python
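A minimal sketch of the connection step. It assumes the connection string is stored in an environment variable named `MONGODB_URI` and that the database is called `masking_demo`; both names, and the helper names `get_connection_string()` and `get_database()`, are placeholders chosen for this example.

```python
import os


def get_connection_string() -> str:
    """Read the Atlas connection string from the environment.

    MONGODB_URI is an assumed variable name for this sketch.
    """
    uri = os.environ.get("MONGODB_URI")
    if not uri:
        raise RuntimeError("Set MONGODB_URI to your Atlas connection string")
    return uri


def get_database(db_name: str = "masking_demo"):
    """Connect to the cluster and return a database handle."""
    # Imported here so the helper above can be used without pymongo loaded.
    from pymongo import MongoClient

    client = MongoClient(get_connection_string())
    # Raises if the cluster is unreachable, for example when your IP
    # address is not on the Atlas access list.
    client.admin.command("ping")
    return client[db_name]
```

Calling `db = get_database()` at the top of `view_masking.py` gives you the handle that the later steps work with. Keeping the connection string out of the source file also avoids committing credentials by accident.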


Next, insert sample cardholder documents into a base collection. The `cards` collection holds the raw cardholder records. Each document includes a cardholder name, full card number, CVV, expiry date, and billing address:

Python
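One possible set of sample documents, sketched under assumptions. The names, last four card digits, CVVs, and expiry dates match the output shown in this tutorial; the leading card digits of the second record and both billing addresses are made-up values, and `seed_cards()` is a hypothetical helper name.

```python
# Sample records for the `cards` base collection (partly assumed values).
SAMPLE_CARDS = [
    {
        "cardholder_name": "Jane Doe",
        "card_number": "4111111111111111",
        "cvv": "123",
        "expiry_date": "12/26",
        "billing_address": "123 Main St, Springfield, IL 62701",
    },
    {
        "cardholder_name": "John Smith",
        "card_number": "5500005555555559",
        "cvv": "456",
        "expiry_date": "08/27",
        "billing_address": "456 Elm St, Columbus, OH 43004",
    },
]


def seed_cards(cards):
    """Reset the base collection and insert the sample records.

    `cards` is a pymongo Collection, for example db["cards"].
    Returns the number of inserted documents.
    """
    cards.delete_many({})  # start clean so reruns don't create duplicates
    # Insert copies so the _id fields pymongo adds don't end up in SAMPLE_CARDS.
    result = cards.insert_many([dict(card) for card in SAMPLE_CARDS])
    return len(result.inserted_ids)
```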


Create a view that masks sensitive fields. The view uses a `$project` stage to suppress `_id`, show only the last four digits of the card number, and fully redact the CVV and billing address:

Python
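A sketch of how the view could be defined. It passes `viewOn` and `pipeline` to pymongo's `create_collection()`, and it assumes every card number is a string of at least four characters. The helper name `create_masked_view()` is hypothetical.

```python
# $project stage that masks the sensitive fields.
MASK_PIPELINE = [
    {
        "$project": {
            "_id": 0,  # suppress the document id
            "cardholder_name": 1,
            "expiry_date": 1,
            # 12 asterisks followed by the last four characters of the number
            "card_number": {
                "$concat": [
                    "************",
                    {
                        "$substrCP": [
                            "$card_number",
                            {"$subtract": [{"$strLenCP": "$card_number"}, 4]},
                            4,
                        ]
                    },
                ]
            },
            # $literal keeps the server from treating the value as a flag
            "cvv": {"$literal": "REDACTED"},
            "billing_address": {"$literal": "REDACTED"},
        }
    }
]


def create_masked_view(db, view_name="cards_masked", source="cards"):
    """Create (or recreate) a read-only view that applies MASK_PIPELINE.

    `db` is a pymongo Database handle.
    """
    db.drop_collection(view_name)  # makes the script safe to rerun
    db.create_collection(view_name, viewOn=source, pipeline=MASK_PIPELINE)
    return db[view_name]
```

Dropping the view before creating it is a convenience for a tutorial script. In a shared environment you would more likely create the view once and leave it in place.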


Now, query the masked view by reading from `cards_masked` instead of the base `cards` collection. Any consumer who queries this view never sees the full card number, CVV, or billing address:

Python
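Reading from the view looks the same as reading from any collection. The sketch below uses two hypothetical helpers, `fetch_masked_cards()` and `print_masked_cards()`, and relies on the view suppressing `_id`, so every remaining value is a plain string that `json.dumps` can serialize.

```python
import json


def fetch_masked_cards(view):
    """Return every document from the masked view as a list of dicts.

    `view` is the pymongo Collection for `cards_masked`.
    """
    return list(view.find())


def print_masked_cards(view):
    """Pretty-print the masked documents as JSON."""
    print(json.dumps(fetch_masked_cards(view), indent=2))
```

With a database handle named `db`, the call would be `print_masked_cards(db["cards_masked"])`.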


Run the script:

python view_masking.py

You should get the following output:

JSON

[
  {
    "cardholder_name": "Jane Doe",
    "card_number": "************1111",
    "cvv": "REDACTED",
    "expiry_date": "12/26",
    "billing_address": "REDACTED"
  },
  {
    "cardholder_name": "John Smith",
    "card_number": "************5559",
    "cvv": "REDACTED",
    "expiry_date": "08/27",
    "billing_address": "REDACTED"
  }
]

Aggregation Pipeline Masking

Aggregation pipeline masking applies transformation logic at query time inside your Python code. No permanent view gets created. Each query runs a pipeline that masks fields before returning results. This gives you more control than a view because you can adjust the masking logic per query based on the user's role or context.

Aggregation pipeline masking works best when different users need different levels of access to the same data.

For example:

  • Applications where different roles need different levels of access to the same records
  • Scenarios where the masking logic needs to change dynamically per request

Advantages of Aggregation Pipeline Masking

Aggregation pipeline masking has the following advantages:

  • Masking logic can change per query based on user role or context
  • No separate database objects to manage

Disadvantages of Aggregation Pipeline Masking

Aggregation pipeline masking has the following disadvantages:

  • Masking logic lives in application code, which can become scattered if not managed carefully
  • It requires slightly more code to maintain compared to a view

Steps to Apply Aggregation Pipeline Masking

In your `mongodb-masking` folder, create a new file named `aggregation_masking.py`. Start by importing `MongoClient` from `pymongo` and connecting to your Atlas cluster, as you did in `view_masking.py`.


Next, define a reusable masking function. The `get_masked_cards()` function accepts a `user_role` parameter and passes it into the aggregation pipeline as a variable using `let`. The pipeline uses `$cond` to evaluate `$$role` inside MongoDB and decide which fields to show or redact. The card number and billing address are masked for all roles. The CVV and expiry date are conditionally shown or fully redacted depending on the role:

Python
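A sketch of what such a function might look like. The role name `fraud_analyst` is an assumption, and this version takes the collection as an extra first argument so it does not depend on a global connection. The `let` option of `aggregate()` requires MongoDB 5.0 or later and a pymongo release that supports it.

```python
# Role allowed to see the CVV and expiry date (assumed role name).
PRIVILEGED_ROLE = "fraud_analyst"


def build_role_pipeline():
    """Build a $project stage that reads the caller's role from $$role."""
    is_privileged = {"$eq": ["$$role", PRIVILEGED_ROLE]}
    last_four = {
        "$substrCP": [
            "$card_number",
            {"$subtract": [{"$strLenCP": "$card_number"}, 4]},
            4,
        ]
    }
    return [
        {
            "$project": {
                "_id": 0,
                "cardholder_name": 1,
                # Masked for every role
                "card_number": {"$concat": ["************", last_four]},
                "billing_address": {"$literal": "REDACTED"},
                # Shown only to the privileged role: [if, then, else]
                "cvv": {"$cond": [is_privileged, "$cvv", "REDACTED"]},
                "expiry_date": {
                    "$cond": [is_privileged, "$expiry_date", "REDACTED"]
                },
            }
        }
    ]


def get_masked_cards(cards, user_role):
    """Run the masking pipeline against `cards` for the given role.

    `cards` is a pymongo Collection; the role is passed to the server
    through `let` and evaluated there as $$role.
    """
    return list(cards.aggregate(build_role_pipeline(), let={"role": user_role}))
```

Because the role travels as a pipeline variable, the pipeline itself is identical for every caller, and any role other than the privileged one falls through to the redacted branch.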


Now, call the function once for each role. Support agents receive the more restricted result, while fraud analysts also see the CVV and expiry date to help them investigate suspicious transactions. The script prints the support-agent result first, then a `---` separator, then the fraud-analyst result.


Run the script:

python aggregation_masking.py

You should get the following output:

JSON

[
  {
    "cardholder_name": "Jane Doe",
    "card_number": "************1111",
    "cvv": "REDACTED",
    "expiry_date": "REDACTED",
    "billing_address": "REDACTED"
  },
  {
    "cardholder_name": "John Smith",
    "card_number": "************5559",
    "cvv": "REDACTED",
    "expiry_date": "REDACTED",
    "billing_address": "REDACTED"
  }
]

---

[
  {
    "cardholder_name": "Jane Doe",
    "card_number": "************1111",
    "cvv": "123",
    "expiry_date": "12/26",
    "billing_address": "REDACTED"
  },
  {
    "cardholder_name": "John Smith",
    "card_number": "************5559",
    "cvv": "456",
    "expiry_date": "08/27",
    "billing_address": "REDACTED"
  }
]

Static Data Masking

Static data masking creates a separate anonymized copy of your database. A Python script reads documents from your source collection, replaces sensitive fields with fake values, and writes the results to a target collection. The source data stays intact. You can run this script once or on a schedule to keep your development or test environment updated.

Static masking is the right choice when you need a safe, realistic dataset outside of production.

It's well suited for:

  • Seeding development and test environments with realistic data
  • Sharing sensitive datasets with third parties
  • QA testing with synthetic data

Advantages of Static Data Masking

Static masking has the following advantages:

  • The masked copy is completely separate from production data
  • Teams get a full, realistic dataset with no real personal information
  • No runtime overhead on production queries

Disadvantages of Static Data Masking

Static masking has the following disadvantages:

  • The masked copy can become outdated if not refreshed regularly
  • It doesn't support live or real-time access to masked data

Steps to Run a Static Masking Script

In your `mongodb-masking` folder, create a new file named `static_masking.py`. Start by importing `Faker` from `faker` and `MongoClient` from `pymongo`, then connect to your Atlas cluster as you did in the earlier scripts.


Next, seed the source collection with sample cardholder documents. Insert realistic-looking records into `production.cards` to simulate what a live collection would contain.


Now, write the masking and copy script. The `mask_document()` function replaces every sensitive field with a Faker-generated value. The `run_static_masking()` function drops the target collection, reads every document from the source, masks it, and inserts the result into `development.cards`:

Python
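A sketch of the two functions, written under assumptions: the parameter lists are choices made for this example, and the Faker instance is passed in so the functions carry no global state. The Faker methods used here (`name`, `credit_card_number`, `credit_card_security_code`, `credit_card_expire`, `address`) come from Faker's standard providers.

```python
def mask_document(doc, fake):
    """Return a copy of `doc` with every sensitive field replaced by fake data.

    `fake` is a faker.Faker instance (or any object with the same methods).
    """
    masked = dict(doc)
    masked.pop("_id", None)  # let the target collection assign a new id
    masked["cardholder_name"] = fake.name()
    masked["card_number"] = fake.credit_card_number()
    masked["cvv"] = fake.credit_card_security_code()
    masked["expiry_date"] = fake.credit_card_expire()  # "MM/YY" by default
    # Faker addresses contain a newline; flatten them to one line
    masked["billing_address"] = fake.address().replace("\n", ", ")
    return masked


def run_static_masking(source, target, fake):
    """Rebuild `target` as a fully masked copy of `source`.

    Both collections are pymongo Collections, for example
    client["production"]["cards"] and client["development"]["cards"].
    Returns the number of documents written.
    """
    target.drop()  # discard the previous masked copy
    masked_docs = [mask_document(doc, fake) for doc in source.find()]
    if masked_docs:  # insert_many rejects an empty list
        target.insert_many(masked_docs)
    return len(masked_docs)
```

A call might look like `run_static_masking(client["production"]["cards"], client["development"]["cards"], Faker())`. Dropping the `_id` means the masked copy cannot be joined back to the source by document id, which is usually what you want for a development dataset.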


Finally, verify the output by retrieving one document from the target collection and printing it to confirm the masking worked.


Run the script:

python static_masking.py

Because Faker generates random values, the output differs from the data you inserted.

For example:

JSON

{
  "cardholder_name": "Patricia Moore",
  "card_number": "4716823519064731",
  "cvv": "901",
  "expiry_date": "03/28",
  "billing_address": "245 Oak Lane, Springfield, OH 45501"
}

Tokenization

Tokenization replaces a sensitive value with a random token. A vault collection stores the mapping between each token and its original value. Your main collection stores only tokens. Authorized parts of your application call the vault to retrieve the original value. Any part of the system that doesn't need the real value handles only the token.

Tokenization is the right choice when a sensitive value must remain referenceable without ever appearing in plain text.

It's well suited for:

  • Identifiers and sensitive values that multiple parts of your system need to reference
  • Any value that must stay consistent and referenceable across your system without appearing in plain text

Advantages of Tokenization

Tokenization has the following advantages:

  • The sensitive value never appears in the main collection
  • Tokens are consistent and referenceable across your system
  • Parts of the application that don't call `detokenize()` never see the real card number

Disadvantages of Tokenization

Tokenization has the following disadvantages:

  • It adds an extra database read for every `detokenize()` call
  • Querying the original card number requires tokenizing the query value first
  • The vault collection must be carefully secured since it's the only place the original values exist

Steps to Implement Tokenization

In your `mongodb-masking` folder, create a new file named `tokenization.py`. Start by importing `uuid`, `datetime`, and `MongoClient`, then connect to your Atlas cluster as you did in the earlier scripts.


Next, write the `tokenize()` and `detokenize()` functions. `tokenize()` checks whether a token already exists for a given value and returns it if so. If no token exists, it generates a new UUID, stores the mapping in the vault, and returns the new token. `detokenize()` looks up the token in the vault and returns the original value:

Python
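A minimal sketch of the two functions. The vault field names (`token`, `original_value`, `created_at`) are assumptions, and the vault collection is passed in as the first argument, which is a choice made for this example.

```python
import uuid
from datetime import datetime, timezone


def tokenize(vault, value):
    """Return the token for `value`, creating a vault entry if none exists.

    `vault` is a pymongo Collection that maps tokens to original values.
    """
    existing = vault.find_one({"original_value": value})
    if existing:
        return existing["token"]  # same value always maps to the same token
    token = str(uuid.uuid4())  # random, so it reveals nothing about the value
    vault.insert_one(
        {
            "token": token,
            "original_value": value,
            "created_at": datetime.now(timezone.utc),
        }
    )
    return token


def detokenize(vault, token):
    """Return the original value for `token`, or None if it is unknown."""
    entry = vault.find_one({"token": token})
    return entry["original_value"] if entry else None
```

The lookup followed by an insert is not atomic. If several processes tokenize the same value at the same time, a unique index on `original_value` would be one way to prevent duplicate vault entries.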


Now, store a token instead of the real card number. Convert the card number to a token before inserting the payment record. The `payments` collection never stores the raw card number:

Python
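One way to sketch this step. The helper name `store_payment()`, the `card_token` and `amount` field names, and the idea of passing the tokenizer in as a callable are all assumptions made for this example.

```python
def store_payment(payments, tokenize, card_number, amount):
    """Insert a payment that references the card by token only.

    `payments` is a pymongo Collection. `tokenize` is any callable that
    maps a card number to a token, such as a tokenize helper already
    bound to the vault collection.
    """
    record = {
        "card_token": tokenize(card_number),  # raw number is never stored
        "amount": amount,
    }
    payments.insert_one(record)
    return record["card_token"]
```

If your tokenizer takes the vault as its first argument, `functools.partial(tokenize, vault)` produces a callable of the right shape.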


Finally, retrieve the original value when needed. Fetch the payment record and pass its token to `detokenize()` to recover the original card number. Only the parts of your application that call `detokenize()` ever see the real value:

Python
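The retrieval step can be sketched the same way. The helper name `get_original_card_number()` and the `card_token` field are assumptions, and the detokenizer is passed in as a callable so this function never touches the vault directly.

```python
def get_original_card_number(payments, detokenize, token):
    """Look up a payment by its token and recover the real card number.

    `payments` is a pymongo Collection. `detokenize` is any callable that
    maps a token back to the original value. Returns None when no payment
    carries the given token.
    """
    record = payments.find_one({"card_token": token})
    if record is None:
        return None
    return detokenize(record["card_token"])
```

Keeping the detokenizer as an explicit argument makes it easy to see, and to restrict, which parts of the application are able to recover real card numbers.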


Run the script:

python tokenization.py

You should get an output similar to this:

Stored token: 3f2a1b4c-8d7e-4f6a-9b2c-1d3e5f7a9b0c

Original card number: 4111111111111111

Note: In production, store the vault collection in a separate database with restricted access. Encrypt sensitive fields within the vault documents to provide an additional layer of security. You can also add TTL indexes for token expiration and implement strict access control separation between the vault and the main collection.

Choosing the Right Technique

Each technique solves a different problem, and many production applications use more than one.

The table below summarizes when to reach for each:

| Technique | Best for | Key advantage |
|---|---|---|
| View-based masking | Permanent, reusable access layers | One-time setup, no code changes per query |
| Aggregation pipeline masking | Role-based, dynamic access | Flexible per-query control |
| Static masking | Dev and test environments | Full dataset, zero production exposure |
| Tokenization | Payment data, reusable references | Original value never stored in plain text |

Key Takeaways

  • MongoDB Atlas free tier (M0) supports all four masking techniques: view-based masking, aggregation pipeline masking, static masking, and tokenization
  • View-based masking and aggregation pipeline masking are best for controlling live query access
  • Static masking is the safest option for dev and test environments
  • Tokenization protects values that must stay referenceable without appearing in plain text
  • You can combine all four techniques in a single application for layered protection

Common Reader Concerns

  • Most masking and tokenization techniques work on a free Atlas M0 cluster; a paid plan is only needed for advanced features like field-level encryption.
  • Tokenized data cannot be queried by its original value directly; you must tokenize the input first. For frequent queries on the original value, masking approaches are more suitable.
  • These techniques can be combined, such as using static masking for test data and tokenization in production.
  • Performance impact is small: view-based and aggregation pipeline masking add slight overhead to each query, static masking adds none to production queries, and tokenization requires an extra vault lookup for each `detokenize()` call.
  • Tools like Faker are safe for generating synthetic test data, as they do not use real user information.