Data masking lets you share, display, or process data without revealing the original values. Below are real-world scenarios where masking can be helpful:
Masking also helps you meet regulatory requirements, such as the General Data Protection Regulation (GDPR), Payment Card Industry Data Security Standard (PCI DSS), and Health Insurance Portability and Accountability Act (HIPAA). These regulations require organizations to limit the exposure of sensitive data.
MongoDB supports multiple masking approaches.
This tutorial shows you how to implement four of them using Python: view-based masking, aggregation pipeline masking, static masking, and tokenization.
You can find all the code samples for this tutorial in the GitHub repository.
Before you start, make sure you have the following ready:
Make sure your IP address is whitelisted in Atlas before you run any scripts. Go to Security > Database & Network Access > IP Access List in the Atlas UI and add your current IP address. Your scripts won't connect to your cluster without this step.
Step 1: Create a folder for your project and navigate into it in your terminal:
- `mkdir mongodb-masking`
- `cd mongodb-masking`
Step 2: Create and activate a virtual environment to keep your project dependencies isolated:
- `python -m venv venv`
- `source venv/bin/activate`
Step 3: On Windows, activate the virtual environment with this command instead:
- `venv\Scripts\activate`
Step 4: Now install the required packages:
- `python -m pip install pymongo faker`
You'll create a separate Python file for each masking technique as you work through this tutorial. All files go in the `mongodb-masking` folder.
A MongoDB view is a read-only virtual collection. It runs an aggregation pipeline on a base collection and returns transformed results. The original data stays untouched in the base collection. You create the view once, then point your application or tools to it instead of the base collection.
View-based masking works best in scenarios where the same masked data gets queried repeatedly.
For example:
View-based masking has the following advantages:
View-based masking has the following disadvantages:
In your `mongodb-masking` folder, create a new file named `view_masking.py`. Start by importing `MongoClient` from `pymongo` and connecting to your Atlas cluster:
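A minimal sketch of this step, assuming the Atlas connection string is stored in an environment variable. The variable name `MONGODB_URI`, the local fallback URI, and the database name `masking_demo` are illustrative choices, not names taken from the tutorial:

```python
import os


def resolve_uri(default="mongodb://localhost:27017"):
    """Return the connection string from MONGODB_URI (assumed name), else a local default."""
    return os.environ.get("MONGODB_URI", default)


def get_database(name="masking_demo"):
    """Connect to the cluster and return a database handle (database name is an assumption)."""
    # Imported lazily so a missing driver only fails when a connection is requested.
    from pymongo import MongoClient

    client = MongoClient(resolve_uri())
    return client[name]
```

Keeping the connection string in the environment avoids committing credentials to the project folder. Calling `get_database()` returns the handle that the later steps can use as `db`.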
Next, insert sample cardholder documents into a base collection. The `cards` collection holds the raw cardholder records. Each document includes a cardholder name, full card number, CVV, expiry date, and billing address:
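The sample data might look like the sketch below. Jane Doe's card number, both CVVs, and both expiry dates match the outputs shown in this tutorial. John Smith's full card number and both billing addresses are placeholders invented for illustration. The `seed_cards()` helper and its choice to clear the collection first are assumptions as well:

```python
SAMPLE_CARDS = [
    {
        "cardholder_name": "Jane Doe",
        "card_number": "4111111111111111",
        "cvv": "123",
        "expiry_date": "12/26",
        "billing_address": "1 Example Street, Sampletown",  # placeholder value
    },
    {
        "cardholder_name": "John Smith",
        "card_number": "4000000000005559",  # placeholder ending in 5559
        "cvv": "456",
        "expiry_date": "08/27",
        "billing_address": "2 Example Avenue, Sampletown",  # placeholder value
    },
]


def seed_cards(db):
    """Replace the contents of the base `cards` collection with the sample documents."""
    cards = db["cards"]
    cards.delete_many({})  # assumption: start from a clean collection on every run
    # Insert copies so the `_id` fields PyMongo adds do not leak into SAMPLE_CARDS.
    result = cards.insert_many([dict(doc) for doc in SAMPLE_CARDS])
    return len(result.inserted_ids)
```

Here `db` is any PyMongo database handle. The function returns the number of inserted documents.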
Create a view that masks sensitive fields. The view uses a `$project` stage to suppress `_id`, show only the last four digits of the card number, and fully redact the CVV and billing address:
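One way to express this view, as a sketch rather than the tutorial's exact listing. The helper names `build_mask_pipeline()` and `create_masked_view()` are hypothetical, and the twelve-asterisk prefix assumes 16-digit card numbers, as in the sample output:

```python
def build_mask_pipeline():
    """Return a $project pipeline that masks the sensitive card fields."""
    # Last four characters of the card number, whatever its length.
    last_four = {
        "$substrCP": [
            "$card_number",
            {"$subtract": [{"$strLenCP": "$card_number"}, 4]},
            4,
        ]
    }
    return [
        {
            "$project": {
                "_id": 0,  # suppress the document id
                "cardholder_name": 1,
                "card_number": {"$concat": ["************", last_four]},
                "cvv": {"$literal": "REDACTED"},
                "expiry_date": 1,
                "billing_address": {"$literal": "REDACTED"},
            }
        }
    ]


def create_masked_view(db):
    """(Re)create the `cards_masked` view on top of the `cards` collection."""
    db["cards_masked"].drop()  # dropping a view removes only its definition
    db.create_collection(
        "cards_masked", viewOn="cards", pipeline=build_mask_pipeline()
    )
```

Dropping the view before creating it lets you rerun the script without a "namespace already exists" error. The base `cards` collection is not modified.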
Now, query the masked view by reading from `cards_masked` instead of the base `cards` collection. Any consumer who queries this view never sees the full card number, CVV, or billing address:
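Reading from the view uses the same calls as reading from a regular collection. The sketch below assumes a PyMongo database handle named `db`. The helper names are hypothetical:

```python
import json


def fetch_masked_cards(db):
    """Return every document exposed by the `cards_masked` view."""
    return list(db["cards_masked"].find())


def to_pretty_json(docs):
    """Format documents for printing; the view drops `_id`, so they serialize cleanly."""
    return json.dumps(docs, indent=2)
```

Printing `to_pretty_json(fetch_masked_cards(db))` produces indented JSON in the style of the output shown in this section.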
Run the script:
`python view_masking.py`
You should get the following output:

    [
      {
        "cardholder_name": "Jane Doe",
        "card_number": "************1111",
        "cvv": "REDACTED",
        "expiry_date": "12/26",
        "billing_address": "REDACTED"
      },
      {
        "cardholder_name": "John Smith",
        "card_number": "************5559",
        "cvv": "REDACTED",
        "expiry_date": "08/27",
        "billing_address": "REDACTED"
      }
    ]

Aggregation pipeline masking applies transformation logic at query time inside your Python code. No permanent view gets created. Each query runs a pipeline that masks fields before returning results. This gives you more control than a view because you can adjust the masking logic per query based on the user's role or context.
Aggregation pipeline masking works best when different users need different levels of access to the same data.
For example:
Aggregation pipeline masking has the following advantages:
Aggregation pipeline masking has the following disadvantages:
In your `mongodb-masking` folder, create a new file named `aggregation_masking.py`. Start by importing `MongoClient` from `pymongo` and connecting to your Atlas cluster:
Next, define a reusable masking function. The `get_masked_cards()` function accepts a `user_role` parameter and passes it into the aggregation pipeline as a variable using `let`. The pipeline uses `$cond` to evaluate `$$role` inside MongoDB and decide which fields to show or redact. The card number and billing address are masked for all roles. The CVV and expiry date are conditionally shown or fully redacted depending on the role:
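A sketch of such a function is shown below. It takes the database handle as an extra `db` argument, and the role name `fraud_analyst` is an assumed value for the privileged role. Passing variables through the aggregate-level `let` option requires MongoDB 5.0 or later:

```python
REDACTED = "REDACTED"
PRIVILEGED_ROLE = "fraud_analyst"  # assumed role name


def build_role_pipeline():
    """Return a pipeline that reveals CVV and expiry only to the privileged role."""

    def reveal_if_privileged(field):
        # $cond is evaluated inside MongoDB; $$role comes from the `let` option.
        return {
            "$cond": [{"$eq": ["$$role", PRIVILEGED_ROLE]}, f"${field}", REDACTED]
        }

    last_four = {
        "$substrCP": [
            "$card_number",
            {"$subtract": [{"$strLenCP": "$card_number"}, 4]},
            4,
        ]
    }
    return [
        {
            "$project": {
                "_id": 0,
                "cardholder_name": 1,
                "card_number": {"$concat": ["************", last_four]},
                "cvv": reveal_if_privileged("cvv"),
                "expiry_date": reveal_if_privileged("expiry_date"),
                "billing_address": {"$literal": REDACTED},
            }
        }
    ]


def get_masked_cards(db, user_role):
    """Run the masking pipeline on `cards` with the caller's role bound to $$role."""
    cursor = db["cards"].aggregate(build_role_pipeline(), let={"role": user_role})
    return list(cursor)
```

Calling `get_masked_cards(db, "support")` would redact the CVV and expiry date, while `get_masked_cards(db, "fraud_analyst")` would return them. Any role other than the privileged one falls through to the redacted branch, so unknown roles fail closed.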
Now, call the function with the appropriate role. For example, support agents may receive a more restricted view, while fraud analysts see the CVV and expiry date to help them investigate suspicious transactions:
Run the script:
`python aggregation_masking.py`
You should get the following output:

    [
      {
        "cardholder_name": "Jane Doe",
        "card_number": "************1111",
        "cvv": "REDACTED",
        "expiry_date": "REDACTED",
        "billing_address": "REDACTED"
      },
      {
        "cardholder_name": "John Smith",
        "card_number": "************5559",
        "cvv": "REDACTED",
        "expiry_date": "REDACTED",
        "billing_address": "REDACTED"
      }
    ]
    ---
    [
      {
        "cardholder_name": "Jane Doe",
        "card_number": "************1111",
        "cvv": "123",
        "expiry_date": "12/26",
        "billing_address": "REDACTED"
      },
      {
        "cardholder_name": "John Smith",
        "card_number": "************5559",
        "cvv": "456",
        "expiry_date": "08/27",
        "billing_address": "REDACTED"
      }
    ]

Static data masking creates a separate anonymized copy of your database. A Python script reads documents from your source collection, replaces sensitive fields with fake values, and writes the results to a target collection. The source data stays intact. You can run this script once or on a schedule to keep your development or test environment updated.
Static masking is the right choice when you need a safe, realistic dataset outside of production.
It's well suited for:
Static masking has the following advantages:
Static masking has the following disadvantages:
In your `mongodb-masking` folder, create a new file named `static_masking.py`. Start by importing `Faker` from `faker` and `MongoClient` from `pymongo`, then connect to your Atlas cluster:
Next, seed the source collection with sample cardholder documents. Insert realistic-looking records into `production.cards` to simulate what a live collection would contain:
Now, write the masking and copy script. The `mask_document()` function replaces every sensitive field with a Faker-generated value. The `run_static_masking()` function drops the target collection, reads every document from the source, masks it, and inserts the result into `development.cards`:
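The two functions could be sketched as follows. This version passes the `Faker` instance and the `MongoClient` in as arguments, drops the source `_id` so the target gets fresh ids, and joins Faker's multi-line address onto one line. Those are design assumptions, not details confirmed by the tutorial:

```python
def mask_document(doc, fake):
    """Return a copy of `doc` with every sensitive field replaced by a fake value."""
    # Keep non-sensitive fields; drop `_id` so the target assigns new ids.
    masked = {key: value for key, value in doc.items() if key != "_id"}
    masked.update(
        {
            "cardholder_name": fake.name(),
            "card_number": fake.credit_card_number(),
            "cvv": fake.credit_card_security_code(),
            "expiry_date": fake.credit_card_expire(),  # Faker's default format is MM/YY
            "billing_address": fake.address().replace("\n", ", "),
        }
    )
    return masked


def run_static_masking(client, fake):
    """Rebuild development.cards as a masked copy of production.cards."""
    source = client["production"]["cards"]
    target = client["development"]["cards"]

    target.drop()  # remove the previous masked copy
    masked_docs = [mask_document(doc, fake) for doc in source.find()]
    if masked_docs:  # insert_many rejects an empty list
        target.insert_many(masked_docs)
    return len(masked_docs)
```

With `from faker import Faker` and a connected client, the call would be `run_static_masking(client, Faker())`. For large collections, reading and inserting in batches would avoid holding every document in memory at once.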
Finally, verify the output by retrieving one document from the target collection to confirm the masking worked:
Run the script:
`python static_masking.py`
The output should contain Faker-generated values rather than the data you inserted, and it changes on every run.
For example:

    {
      "cardholder_name": "Patricia Moore",
      "card_number": "4716823519064731",
      "cvv": "901",
      "expiry_date": "03/28",
      "billing_address": "245 Oak Lane, Springfield, OH 45501"
    }

Tokenization replaces a sensitive value with a random token. A vault collection stores the mapping between each token and its original value. Your main collection stores only tokens. Authorized parts of your application call the vault to retrieve the original value. Any part of the system that doesn't need the real value handles only the token.
Tokenization is the right choice when a sensitive value must remain referenceable without ever appearing in plain text.
It's well suited for:
Tokenization has the following advantages:
Tokenization has the following disadvantages:
In your `mongodb-masking` folder, create a new file named `tokenization.py`. Start by importing `uuid`, `datetime`, and `MongoClient`, then connect to your Atlas cluster:
Next, write the `tokenize()` and `detokenize()` functions. `tokenize()` checks whether a token already exists for a given value and returns it if so. If no token exists, it generates a new UUID, stores the mapping in the vault, and returns the new token. `detokenize()` looks up the token in the vault and returns the original value:
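A minimal sketch of the two functions, assuming the vault collection is passed in as `vault` and that each vault document uses the field names `token`, `original_value`, and `created_at`. The field names and the extra `vault` parameter are illustrative:

```python
import uuid
from datetime import datetime, timezone


def tokenize(vault, value):
    """Return the token for `value`, creating and storing one if none exists."""
    existing = vault.find_one({"original_value": value})
    if existing:
        return existing["token"]  # reuse the token already issued for this value

    token = str(uuid.uuid4())  # random token with no relationship to the value
    vault.insert_one(
        {
            "token": token,
            "original_value": value,
            "created_at": datetime.now(timezone.utc),
        }
    )
    return token


def detokenize(vault, token):
    """Return the original value for `token`, or None if the token is unknown."""
    record = vault.find_one({"token": token})
    return record["original_value"] if record else None
```

A payment record would then store only the result of `tokenize(vault, card_number)`, for example under a hypothetical `card_token` field, and authorized code would pass that field to `detokenize()` to recover the card number. Reusing an existing token means the same card always maps to the same reference, which keeps lookups and joins on the token consistent.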
Now, store a token instead of the real card number. Convert the card number to a token before inserting the payment record. The `payments` collection never stores the raw card number:
Finally, retrieve the original value when needed. Fetch the payment record and pass its token to `detokenize()` to recover the original card number. Only the parts of your application that call `detokenize()` ever see the real value:
Run the script:
`python tokenization.py`
You should get an output similar to this:

    Stored token: 3f2a1b4c-8d7e-4f6a-9b2c-1d3e5f7a9b0c
    Original card number: 4111111111111111

> _**Note:** In production, store the vault collection in a separate database with restricted access. Encrypt sensitive fields within the vault documents to provide an additional layer of security. You can also add TTL indexes for token expiration and implement strict access control separation between the vault and the main collection._
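
As an illustration of the TTL suggestion in the note above, the sketch below adds an expiring index to the vault. It assumes each vault document carries a `created_at` date field. The field name, the helper name, and the 90-day lifetime are all hypothetical:

```python
NINETY_DAYS = 90 * 24 * 60 * 60  # assumed token lifetime, in seconds


def ensure_token_expiry(vault, ttl_seconds=NINETY_DAYS):
    """Create a TTL index so MongoDB deletes vault entries after `ttl_seconds`."""
    # TTL indexes only act on fields that hold BSON dates.
    return vault.create_index("created_at", expireAfterSeconds=ttl_seconds)
```

Once a vault entry expires, its token can no longer be detokenized, so choose a lifetime that outlasts every record that still references the token.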
Each technique solves a different problem, and many production applications use more than one.
The table below summarizes when to reach for each:
| Technique | Best for | Key advantage |
|---|---|---|
| View-based masking | Permanent, reusable access layers | One-time setup, no code changes per query |
| Aggregation pipeline masking | Role-based, dynamic access | Flexible per-query control |
| Static masking | Dev and test environments | Full dataset, zero production exposure |
| Tokenization | Payment data, reusable references | Original value never stored in plain text |