URL: https://www.sitepoint.com/how-to-build-privacy-safe-cross-organizational-data-joins-with-databricks-cleanrooms/

How to Build Privacy-Safe Cross-Organizational Data Joins with Databricks Cleanrooms


Community Article
Community articles are authored by SitePoint Premium contributors. Content is screened before publication, and SitePoint reserves the right to moderate or remove articles that violate our guidelines. Views expressed are those of the authors and do not necessarily reflect those of SitePoint.


TL;DR:

Databricks Cleanrooms let two organizations run analytics on combined sensitive datasets without either side's raw data ever moving. This tutorial walks through the full setup: Unity Catalog governance policies, provider and consumer configuration, writing a privacy-safe notebook join, and the production pitfalls the documentation doesn't cover. The example uses financial transaction data, but the pattern applies to any regulated cross-organizational collaboration.

There's a question I still can't answer cleanly: when a partnership ends and lawyers get involved, is an audit trail that lives inside Databricks actually sufficient? I've been thinking about it for two years. I'll come back to it at the end. It's the reason I started taking notes on all of this in the first place.

In 2022 we needed to join our transaction signals with a partner bank's chargeback data. The first suggestion in the room was a shared S3 bucket. I didn't push back hard enough and we got thirty minutes into scoping it before someone's calendar invite for a legal review landed in everyone's inbox. That call was forty minutes of silence, broken up by our counsel saying "you did what" at least twice. I remember staring at my screen trying to look busy while the silence stretched out. Somewhere in the middle of it someone dropped a link to Databricks Cleanrooms in the chat. Nobody in the room had used one in production. I said I'd figure it out. That was optimistic.

This post is what I wish had existed then. The example uses financial transaction data, but the pattern works anywhere two organizations have complementary datasets and a real reason not to just hand them over. Healthcare, adtech, logistics, whatever applies to you.

Get Your Environment Right First

Unity Catalog is the thing that kills timelines. Most teams discover mid-project that their workspace is on the Standard plan and Unity Catalog isn't enabled. This happened to us on a Wednesday. The partner call was Friday; it was not a good Wednesday.

Check this before anything else, on both sides, before writing a single line of code:

  • Databricks Runtime 13.3 LTS or above on both workspaces. Minimum version where the Python SDK is bundled and Cleanrooms features are fully supported. Earlier versions fail in ways that produce confusing errors and a long Slack thread nobody wants.
  • Unity Catalog enabled on both metastores. Requires Databricks Premium or above. If you're not sure, you're probably not on it.
  • Databricks-to-Databricks Delta Sharing turned on in both workspace settings.
  • Python 3.10 or above on any local machine running SDK setup scripts.
  • databricks-sdk installed: pip install databricks-sdk
  • A service principal on each side with appropriate permissions on their data assets.
  • A signed data processing agreement between both organizations covering permitted use, output ownership, and what happens when the partnership ends.

That last one. I keep putting it at the bottom of lists and it keeps being the most important thing on them. Six months into one engagement, someone left one of the organizations. Nobody had written down who owned the output tables. Three weeks of back-and-forth between legal teams followed, all of it preventable with a single clause drafted before any code was written. Sort it out first.
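The locally checkable items on that list can be scripted so nobody discovers a gap on the Wednesday before a partner call. What follows is a minimal sketch, not an official tool: the function names are mine, and it assumes the DATABRICKS_RUNTIME_VERSION environment variable is present when the script runs on a cluster. Unity Catalog, Delta Sharing, and the signed agreement still have to be confirmed by hand.

```python
import os
import sys
from importlib import metadata
from typing import Optional, Tuple

MIN_PYTHON = (3, 10)
MIN_RUNTIME = (13, 3)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parse the leading 'major.minor' of a string such as '13.3' or '13.3.x-scala2.12'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def sdk_version() -> Optional[str]:
    """Return the installed databricks-sdk version, or None if it is not installed."""
    try:
        return metadata.version("databricks-sdk")
    except metadata.PackageNotFoundError:
        return None


def preflight(runtime_version: Optional[str] = None) -> dict:
    """Collect pass/fail results for the prerequisites that can be checked from code."""
    results = {
        "python>=3.10": sys.version_info[:2] >= MIN_PYTHON,
        "databricks-sdk installed": sdk_version() is not None,
    }
    # Assumption: this variable is set on Databricks clusters and absent on a laptop.
    runtime_version = runtime_version or os.environ.get("DATABRICKS_RUNTIME_VERSION")
    if runtime_version:
        results["runtime>=13.3"] = parse_major_minor(runtime_version) >= MIN_RUNTIME
    return results


for check, passed in preflight().items():
    print(f"{'OK  ' if passed else 'FAIL'} {check}")
```

Run it on both sides and compare the output before anyone schedules the kickoff.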

What You're Actually Building

A Databricks Cleanroom is a shared, isolated compute environment where two parties run analytics against combined datasets without either side being able to directly view, export, or reverse-engineer the other's raw data.

The part that took me the longest to internalize, and I read the docs twice before it clicked, was Delta Sharing. It is not a sync. Nothing moves. When a provider shares a table into a Cleanroom, the consumer's compute reads directly from the provider's object storage via short-lived signed credential URLs. Your data stays where it is. That is the sentence your legal team needs. Practice saying it out loud before the next meeting.

Most writeups hand-wave past how Delta Sharing actually works and it frustrates me, because the mechanism is what makes the privacy guarantee credible. It's not a policy sitting on top of a data copy. There is no copy. The compute comes to the data.
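If "short-lived signed URL" is still abstract, a toy version helps. The sketch below is not Delta Sharing's actual protocol or wire format; it only illustrates the general idea that the storage side signs a path plus an expiry time, and a reader holding that URL can fetch that one object until the clock runs out. The signing key, host name, and function names are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical key, known only to the provider's storage layer.
STORAGE_KEY = b"provider-storage-signing-key"


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a URL for one object that stops working after ttl_seconds."""
    issued_at = now if now is not None else time.time()
    expires = int(issued_at + ttl_seconds)
    message = f"{path}?expires={expires}".encode()
    signature = hmac.new(STORAGE_KEY, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "sig": signature})
    return f"https://storage.example.com{path}?{query}"


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: the signature must match and the URL must not have expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    expires = int(query["expires"][0])
    message = f"{parsed.path}?expires={expires}".encode()
    expected = hmac.new(STORAGE_KEY, message, hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    return hmac.compare_digest(expected, query["sig"][0]) and current < expires


url = sign_url("/tables/transaction_signals/part-0000.parquet", ttl_seconds=900, now=1000.0)
print(is_valid(url, now=1500.0))  # inside the 15-minute window
print(is_valid(url, now=2000.0))  # after expiry
```

The point for the legal conversation is that the file never changes hands. Only a time-boxed permission to read it in place does.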

Unity Catalog sits on top of that and handles governance: column-level masking so raw card numbers never appear in shared compute, row-level access policies so only eligible records are shared, and identity federation between both organizations' service principals. The Cleanroom environment handles isolation. Notebooks run in a sandboxed cluster, results go through a review step before export, and every query and policy change gets logged to an immutable audit trail.


Step 1: Apply Governance Policies Before You Touch the Cleanroom

Apply Unity Catalog governance policies directly to the underlying table before registering anything with the Cleanroom. These enforce automatically in any downstream compute, including inside the Cleanroom. Define them once and they follow the data everywhere.

The most common mistake here is hardcoding the shared salt in the notebook and committing it to version control. Use Databricks Secrets. In the column mask below, current_user() only stands in for the salt; replace it with a pre-shared secret stored there, not an inline value.

-- Unity Catalog implements row filters and column masks as SQL UDFs
-- that are attached to the table, so both policies below are functions.

-- Row-level policy: only records flagged for consortium sharing are visible
-- Replace 'partner_data_agreements' with your own access-control table
CREATE OR REPLACE FUNCTION fraud_catalog.security.consortium_row_filter(
    sharing_consent_flag STRING,
    data_residency_region STRING
)
RETURNS BOOLEAN
RETURN
    sharing_consent_flag = 'CONSORTIUM_ELIGIBLE'
    AND data_residency_region IN (
        SELECT allowed_region
        FROM fraud_catalog.security.partner_data_agreements
        WHERE partner_principal = current_user()
    );

ALTER TABLE fraud_catalog.signal_features.transaction_signals_gold
SET ROW FILTER fraud_catalog.security.consortium_row_filter
ON (sharing_consent_flag, data_residency_region);

-- Column mask: replace raw card numbers with a deterministic HMAC token
-- Both parties agree on the salt so join tokens match across orgs
-- current_user() is a placeholder: replace it with your SHARED_SALT secret in production
CREATE OR REPLACE FUNCTION fraud_catalog.security.mask_pan(card_number STRING)
RETURNS STRING
RETURN
    CASE
        WHEN is_account_group_member('cleanroom_fraud_analyst') THEN
            SHA2(CONCAT(card_number, current_user()), 256)
        ELSE NULL
    END;

ALTER TABLE fraud_catalog.signal_features.transaction_signals_gold
ALTER COLUMN card_number
SET MASK fraud_catalog.security.mask_pan;
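To see why the salt has to be a shared secret rather than anything user-specific, it helps to reproduce the mask outside SQL. This is a small sketch assuming the mask is SHA2(CONCAT(card_number, salt), 256) as written above; the salt strings and email addresses are made up. Strictly speaking that construction is a salted SHA-256 rather than a keyed HMAC, and Python's hmac module would be the stricter choice if both sides agree to it.

```python
import hashlib


def mask_pan(card_number: str, salt: str) -> str:
    """Python equivalent of SHA2(CONCAT(card_number, salt), 256): hex SHA-256 of the UTF-8 bytes."""
    return hashlib.sha256((card_number + salt).encode("utf-8")).hexdigest()


card = "4111111111111111"  # widely used test number, not a real card

# With a pre-shared salt, both organizations derive the same join token.
provider_token = mask_pan(card, "pre-shared-salt")
consumer_token = mask_pan(card, "pre-shared-salt")
print(provider_token == consumer_token)

# With a per-user value such as current_user(), the tokens diverge
# and the Cleanroom join silently matches nothing.
placeholder_a = mask_pan(card, "analyst@provider.example.com")
placeholder_b = mask_pan(card, "analyst@consumer.example.com")
print(placeholder_a == placeholder_b)
```

That second comparison is the reason the placeholder must not survive into production.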

Step 2: Provider Creates the Cleanroom

The provider is the party sharing data in. Run this from the provider's workspace.

One thing that isn't prominently documented: the Cleanroom name is case-sensitive. data_collaboration_cleanroom and Data_Collaboration_Cleanroom are different things and the failure is silent. Write the name down before you start and don't deviate from it.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sharing import (
    CleanRoom, CleanRoomAsset, CleanRoomAssetTable, CleanRoomCollaborator
)
# Verify these import paths and call signatures against your installed
# databricks-sdk version before running.

# Use Databricks Secrets for auth. Never hardcode tokens.
w = WorkspaceClient(
    host='https://adb-xxxx.azuredatabricks.net',  # your provider workspace URL
    token=DATABRICKS_TOKEN  # dbutils.secrets.get(scope="…", key="…")
)

# Create the Cleanroom
cleanroom = w.clean_rooms.create(name='data_collaboration_cleanroom')
print(f'Cleanroom created: {cleanroom.name}')

# Invite the consumer org
# consumer_metastore_id is found in their Unity Catalog metastore settings
w.clean_rooms.update(
    name='data_collaboration_cleanroom',
    clean_room=CleanRoom(
        collaborators=[CleanRoomCollaborator(
            global_metastore_id='consumer_metastore_id',  # replace with actual ID
            invite_recipient_email='dataplatform@consumer-org.example.com'
        )]
    )
)

# Register provider table
# Row and column policies from Step 1 enforce automatically here
w.clean_rooms.update(
    name='data_collaboration_cleanroom',
    clean_room=CleanRoom(
        local_assets=[CleanRoomAsset(
            name='transaction_signals',
            asset_type='TABLE',
            table=CleanRoomAssetTable(
                name='fraud_catalog.signal_features.transaction_signals_gold'
            )
        )]
    )
)

print('Provider assets registered.')

Step 3: Consumer Accepts and Registers Their Assets

The consumer runs this from their own workspace after receiving the invitation. The Cleanroom name must match exactly what the provider used in Step 2. Case-sensitive, same note applies.

Something worth saying here that I didn't fully appreciate when we were on the consumer side of an early engagement: you cannot inspect the provider's raw table definition from inside the Cleanroom. You are trusting that their policies in Step 1 are sufficient. Confirm with your own legal and governance teams before running this. That is not a formality you can skip on a deadline.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sharing import CleanRoom, CleanRoomAsset, CleanRoomAssetTable

# Run this from the consumer's Databricks workspace
w_consumer = WorkspaceClient(
    host='https://adb-yyyy.azuredatabricks.net',  # consumer workspace URL
    token=CONSUMER_TOKEN  # dbutils.secrets.get(scope="…", key="…")
)

# Accept the invitation and register the consumer's own table
# Consumer's own Unity Catalog policies remain active inside the Cleanroom
w_consumer.clean_rooms.update(
    name='data_collaboration_cleanroom',  # must match provider's name exactly
    clean_room=CleanRoom(
        local_assets=[CleanRoomAsset(
            name='account_behavior',
            asset_type='TABLE',
            table=CleanRoomAssetTable(
                name='consumer_catalog.risk_features.account_behavior_gold'
            )
        )]
    )
)

print('Consumer assets registered. Cleanroom ready.')

Both parties' Unity Catalog policies stay active inside the Cleanroom. Neither side sees the other's raw records.

Step 4: Write the Cleanroom Notebook

Cleanroom Notebooks run in an isolated cluster with access to both parties' shared assets. They cannot write raw data out or download locally. All output passes through a review step before either party can export it.

Inside the Cleanroom, assets are accessible under cleanroom_catalog.provider.<asset_name> and cleanroom_catalog.consumer.<asset_name>. This namespace is created automatically when both parties register their assets. You don't create it manually.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Provider data: card number arrives as an HMAC token from the masking policy
# The raw card number is never visible inside the Cleanroom
txn_signals = spark.table('cleanroom_catalog.provider.transaction_signals')

# Consumer data: account-level behavioral features
account_behavior = spark.table('cleanroom_catalog.consumer.account_behavior')

# Privacy-safe join on the pre-agreed HMAC token
# Both parties independently generate the same token from their own records
joined = txn_signals.alias('t').join(
    account_behavior.alias('a'),
    on=F.col('t.card_token') == F.col('a.card_token'),
    how='inner'
)

combined_features = joined.select(
    F.col('t.merchant_category_code'),
    F.col('t.txn_count_1h'),
    F.col('t.txn_amount_band'),
    F.col('t.cross_border_flag'),
    F.col('t.network_velocity_score'),
    F.col('a.account_age_band'),
    F.col('a.chargeback_rate_90d'),
    F.col('a.prior_fraud_flag'),
    F.col('t.confirmed_fraud_flag').alias('target')
)

# Aggregate guard: segments with fewer than 100 records are dropped
# Do not remove this filter
segment_stats = combined_features.groupBy(
    'merchant_category_code', 'account_age_band', 'cross_border_flag'
).agg(
    F.count('*').alias('record_count'),
    F.avg('target').alias('outcome_rate'),
    F.avg('txn_count_1h').alias('avg_velocity_1h'),
    F.avg('chargeback_rate_90d').alias('avg_chargeback_rate')
).filter(F.col('record_count') >= 100)

# Write to the output zone for result review
segment_stats.write.format('delta').mode('overwrite').saveAsTable(
    'cleanroom_catalog.outputs.collaboration_segment_signals'
)

print(f'Segments written: {segment_stats.count()}')
print('Awaiting result review approval from both parties before export.')

That .filter(F.col('record_count') >= 100) is the most important line in this notebook. In an early test run we removed it to see what the output looked like with small segments included. A few segments had a single record. The outcome rate for those segments was not aggregated or anonymized. It was just that individual's outcome sitting in a column called outcome_rate. We caught it before it left the environment. Put this filter in every Cleanroom notebook you write and do not let a code review pass without checking for it.
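The leak is easy to reproduce without Spark. Here is a plain-Python sketch of the same aggregation, with made-up merchant category codes and the same threshold of 100; it models the logic only, not the notebook itself.

```python
from collections import defaultdict

MIN_COHORT_SIZE = 100  # same threshold as the notebook filter


def segment_stats(rows, min_cohort_size=MIN_COHORT_SIZE):
    """Group (segment, target) pairs and keep only segments at or above the threshold."""
    grouped = defaultdict(list)
    for segment, target in rows:
        grouped[segment].append(target)
    stats = {
        segment: {"record_count": len(targets), "outcome_rate": sum(targets) / len(targets)}
        for segment, targets in grouped.items()
    }
    return {s: v for s, v in stats.items() if v["record_count"] >= min_cohort_size}


# 150 records in one segment (10% confirmed fraud) and a single record in another.
rows = [("5411", 1 if i % 10 == 0 else 0) for i in range(150)]
rows.append(("7995", 1))

unguarded = segment_stats(rows, min_cohort_size=1)
guarded = segment_stats(rows)

# Without the guard, the "rate" for 7995 is one person's outcome.
print(unguarded["7995"])
# With the guard, that segment never reaches the output zone.
print(sorted(guarded))
```

An outcome_rate of 1.0 over a record_count of 1 is not a statistic. It is a disclosure.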


What Actually Goes Wrong in Production

Token alignment will cost you more time than everything else combined

Both organizations have to produce identical join tokens from their own records. We spent three days on this once. Three days. The issue was trailing whitespace on one side that nobody noticed because it doesn't show up when you print the value. Zero match rate, no error, just silence and a blank join output and two engineers staring at each other. The fix took forty seconds once we found it. It was a .strip() call on both sides before hashing. That was it.

Before writing any Cleanroom notebook, define a shared token generation spec and validate it against a jointly agreed test vector file. Include at least one sample per card type and one edge case with leading zeros. It takes an hour and saves days.
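A spec and vector file can be as small as the sketch below. The normalization rules, the JSON layout, and the function names are assumptions for illustration, and the salt is a throwaway used only for test vectors, never the production secret. One side generates the file, the other runs its own implementation against it.

```python
import hashlib
import json


def normalize_pan(raw: str) -> str:
    """Shared spec: trim whitespace, drop spaces and dashes, keep leading zeros (stay a string)."""
    return raw.strip().replace(" ", "").replace("-", "")


def make_token(raw: str, salt: str) -> str:
    """Reference implementation: normalize first, then salted SHA-256."""
    return hashlib.sha256((normalize_pan(raw) + salt).encode("utf-8")).hexdigest()


def naive_token(raw: str, salt: str) -> str:
    """The bug from the story above: hashing without normalizing."""
    return hashlib.sha256((raw + salt).encode("utf-8")).hexdigest()


def build_vectors(samples, salt):
    """One party generates the agreed test vector file (here, a JSON string)."""
    return json.dumps([{"input": s, "token": make_token(s, salt)} for s in samples])


def validate_vectors(vector_json, token_fn, salt):
    """The other party checks its implementation; returns the inputs that fail to match."""
    return [v["input"] for v in json.loads(vector_json) if token_fn(v["input"], salt) != v["token"]]


SALT = "test-vector-salt"  # throwaway value for the vector file only
SAMPLES = [
    "4111111111111111",   # Visa-style test number
    "5555555555554444 ",  # Mastercard-style test number with trailing whitespace
    "378282246310005",    # Amex-style test number
    "0012345678901234",   # leading zeros must survive
]

vectors = build_vectors(SAMPLES, SALT)
print(validate_vectors(vectors, make_token, SALT))   # no failures
print(validate_vectors(vectors, naive_token, SALT))  # the whitespace case fails
```

Had we run something like this on day one, the three days would have been three minutes.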

Delta Sharing credentials expire silently

The failure mode is an opaque 403 during notebook execution. Set up automated rotation with alerting that fires at least seven days before expiry. Without it, you will find out about expired credentials at the worst possible moment, because that is when you find out about everything.
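The alerting logic itself is tiny. This sketch assumes you keep an inventory of credential names and expiry timestamps somewhere you control; the inventory shape and names are hypothetical, and wiring the result into a pager or chat channel is left out.

```python
from datetime import datetime, timedelta, timezone

ALERT_WINDOW = timedelta(days=7)


def needs_rotation(expires_at: datetime, now: datetime, window: timedelta = ALERT_WINDOW) -> bool:
    """True once the credential is inside the alert window, or already expired."""
    return expires_at - now <= window


def credentials_to_rotate(inventory: dict, now: datetime) -> list:
    """Names of credentials that should trigger an alert today."""
    return sorted(name for name, expires_at in inventory.items() if needs_rotation(expires_at, now))


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
inventory = {
    "partner_bank_share": now + timedelta(days=3),
    "internal_share": now + timedelta(days=30),
}
print(credentials_to_rotate(inventory, now))
```

Schedule it daily so the first warning arrives a week ahead instead of as a 403 mid-run.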

Cleanroom compute bills the provider

Set auto-termination to 30 minutes on every Cleanroom cluster you create. Without it, someone will forget to stop the cluster after a long run. Everyone forgets eventually. The bill conversation is worse than the bill.
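Since everyone forgets eventually, it is worth auditing rather than trusting. The sketch below checks a list of cluster specs for the 30-minute rule. It assumes each spec is a dict carrying the cluster_name and autotermination_minutes fields used by the Clusters API, where 0 means auto-termination is disabled; how you fetch those specs is up to you, and the sample data is invented.

```python
MAX_IDLE_MINUTES = 30


def missing_auto_termination(clusters, max_idle_minutes=MAX_IDLE_MINUTES):
    """Names of clusters whose auto-termination is disabled, absent, or above the limit."""
    flagged = []
    for cluster in clusters:
        minutes = cluster.get("autotermination_minutes", 0)
        if minutes == 0 or minutes > max_idle_minutes:
            flagged.append(cluster["cluster_name"])
    return flagged


clusters = [
    {"cluster_name": "cleanroom-fraud-join", "autotermination_minutes": 30},
    {"cluster_name": "cleanroom-adhoc", "autotermination_minutes": 0},
    {"cluster_name": "cleanroom-backfill", "autotermination_minutes": 120},
]
print(missing_auto_termination(clusters))
```

Anything this returns is a bill conversation waiting to happen.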

The result review step becomes a bottleneck faster than you expect

Manual review works fine for a proof of concept. It breaks down around week three when you're refreshing signals every few hours and the reviewer has seventeen other things happening. Build an automated review pipeline that validates outputs against a pre-approved schema: column names, data types, aggregation level, minimum cohort size. Auto-approve compliant results. Reserve manual review for new notebooks and schema changes only. We didn't build this early enough and had to explain to a partner why outputs from Tuesday hadn't been released by Thursday. It was a bad Thursday.
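The core of that automated check is a schema comparison plus the cohort guard. Below is a minimal sketch that treats output rows as plain dicts. The approved schema mirrors the columns written in Step 4, but the Python types are my assumption about how those columns arrive, and the function name is hypothetical.

```python
APPROVED_SCHEMA = {
    "merchant_category_code": str,
    "account_age_band": str,
    "cross_border_flag": bool,
    "record_count": int,
    "outcome_rate": float,
    "avg_velocity_1h": float,
    "avg_chargeback_rate": float,
}
MIN_COHORT_SIZE = 100


def review_output(rows, schema=APPROVED_SCHEMA, min_cohort_size=MIN_COHORT_SIZE):
    """Return a list of violations; an empty list means the output can be auto-approved."""
    violations = []
    for i, row in enumerate(rows):
        # Exact column match also enforces aggregation level: no stray row-level identifiers.
        if set(row) != set(schema):
            violations.append(f"row {i}: columns {sorted(row)} do not match the approved schema")
            continue
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                violations.append(f"row {i}: {column} is not {expected_type.__name__}")
        if isinstance(row["record_count"], int) and row["record_count"] < min_cohort_size:
            violations.append(f"row {i}: record_count {row['record_count']} is below {min_cohort_size}")
    return violations


good_row = {
    "merchant_category_code": "5411", "account_age_band": "1-3y", "cross_border_flag": False,
    "record_count": 150, "outcome_rate": 0.1, "avg_velocity_1h": 2.4, "avg_chargeback_rate": 0.01,
}
small_row = dict(good_row, record_count=3)
leaky_row = dict(good_row, card_token="abc123")

print(review_output([good_row]))
print(review_output([small_row, leaky_row]))
```

Anything with an empty violation list gets released automatically; everything else goes to a human along with the reasons.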

What's Worth Building Out From Here

The revocation pipeline is the piece most teams push down the backlog until something forces it up. When a data subject opts out or a partner agreement gets suspended, those records need to be excluded from Cleanroom compute immediately, not at the next scheduled refresh. A Structured Streaming job listening to a revocation event topic and merging updates into your Gold table handles this well. Unity Catalog's row filter checks the consent flag at query time, so the exclusion takes effect on the next notebook run with no Cleanroom reconfiguration needed. The reason teams deprioritize this is that it feels theoretical until it isn't. Build it before it stops feeling theoretical.
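The logic is simpler than the plumbing. The sketch below models the Gold table as a dict and the revocation topic as a list of events, purely to show why no Cleanroom reconfiguration is needed. In production the merge would be a Structured Streaming job writing to Delta, and the 'REVOKED' flag value and event shape here are hypothetical.

```python
ELIGIBLE = "CONSORTIUM_ELIGIBLE"
REVOKED = "REVOKED"  # hypothetical flag value for opted-out records


def apply_revocations(gold, events):
    """Merge revocation events into the Gold table (modelled as a dict keyed by card_token)."""
    for event in events:
        record = gold.get(event["card_token"])
        if record is not None:  # unknown tokens are ignored
            record["sharing_consent_flag"] = REVOKED
    return gold


def visible_rows(gold):
    """What the row filter lets through: consent is evaluated on every read."""
    return [token for token, record in gold.items() if record["sharing_consent_flag"] == ELIGIBLE]


gold = {
    "tok_a": {"sharing_consent_flag": ELIGIBLE},
    "tok_b": {"sharing_consent_flag": ELIGIBLE},
}
before = visible_rows(gold)

apply_revocations(gold, [{"card_token": "tok_b"}, {"card_token": "tok_unknown"}])
after = visible_rows(gold)

print(before, after)
```

Because the filter reads the flag at query time, flipping it in the table is the whole revocation as far as the Cleanroom is concerned.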

Differential privacy is worth understanding, but the calibration part is harder than most writeups let on. For segments involving rare event types or small sub-populations, calibrated noise adds a guarantee that cohort size alone can't provide. Google's pipeline_dp library integrates with PySpark for this. The harder problem is getting alignment on an epsilon value that means something to a non-technical stakeholder. We spent two weeks on it and landed somewhere I'm not fully confident in, partly because once a number was on the table nobody wanted to be the person who pushed back on it. It's a people problem wearing a math costume. Worth doing, but go in honest about that part.
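One thing that helped the epsilon conversation was turning the number into an error bar. The sketch below uses the textbook Laplace mechanism for a count, not pipeline_dp's API, and assumes each person contributes at most one record to a segment (sensitivity 1). The noise scale is sensitivity divided by epsilon, and that scale is also the expected absolute error.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b for the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Add Laplace(0, b) noise to a count; one record changes a count by at most 1."""
    b = laplace_scale(1.0, epsilon)
    # The difference of two independent Exponential(1) draws follows Laplace(0, 1).
    noise = b * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_count + noise


rng = random.Random(42)
for epsilon in (0.5, 1.0, 2.0):
    b = laplace_scale(1.0, epsilon)
    print(f"epsilon={epsilon}: counts are off by about {b:g} on average, "
          f"e.g. 120 -> {noisy_count(120, epsilon, rng):.1f}")
```

"Each count is off by about two" is something a non-technical stakeholder can push back on. A bare epsilon of 0.5 is not.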

If your organization operates under any of the following regulations, here is how the Cleanroom architecture maps directly to the key requirements:

| Regulatory Requirement | Cleanroom Control | Implementation |
|---|---|---|
| PCI-DSS: No PAN outside secure boundary | Zero-copy sharing + column masking | Raw PANs never leave provider storage; only HMAC tokens are shared |
| GLBA: Safeguard non-public personal info | Column-level masking (UC) | All direct identifiers masked before any shared compute runs |
| GLBA: Data minimisation | Row-level access policy | Only consortium-eligible records shared; minimal column set |
| CCPA: Purpose limitation | Cleanroom policy + approved notebooks | Compute restricted to fraud detection use; no other purpose permitted |
| CCPA: Right to opt-out | Row filter + revocation pipeline | Opt-out removes card from sharing within one processing cycle |
| SOX / Internal audit | System audit logs (immutable) | All queries, exports, and policy changes logged with actor, time, params |

The Thing I Still Haven't Solved

Audit portability. When a partner relationship ends, both sides need a complete record of what was computed, approved, and exported. Right now that trail lives inside Databricks. Whether it holds up when a partnership dissolves and lawyers are involved, I genuinely don't know.

The obvious answer is exporting audit logs to neutral third-party storage. The problem is that "neutral third-party" is harder to define than it sounds. I've watched two organizations spend longer arguing about where logs should live than it took to build the Cleanroom. Neither side trusted the other's suggested solution and they weren't wrong not to.

I've been sitting with this for two years and haven't landed anywhere satisfying. If you've solved it in production, I actually want to hear from you.

How Cleanrooms Compare to Other Approaches

If you're evaluating whether Databricks Cleanrooms are the right fit for your use case, here's how they stack up against the alternatives:

| Approach | Data Movement | PII Risk | ML Use Case Support | Operational Complexity | Regulatory Fit |
|---|---|---|---|---|---|
| Databricks Cleanrooms | Zero (Delta Sharing) | Low (UC policies) | Strong (full Spark) | Medium | Strong (audit trail) |
| AWS Clean Rooms | Zero (S3) | Low (policy engine) | Limited (SQL only) | Low-Med | Strong |
| Google Analytics Hub | Minimal | Low | Limited | Low | Moderate |
| Third-party fraud bureau | Full copy | High (new custodian) | Unrestricted (risk) | Very High | Depends on legal |
| Federated Learning | None (gradients only) | Very Low | ML only (no SQL joins) | Very High | Emerging |
| Synthetic data generation | Full copy (synthetic) | Medium | Good (training only) | High | Moderate |

A few honest caveats this table doesn't capture. Databricks Cleanrooms require the Premium plan, which carries a meaningful cost premium over Standard. For AWS-native teams already invested in the S3 ecosystem, AWS Clean Rooms is a genuinely strong alternative and operationally simpler to stand up. Vendor lock-in is also a real consideration: your Cleanroom notebooks, Unity Catalog policies, and Delta Sharing configuration are Databricks-specific and don't port cleanly to another platform. If your organization is not already committed to the Databricks ecosystem, factor that in before starting.

Conclusion

Databricks Cleanrooms solve a problem most teams work around badly. The technical setup is straightforward once your environment is right. The parts that actually cost time are the token alignment spec you agree on before writing any code, the cohort size guard you put in every notebook, and the revocation pipeline you build before it stops feeling theoretical. Get those three right and the rest follows.
