URL: https://thenewstack.io/delta-lake-a-layer-to-ensure-data-quality/

Delta Lake: A Layer to Ensure Data Quality

This post is about one of the Linux Foundation’s newest projects, called Delta Lake, which aims to ensure reliability of data across data lakes at massive scale.
Nov 6th, 2019 6:00am by Susan Hall
One of the Linux Foundation’s newest projects, called Delta Lake, aims to ensure the reliability of data across data lakes at massive scale. These Big Data systems are most commonly used for machine learning and data science, but also for business intelligence, visualization and reporting.

With multiple people working with data in a data lake at the same time, it’s easy for problems like incomplete transactions or multiple simultaneous updates to bring the quality of the data into question.

Apache Spark’s creators at Databricks also built Delta Lake. Though initially built atop Apache Spark, it now also supports other open source Big Data systems.

“Delta Lake enables you to add a transactional layer on top of your existing data lake. Now that you have transactional transactions on top of it, you can make sure you have reliable, high-quality data, and you can do all kinds of computations on it. You can, in fact, mix batch and streaming. … Because the data is reliable, it’s OK to have someone streaming in data while someone else is in batch reading it,” Ali Ghodsi, co-founder and CEO of Databricks, explained at Spark+AI Summit Europe.

Delta Lake provides ACID transactions, snapshot isolation, data versioning and rollback, as well as schema enforcement to better handle schema changes and data type changes.
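To make schema enforcement concrete, here is a minimal sketch in plain Python. It is not Delta Lake's code: the table schema, the `validate_batch` and `append` helpers, and the choice to allow missing columns are all assumptions made for illustration. The idea it shows is that a batch is checked against the table's declared schema before anything is written, so one bad row cannot leave the table half-updated.

```python
# Toy illustration of schema enforcement; not Delta Lake's actual implementation.
TABLE_SCHEMA = {"user_id": int, "event": str, "amount": float}  # hypothetical schema


def validate_batch(rows, schema=TABLE_SCHEMA):
    """Raise ValueError if any row has an unknown column or a wrongly typed value."""
    for i, row in enumerate(rows):
        extra = set(row) - set(schema)
        if extra:
            raise ValueError(f"row {i}: unexpected columns {sorted(extra)}")
        for column, value in row.items():
            # Missing columns and None values are allowed in this sketch.
            if value is not None and not isinstance(value, schema[column]):
                raise ValueError(
                    f"row {i}: column {column!r} expects {schema[column].__name__}, "
                    f"got {type(value).__name__}"
                )
    return rows


def append(table, rows):
    """Validate the whole batch first, so a bad row leaves the table untouched."""
    table.extend(validate_batch(rows))


table = []
append(table, [{"user_id": 1, "event": "click", "amount": 2.5}])
```

A later batch containing a row such as `{"user_id": "3", ...}` would raise `ValueError` and the table would keep its single existing row.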

Transactional Support

Databricks open sourced the technology in April 2019 under the Apache 2.0 license.

Companies using it in production include Viacom, Edmunds, Riot Games and McGraw Hill. Alibaba, Booz Allen Hamilton, Intel and Starburst Data are collaborating with Databricks to add support for Apache Hive, Apache NiFi and Presto.

There are other ways to add transactional support to data lakes. Cloudera’s Project Ozone takes a similar tack, and there’s Hive for HDFS-based storage.

Delta Lake is not a storage system per se. It sits atop your existing storage, such as HDFS or cloud storage like S3 or Azure Blob Storage, and provides a bridge between on-prem and cloud storage systems.

It can read from any storage system that supports Apache Spark’s data sources and can write to Delta Lake, which stores data in Apache Parquet format. All transactions made on Delta Lake tables are stored directly to disk.

At the heart of Delta Lake is the transaction log, a central repository that tracks every change users make. Each change is recorded as a JSON file, in the order the changes are made. If someone makes a change and then deletes it, a record of both steps remains, which simplifies auditing.
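The ordered-log idea can be sketched with nothing but the Python standard library. This is a toy model, not the real log format: the `commit` and `replay` helpers are hypothetical, and the zero-padded file names and `add`/`remove` actions are only loosely modeled on what Delta Lake keeps in its log directory. It shows how replaying the commits in order yields the current set of data files, while the history of a deleted file stays on record.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit as a JSON-lines file named after its version number."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w", encoding="utf-8") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def replay(log_dir):
    """Replay every commit in version order and return the live data files."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name), encoding="utf-8") as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
commit(log_dir, 2, [{"remove": {"path": "part-0000.parquet"}}])
current_files = replay(log_dir)
print(sorted(current_files))  # ['part-0001.parquet']
```

All three commit files remain in the directory after the removal, which is what makes the log usable for auditing.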

It provides atomicity, recording only transactions that execute fully and completely, to ensure the trustworthiness of the data.

Optimistic Protocol

Just as multiple people can work on a jigsaw puzzle by tackling different areas of it, Delta Lake is designed to enable multiple people to work on the data at once without stepping on each other’s toes.

When dealing with petabytes of data, those users will most likely be working on different parts of the data. If two changes do happen simultaneously, Delta Lake relies on optimistic concurrency control, a protocol in which the data remains unlocked, to settle the matter.
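A rough sketch of optimistic concurrency control, again in plain Python and under stated assumptions: writers take no locks, each one tries to claim the next version number by creating that commit file exclusively, and a writer that loses the race re-reads the log and retries. The `try_commit`, `latest_version` and `commit_with_retry` helpers and the `CommitConflict` exception are hypothetical names. A real system would also check whether the winning commit actually conflicts with what the losing writer read before retrying; this sketch skips that step.

```python
import json
import os
import tempfile


class CommitConflict(Exception):
    """Raised when another writer has already claimed the version."""


def try_commit(log_dir, version, actions):
    """Create the commit file exclusively; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" raises FileExistsError instead of overwriting an existing file.
        with open(path, "x", encoding="utf-8") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
    except FileExistsError:
        raise CommitConflict(f"version {version} already committed") from None


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(n.split(".")[0]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Read the latest version, attempt version + 1, and retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        try:
            try_commit(log_dir, version, actions)
            return version
        except CommitConflict:
            continue  # another writer won the race; re-read and try the next slot
    raise CommitConflict("gave up after repeated conflicts")


log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, [{"add": {"path": "a.parquet"}}])
second = commit_with_retry(log_dir, [{"add": {"path": "b.parquet"}}])
```

Because a version number can be claimed only once, two writers can never both believe they wrote the same commit, and neither has to lock the data to get that guarantee.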

It also offers a “time travel” or data-versioning feature, enabling users to see the data as it stood at a specific point in time. After every 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format. Those files enable Spark to skip ahead to the most recent checkpoint file, which reflects the state of the table at that point, rather than replaying the whole log.
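How checkpoints and time travel fit together can be illustrated with a small in-memory model. Everything here is an assumption made for the example: the `ToyTable` class, its tuple-based actions, and the decision to snapshot on the tenth commit (version 9, counting from 0) are not Delta Lake's implementation. The point is that reading any version means starting from the nearest earlier checkpoint and replaying only the commits after it.

```python
CHECKPOINT_INTERVAL = 10  # mirrors the "10 commits" figure mentioned above


class ToyTable:
    def __init__(self):
        self.commits = []      # commits[v] holds the actions for version v
        self.checkpoints = {}  # version -> frozenset of live files at that version

    @staticmethod
    def _apply(files, actions):
        """Return a new file set with one commit's actions applied."""
        files = set(files)
        for op, path in actions:
            if op == "add":
                files.add(path)
            elif op == "remove":
                files.discard(path)
        return files

    def commit(self, actions):
        """Append a commit and snapshot the state on every tenth commit."""
        self.commits.append(list(actions))
        version = len(self.commits) - 1
        if (version + 1) % CHECKPOINT_INTERVAL == 0:
            self.checkpoints[version] = frozenset(self.state_at(version))
        return version

    def state_at(self, version):
        """Return the live files as of `version`, starting from the nearest checkpoint."""
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        usable = [v for v in self.checkpoints if v <= version]
        if usable:
            start = max(usable)
            files = set(self.checkpoints[start])
            first_to_replay = start + 1
        else:
            files, first_to_replay = set(), 0
        for v in range(first_to_replay, version + 1):
            files = self._apply(files, self.commits[v])
        return files


table = ToyTable()
for i in range(12):
    table.commit([("add", f"part-{i}.parquet")])  # versions 0 through 11
table.commit([("remove", "part-0.parquet")])      # version 12

print(len(table.state_at(3)))   # 4
print(len(table.state_at(12)))  # 11
```

Reading version 12 here starts from the checkpoint at version 9 and replays three commits instead of thirteen, while reading version 3 still works because the older commits are retained.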

Delta Lake supports two isolation levels: Serializable and WriteSerializable. Stronger than Snapshot isolation, WriteSerializable offers the best combination of availability and performance and is the default. The strongest level, Serializable, ensures that the serial sequence of operations exactly matches the one shown in the table’s history.

The Linux Foundation is a sponsor of The New Stack.

Image by DreamyArt from Pixabay.

Susan Hall is the Sponsor Editor for The New Stack. Her job is to help sponsors attain the widest readership possible for their contributed content. She has written for The New Stack since its early days, as well as sites...
TNS owner Insight Partners is an investor in: Databricks.