URL: https://thenewstack.io/from-spark-sql-to-declarative-pipelines-at-databricks/

From Spark SQL to Declarative Pipelines at Databricks - The New Stack


Databricks recently open sourced its Declarative Pipeline service and real-time mode technologies, which enable easier data streaming capabilities with low latency.
Jul 7th, 2025 1:08pm by Alex Williams
Photo of Michael Armbrust of Databricks by Alex Williams.

On his first day at Databricks in 2013, Michael Armbrust, employee No. 9, began coding Spark SQL.

Twelve years later, Armbrust, now a distinguished engineer, announced at the Databricks annual Data + AI Summit in June that the company had open sourced two of its platform technologies to Apache Spark. The news demonstrates Databricks’ continued focus on building out Spark, a project that has served as the company’s playbook since its inception.

Databricks CTO Matei Zaharia created Spark in 2009 at the University of California, Berkeley’s AMPLab as a platform for distributed machine learning. In early 2010, the codebase was open sourced, and in 2013, the project became part of the Apache Software Foundation.

Spark offers distributed data processing across compute clusters and coordinates workloads across multiple nodes. The outcome of that work stands as the foundation for what we see today in Databricks’ offerings, dating back to Armbrust’s first days at the company.

Zaharia, along with Databricks CEO Ali Ghodsi, Andy Konwinski, Ion Stoica, Arsalan Tavakoli, Patrick Wendell and Reynold Xin, all contributed to Spark and formed Databricks in 2013. As active contributors to Spark, the team commercialized the technologies they built to develop Databricks’ foundational technology.

Their first research project, called Shark, became what we know today as Spark SQL. Named after Spark and Apache Hive, Shark provided better performance than Hive through better querying and by caching data in a cluster’s memory. Perhaps most importantly, Shark integrated SQL, which led to the development of Spark SQL, made available with Spark 1.0 in May 2014.

Databricks historically presented itself as a group of people who started the Spark project. They emphasized simplicity, getting better value from data, and their open source roots.

Over the years, the company has open sourced several of its platform technologies.

At the Data + AI Summit in San Francisco last month, Databricks open sourced its Declarative Pipeline service and real-time mode technologies, which enable easier data streaming capabilities with low latency.

Declarative Pipelines

The distributed ETL pipeline, initially known as Delta Live Tables, evolved into Lakeflow Declarative Pipelines, which is now open sourced for Apache Spark. The structured streaming capability also emerged in Spark through a similar developmental process.

“Structured streaming — we built this team, we got it working before we open sourced it,” Armbrust said. “Delta, very similarly, it was a product inside of Databricks for over a year and a half before we open sourced it.”

Structured streaming leverages SQL’s high-level declarative language, which understands tables, columns, data types and schemas, as well as functions, to process ever-growing input tables. When new rows arrive, the query runs incrementally: it produces an updated answer but examines only the data that has arrived since the last update.
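The incremental idea can be illustrated with a small plain-Python sketch. It is a toy model, not the Spark Structured Streaming API: the `IncrementalCount` class, the `events` list and the `user` field are all hypothetical names chosen for illustration. The query keeps its running state and a cursor, so each update reads only the rows appended since the previous one.

```python
from collections import defaultdict


class IncrementalCount:
    """Toy incremental query: a running count of rows per user."""

    def __init__(self):
        self.offset = 0                  # index of the first unprocessed row
        self.counts = defaultdict(int)   # query state carried between updates

    def update(self, table):
        """Process only the rows appended since the last update."""
        new_rows = table[self.offset:]
        for row in new_rows:
            self.counts[row["user"]] += 1
        self.offset = len(table)
        return len(new_rows)             # how many rows were actually scanned


# An append-only list stands in for the ever-growing input table.
events = [{"user": "ana"}, {"user": "bo"}]
query = IncrementalCount()
first_scan = query.update(events)        # scans the two initial rows

events.append({"user": "ana"})           # a new row arrives
second_scan = query.update(events)       # scans only the one new row
result = dict(query.counts)
```

The second update touches a single row yet the answer reflects the whole table, which is the property the paragraph above describes.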

“There’s nothing that a sophisticated engineer couldn’t do with Spark, Spark SQL, structured streaming and Delta by hand that you can do with Declarative Pipelines,” Armbrust said.

He added, “Declarative Pipelines let you focus on the interesting part, the data transformation, and it extracts away what I would call undifferentiated heavy lifting.”
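To make "declarative" concrete, here is a minimal sketch in plain Python. It is not the Declarative Pipelines API: the `PIPELINE` dictionary, the dataset names and the `run` helper are assumptions made up for this illustration. The author only declares each dataset and what it reads from, and the runner works out the execution order, which is one small piece of the heavy lifting Armbrust refers to.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def raw_orders():
    return [{"id": 1, "amount": "10"},
            {"id": 2, "amount": "oops"},
            {"id": 3, "amount": "5"}]


def clean_orders(raw):
    # Keep only rows whose amount parses as a non-negative integer.
    return [{"id": r["id"], "amount": int(r["amount"])}
            for r in raw if r["amount"].isdigit()]


def revenue(clean):
    return sum(r["amount"] for r in clean)


# Declarative part: dataset name -> (function, names of upstream datasets).
# The entries are deliberately listed out of order.
PIPELINE = {
    "revenue": (revenue, ["clean_orders"]),
    "clean_orders": (clean_orders, ["raw_orders"]),
    "raw_orders": (raw_orders, []),
}


def run(pipeline):
    """Derive the execution order from the declared dependencies, then run."""
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        fn, deps = pipeline[name]
        results[name] = fn(*(results[d] for d in deps))
    return order, results


order, results = run(PIPELINE)
```

A real pipeline engine also handles retries, incremental refresh and cluster resources; the sketch covers only dependency ordering.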

The Databricks team designed Delta with streaming in mind, Armbrust said. That design makes it possible to transform data across multiple tables by consuming it from one table and pushing it downstream to the next.

“Our customers often call this the medallion architecture, where you take raw data, you bring it into bronze, you do a little bit of cleaning, you bring it to silver, and then you bring it to finally gold,” Armbrust said.

“Gold are the tables that actually have answers for your business. It’s always a process to get from bronze dirty data to gold data, and the pipelines and streaming are what enable this. Delta: I think of it as the nodes of this graph. And because it natively supports change data feeds, it allows you to do this incrementally, which is critical for performance at scale.”
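The bronze, silver and gold flow Armbrust describes can be sketched as follows. This is a toy model under stated assumptions: plain Python lists and a dictionary stand in for Delta tables, a simple cursor stands in for a change data feed, and the `region` and `sales` fields are invented for the example.

```python
# Toy medallion flow: bronze (raw) -> silver (cleaned) -> gold (business answer).
# Nothing here is a Delta API; the structures are hypothetical stand-ins.

bronze = []          # raw records exactly as they arrived
silver = []          # validated, typed records
gold = {}            # total sales per region
bronze_cursor = 0    # how far into bronze the downstream tables have read


def ingest(raw_records):
    bronze.extend(raw_records)


def propagate():
    """Push only the bronze rows added since the last run downstream."""
    global bronze_cursor
    changes = bronze[bronze_cursor:]
    bronze_cursor = len(bronze)
    for rec in changes:
        try:
            row = {"region": rec["region"].strip().lower(),
                   "sales": float(rec["sales"])}
        except (KeyError, ValueError):
            continue             # dirty rows stay in bronze only
        silver.append(row)
        gold[row["region"]] = gold.get(row["region"], 0.0) + row["sales"]
    return len(changes)


ingest([{"region": " EU ", "sales": "10.5"}, {"region": "US", "sales": "n/a"}])
first = propagate()              # consumes two bronze rows, one is rejected
ingest([{"region": "eu", "sales": "4.5"}])
second = propagate()             # consumes only the newly ingested row
```

Bronze keeps everything, silver keeps only rows that pass cleaning, and gold is updated from the changes alone rather than recomputed from scratch.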

And time travel? It all comes back to how the data tells a story, Armbrust said. The logs are a record of the content in the tables over time.

“It’s no longer just a static collection of data,” he said. “It’s this living and changing collection of data where you can ask questions about what has changed over time.”
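A minimal sketch of how a log makes time travel possible, assuming a simplified row-level log (the `VersionedTable` class and its `commit` and `as_of` methods are hypothetical and are not Delta's actual log format or API): replaying the log up to a given version rebuilds the table as it stood at that point.

```python
class VersionedTable:
    """Toy table whose contents are defined by an append-only commit log."""

    def __init__(self):
        self.log = []                      # entries: ("add" | "remove", row)

    def commit(self, action, row):
        self.log.append((action, row))
        return len(self.log)               # version number after this commit

    def as_of(self, version):
        """Rebuild the table as of `version` by replaying the log."""
        rows = []
        for action, row in self.log[:version]:
            if action == "add":
                rows.append(row)
            else:
                rows.remove(row)
        return rows


table = VersionedTable()
v1 = table.commit("add", "alice")
v2 = table.commit("add", "bob")
v3 = table.commit("remove", "alice")

before = table.as_of(v2)                   # contents before the removal
after = table.as_of(v3)                    # current contents
removed = [r for r in before if r not in after]
```

Comparing two versions, as the last line does, is the "what has changed over time" question in miniature.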

And the Unity Catalog, also open source, provides governance, notably through rich metadata, which allows for fine-grained filtering, Armbrust said. An engineer may annotate columns and tables with descriptions. An AI assistant can read those comments and use that information to help write queries over the data.

MLflow is another core piece that fits with Declarative Pipelines.

The result is that customers can build end-to-end data and AI workflows using only Databricks technologies while still benefiting from open standards and avoiding vendor lock-in through the open source Apache Spark project.

What Is Real-Time Mode?

Declarative Pipelines rely on low latency. Real-time mode, also open sourced by Databricks for Apache Spark, expands the aperture for low-latency workflows by enabling structured streaming for operational use cases, thereby transforming the way streaming data is processed.

“Instead of running micro batches, where we decide ahead of time what data is going to be processed, we start long-running tasks that are continually polling for new data,” Armbrust said. “And so that means we can process it immediately.”
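The latency difference can be modeled with simple arithmetic. The sketch below is not a measurement of Spark: the arrival times, the one-second batch interval, the 50 ms scheduling overhead and the 10 ms poll interval are all made-up numbers, and `pickup_delays` is a hypothetical helper. It only shows why an event that must wait for the next planned batch tends to sit longer than one picked up by a task that is already running and polling.

```python
def pickup_delays(arrivals_ms, tick_ms, overhead_ms=0):
    """Delay until each event is picked up at the next tick, plus overhead."""
    delays = []
    for t in arrivals_ms:
        next_tick = -(-t // tick_ms) * tick_ms   # round up to a tick boundary
        delays.append(next_tick - t + overhead_ms)
    return delays


arrivals = [123, 457, 981]   # event arrival times in milliseconds (made up)

# Micro-batch: a batch is planned once per second, with scheduling overhead.
batch = pickup_delays(arrivals, tick_ms=1000, overhead_ms=50)

# Long-running task: already started, polls the source every 10 ms.
polling = pickup_delays(arrivals, tick_ms=10)
```

Under these assumed numbers every polled event is picked up within 10 ms, while the micro-batch events wait between 69 ms and 927 ms.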

It again shows why streaming is now a first-class citizen. Microbatching can lead to latency issues, complexities in resource utilization, data quality challenges and difficulty in debugging.

Databricks is making a run in a fast-growing market and faces plenty of competition. VentureBeat has a comprehensive look at Databricks open sourcing declarative pipelines, citing Snowflake and how it integrates with Apache NiFi to centralize any data from any source into its platform.

The Databricks approach overlaps with multiple vendors.

Google has Google Cloud Dataflow, Amazon Web Services offers AWS Glue, and Microsoft provides Azure Data Factory, all of which offer data transformation capabilities. There are also vendors like Fivetran and Airbyte, which partner with Databricks as well. As mentioned, Snowflake also competes with Databricks.

Staying True to Open Source Roots

Databricks confirms why open source companies do so well when they stay committed to their roots while also building out a proprietary platform that accelerates growth.

Building an open source project from scratch, transforming it into a platform, and using it to set the direction for an entire ecosystem positions Databricks to take on the largest monolithic software companies of the past 20 to 30 years.

Numerous companies have failed in their open source journey. It’s not even worth mentioning any by name. Their stories are all quite similar. They face pressure due to a host of factors, become proprietary and struggle to maintain their standing in the community.

The creators behind Spark are still involved with Databricks. Long focused on data analytics, Databricks has developed a range of products, established partnerships and made acquisitions to cater to the needs of those who create data pipelines as well as those who use them to transform data.

Declarative environments are well known, as is the need to reduce latency, especially as open source communities working on complex pipelines will increasingly face pressure to implement AI and agent-based frameworks.

Getting data in the desired state is the promise of declarative data pipelines and how they fit with DevOps code deployments, data operations and the layering of data models with AI that adapt to the user’s needs.

The open sourcing of Databricks technology demonstrates how the company contributes back to the open source project it created. It strengthens its place in the community.

And it’s not just the technology that gets contributed. Databricks engineers contribute to the core engine, demonstrating the value they provide, while the company uses the technology as the foundation of its platform products.

However, there are always some downsides to an approach that relies heavily on open source. Foremost, there’s the problem of perception. Do open source companies fine-tune their proprietary platforms over their open source equivalents? Does the open source platform then rank second in importance?

These are the kinds of questions that affect any open source provider. Databricks is not immune to these concerns either.

Alex Williams is founder and publisher of The New Stack. He's a longtime technology journalist who did stints at TechCrunch, SiliconAngle and what is now known as ReadWrite. Alex has been a journalist since the late 1980s, starting at the...
Amazon Web Services and Snowflake are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Databricks, Real.