
URL: https://thenewstack.io/soda-io-checks-to-keep-your-data-in-line/

Soda Checks to Keep Your Data in Line - The New Stack



Soda Checks to Keep Your Data in Line

The Soda tools enable data analysts as well as engineers to write and execute checks on the health of data in production.
Jul 20th, 2022 3:00am by Susan Hall
Feature image via Pixabay.

There’s been a lot of talk lately about data mesh, which is not a technology or service but an organizational structure that moves ownership of data closer to the people who actually use it to create value for the company, as Emily Omier explained in a recent post.

She quoted Arsalan Tavakoli, senior vice president of field engineering at data management systems provider Databricks, saying:

If you have a central data engineering group, how well do they really understand what are the data sets that finance needs? Or the data sets that any of the business units needs? The closer you are to somebody who understands the business problems and the requirements and has the domain knowledge, the better prepared they are to build the right set of data assets to power the right kind of use cases.

Belgian startup Soda is taking that data ownership a step further by enabling business data owners to own data quality as well. Co-founders Tom Baeyens and Maarten Masschelein approached the issue from slightly different angles but recognized a common problem, and the company was born.

“There’s all these people working together to make some value out of the data that they have. And it turns out that in production, the biggest problem is actually to keep that data in a clean form. Because once you’re using data in production, then typically the engineers go and do something else, build the next product. And then it breaks down,” Baeyens explained.

There are myriad ways data systems can go wonky — it might be as simple as somebody adding a new field in Salesforce — but traditionally, engineers have to write code to create checks on data quality in production, something data analysts often lack the skills to do. The Soda team set out to change that, focusing on the needs of data analysts as well as the data engineers.

Data as Code

To that end, the company released Soda Core, a framework for embedding data reliability checks and quality management into data pipelines. It is powered by SodaCL (Soda Checks Language), a domain-specific language for data reliability.

Taking a page from the data-as-code concept, Soda Core is an open source CLI tool and Python library that enables users to use SodaCL to turn user-defined input into aggregated SQL queries. Core components include the use of dataset metadata to understand the shape and health of the data, and built-in metrics and broad check coverage that can be used to validate many data quality parameters. They include anomaly detection checks and change-over-time checks to detect and resolve issues in the data and alert the appropriate people. It’s the foundation for Soda Cloud, but also can be used as a standalone tool.
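To make the idea of "aggregated SQL queries" concrete, here is a minimal Python sketch of how several metrics could be folded into a single SELECT so that a table is scanned only once. It is not Soda Core's implementation: the function names, the metric names and the `customers` table are hypothetical, and an in-memory SQLite database stands in for a production data source.

```python
import sqlite3

# Hypothetical metric definitions: name -> SQL aggregate expression.
METRICS = {
    "row_count": "COUNT(*)",
    "missing_email": "SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)",
    "duplicate_id": "COUNT(id) - COUNT(DISTINCT id)",
}


def build_aggregated_query(table, metrics):
    """Combine every metric expression into one SELECT statement."""
    # Table and column names are interpolated directly, which is acceptable
    # for a sketch but would need validation with untrusted input.
    select_list = ", ".join(f"{expr} AS {name}" for name, expr in metrics.items())
    return f"SELECT {select_list} FROM {table}"


def measure(conn, table, metrics):
    """Run the aggregated query and return {metric name: measured value}."""
    row = conn.execute(build_aggregated_query(table, metrics)).fetchone()
    return dict(zip(metrics.keys(), row))


# Tiny demo data set with one missing email and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

result = measure(conn, "customers", METRICS)
print(result)
```

The point of the design is that adding another metric adds a column to the same query rather than another pass over the data.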

In 2021, the company released Soda SQL to help data engineers maintain reliable data pipelines in production. It has since built out SodaCL as a dedicated language, enabling data teams to check data as code across every data workload from ingestion to consumption.

As a more human-readable language, SodaCL eliminates the need to code in SQL, meaning that everyone on a data team can define the thresholds of what good data needs to look like. At the same time, underneath it still queries SQL-based data sources.
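As a rough illustration of what "defining thresholds" in a readable form can look like, the Python sketch below evaluates short check expressions such as `row_count > 0` against values that have already been measured. The parser is a deliberately simplified stand-in written for this example, not SodaCL's actual grammar or Soda's code, and the measurements are made-up values.

```python
import operator
import re

# Comparison operators supported by this simplified sketch.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

# "<metric> <operator> <numeric threshold>", e.g. "missing_count(email) = 0".
CHECK_PATTERN = re.compile(
    r"^\s*(?P<metric>[\w()]+)\s*(?P<op>>=|<=|!=|>|<|=)\s*(?P<threshold>-?\d+(?:\.\d+)?)\s*$"
)


def evaluate_check(check, measurements):
    """Return True if the measured value satisfies the threshold in `check`."""
    match = CHECK_PATTERN.match(check)
    if match is None:
        raise ValueError(f"unsupported check: {check!r}")
    compare = OPERATORS[match["op"]]
    return compare(measurements[match["metric"]], float(match["threshold"]))


# Made-up measurements, as if produced by an earlier query against a table.
measurements = {"row_count": 3, "missing_count(email)": 1}
checks = ["row_count > 0", "missing_count(email) = 0"]

results = {check: evaluate_check(check, measurements) for check in checks}
print(results)
```

Here the first check passes and the second fails, because one email value is missing while the threshold allows none.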

These are among the more than 30 built-in metrics included in SodaCL:

[Image: a sample of SodaCL built-in metrics]

Said Tiago Andrade, head of big data, analytics and AI at Brazilian retailer Americanas S.A., “The modern retail environment has changed, and for organizations like Americanas to continue offering the best possible commerce experience, we are reliant on AI- and ML-powered digital engines that sit behind our retail platform.

“This platform is a dynamically changing entity which needs to be managed in real time to ensure that we’re adapting to changing conditions and not suffering from errors which impact accuracy and degrade overall performance. Soda gives us the end-to-end observability we need to be more confident about the data that is feeding our engines, meaning that instead of being reactive to issues, we can take a much more proactive approach based on an entirely accurate picture of the health of our data.”

Baeyens said Soda’s users pressed for a dedicated language for data reliability. A couple of companies had already been working on such a language.

“When you want to monitor this data in production, that means you need to build up a picture of what good data looks like, so that you can monitor for that,” he said.

“Normally, this is a terrain only reserved for engineers. They have to write code, they know how to write code, and then they have to learn the library and all that. But our focus is … expanding that also to analysts and non-technical users. So the language really allows analysts to become self-serve. They don’t have to rely on programmers anymore to write those checks. [With the language] it’s much simpler than writing code. It’s easy to read. And now a lot more people can contribute to the picture of what good data looks like.”

For instance, you can compare data sets, check the freshness of data or configure a programmatic scan to create a circuit breaker to stop the ingestion of data should a problem be detected.
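The freshness check and circuit breaker mentioned above can be sketched in a few lines of Python. This is an assumption-laden illustration of the pattern rather than Soda's API: `is_fresh`, `ingest`, `DataQualityError`, the `loaded_at` field and the one-hour limit are all hypothetical names and values chosen for the example.

```python
from datetime import datetime, timedelta, timezone


class DataQualityError(RuntimeError):
    """Raised to halt ingestion when one or more checks fail."""


def is_fresh(latest, max_age, now=None):
    """True if the newest record is no older than `max_age`."""
    now = now if now is not None else datetime.now(timezone.utc)
    return (now - latest) <= max_age


def ingest(batch, checks, load):
    """Circuit breaker: run every check first and load only if all pass."""
    failed = [name for name, check in checks.items() if not check(batch)]
    if failed:
        raise DataQualityError("checks failed: " + ", ".join(failed))
    load(batch)
    return len(batch)


# A fixed "current time" keeps the example deterministic.
NOW = datetime(2022, 7, 20, 12, 0, tzinfo=timezone.utc)

checks = {
    "not_empty": lambda rows: len(rows) > 0,
    "fresh_within_1h": lambda rows: bool(rows)
    and is_fresh(max(r["loaded_at"] for r in rows), timedelta(hours=1), now=NOW),
}

warehouse = []  # stands in for the downstream table
fresh_batch = [{"id": 1, "loaded_at": NOW - timedelta(minutes=5)}]
stale_batch = [{"id": 2, "loaded_at": NOW - timedelta(hours=3)}]

loaded = ingest(fresh_batch, checks, warehouse.extend)

try:
    ingest(stale_batch, checks, warehouse.extend)
    blocked = None
except DataQualityError as exc:
    blocked = str(exc)  # the stale batch never reaches the warehouse

print(loaded, blocked, len(warehouse))
```

Raising an exception before the load step is what makes this a circuit breaker: an orchestrator that runs the function as a task would see the failure and stop downstream steps.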

Soda Core takes two inputs: the configuration for your data sources and the checks you want to run. Both are YAML configuration files.

“It’s very easy for engineers to plug in into their Airflow or orchestration tools, very early on as data comes in,” Masschelein said.

Soda’s commercial offering is a managed cloud service that includes collaboration tools, incident management, integrations with Slack and other features.

Community Input

Soda is unrelated to the Soda Foundation, an open source data effort operated by The Linux Foundation.

Baeyens, the company’s CTO, previously created the open source projects jBPM, a JBoss-based toolkit for building business applications to help automate business processes; and Activiti, a Java-centric business process model and notation (BPMN) engine for process automation. He also created Effektif, a cloud-based business process management (BPM) solution for process automation that became SAP Signavio Process Governance.

Masschelein, the CEO, came from data governance platform vendor Collibra, which was using Baeyens’s data tools. The two connected on a community forum, and Soda launched nearly four years ago. The Brussels-based company has grown to around 40 employees.

It counts Disney, HelloFresh, Udemy and St. Jude Children’s Research Hospital among its users and open source contributors.

Disney, for instance, contributed connectors to the Trino SQL query engine, and HelloFresh is working with the company on Spark.

“So you can use this on data frames, which is also very popular,” Masschelein said. “And then in the future, we will go in the direction of streaming as well. We’ve done some early prototyping. But we want to make sure we cover the entire landscape from streaming to Spark to all SQL sources.”

Susan Hall is the Sponsor Editor for The New Stack. Her job is to help sponsors attain the widest readership possible for their contributed content. She has written for The New Stack since its early days, as well as sites...
TNS owner Insight Partners is an investor in: Udemy, Databricks.