
URL: https://thenewstack.io/iterative-ai-git-based-machine-learning-tools-for-data-engineers/

Iterative.ai: Git-Based Machine Learning Tools for ML Engineers - The New Stack



Iterative.ai: Git-Based Machine Learning Tools for ML Engineers

Iterative.ai, a San Francisco-based startup, has two products, DVC (Data Version Control) and CML (Continuous Machine Learning), which aim to bring engineering practices to data science and machine learning.
Feb 4th, 2021 9:10am by Susan Hall

While working as a data scientist at Microsoft, Dmitry Petrov decided that big, monolithic data platforms weren’t the way to go. He concluded that tools needed to be built on top of platforms, that they needed to be open source, and that machine learning engineers had particular needs that were not being met.

His solution was to create Iterative.ai, a San Francisco-based startup focused on managing machine learning models. Its two products, DVC (Data Version Control) and CML (Continuous Machine Learning), aim to bring engineering practices to data science and machine learning.

In the ever-growing ecosystem of DataOps enterprise software vendors, DVC joins the likes of TerminusDB, Dolt and Pachyderm in aiming to bring a Git-like experience to data science, but Petrov says the focus of DVC is narrow: versioning data and ML models.

In managing their data, companies initially decide they need to move it around, to colleagues’ laptops, to the cloud, to production systems, Petrov said. They need to know they’re working on the right version, especially when training a model.

“Our focus is ML modeling, ML process, so we can help people to build models, to share the model between the team, to collaborate on the model,” Petrov said.

An O’Reilly report on 2021 trends cites a lack of adequate tools for versioning data (though it calls DVC a start), as well as a lack of adequate tools for versioning models (though there it points to tools such as MLflow as a start).

Petrov said the needs of data analysts and data scientists differ from those of ML engineers, and the continuous integration/continuous delivery tools of the software engineering stack don’t necessarily meet those needs. Rather than build out a separate platform, however, he decided to build on top of GitHub, GitLab and, more recently, Bitbucket.

ML engineers, he said, tend to work with unstructured data — images, videos, text — while data scientists usually work with structured data, often from a data warehouse.

“ML engineers, they do write a code. Their models are usually complicated. They work in a team,” he said. “Data scientists and data analysts they work for usually in a relatively small project, like maybe two days, maybe one week at the best. They don’t need any advanced collaboration tool.

“ML engineers, they still need collaboration. They need GitHub for collaboration, they need this CI/CD system to resolve [issues] between each other, between the team and production system,” he said.

That’s where DVC and the CML library come in.

DVC offers a way to track changes in data, source code and ML models together to provide a single history of a project. It enables users to track the evolution of experiments, reproduce projects without model retraining and share projects.

DVC is built on top of Git. Users create lightweight metafiles that describe the ML artifacts to track, and the system uses this metadata to handle large files rather than storing them in Git. DVC relies on remote storage for large files, either in the cloud (S3, Azure, Google Cloud, etc.) or on on-premises network storage (via SSH, for example). These storage locations are treated as a key-value store, and DVC uses hardlinks/symlinks instead of copying files.
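The metafile idea can be illustrated with a minimal Python sketch. It is not DVC's implementation: the `track` and `checkout` helpers, the `.meta.json` suffix, the cache layout and the use of SHA-256 are all assumptions made for illustration. The large file moves into a content-addressed cache, a small pointer file (the part that would be committed to Git) records its hash, and the workspace copy is a hardlink where the filesystem allows it.

```python
import hashlib
import json
import os
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_or_copy(src: Path, dst: Path) -> None:
    """Hardlink src to dst, falling back to a copy if linking fails."""
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)


def cache_path(cache_dir: Path, key: str) -> Path:
    """Map a content hash to its location in the cache (hypothetical layout)."""
    return cache_dir / key[:2] / key[2:]


def track(path: Path, cache_dir: Path) -> Path:
    """Move a file into the cache, link it back, and write a small metafile."""
    key = file_hash(path)
    cached = cache_path(cache_dir, key)
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(cached))
    link_or_copy(cached, path)
    # The metafile is tiny, so it is the part that belongs in Git.
    meta = path.with_name(path.name + ".meta.json")
    meta.write_text(json.dumps({"hash": key, "path": path.name}, indent=2))
    return meta


def checkout(meta: Path, cache_dir: Path) -> Path:
    """Recreate the workspace file described by a metafile from the cache."""
    info = json.loads(meta.read_text())
    target = meta.parent / info["path"]
    if target.exists():
        target.unlink()
    link_or_copy(cache_path(cache_dir, info["hash"]), target)
    return target
```

Because the cache is keyed by content hash, checking out an older commit's metafile restores the matching version of the data without Git ever storing the large file itself.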

Versions of the data and models are stored as Git commits, enabling users to create snapshots, restore previous versions, reproduce experiments, and more. They can manage experiments with Git tags/branches and metrics tracking.

DVC defines rules and processes for collaborating as a team and for running a finished model in production. With push/pull commands, you can consistently move ML models, data and code into production or to other locales.
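As a rough illustration of push/pull against a key-value remote, the sketch below models both the local cache and the remote as plain Python dictionaries mapping a content hash to bytes. The `push` and `pull` functions are hypothetical and only mimic the idea of transferring objects the other side lacks; they are not DVC commands or DVC's API.

```python
def push(local: dict, remote: dict) -> list:
    """Copy objects missing from the remote; return the keys transferred."""
    missing = sorted(key for key in local if key not in remote)
    for key in missing:
        remote[key] = local[key]
    return missing


def pull(local: dict, remote: dict, wanted: list) -> list:
    """Fetch the wanted objects that are not yet local; return the keys fetched."""
    fetched = sorted(key for key in wanted if key not in local)
    for key in fetched:
        local[key] = remote[key]  # raises KeyError if the remote lacks the object
    return fetched


# Example: one machine pushes a model, another pulls it by hash.
laptop = {"ab12": b"model-v1", "cd34": b"dataset-v1"}
remote = {"cd34": b"dataset-v1"}
server = {}

pushed = push(laptop, remote)
pulled = pull(server, remote, ["ab12"])
```

Since objects are addressed by hash, an object that already exists on the other side is never sent twice.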

Lightweight pipelines connect versioned data sets, models and code. Pipelines are treated as first-class citizens. They are language-agnostic and connect multiple steps into a directed acyclic graph (DAG).
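To make the DAG idea concrete, here is a small Python sketch using the standard library's `graphlib` module (Python 3.9+). The stage names and dependencies are invented for illustration and do not come from DVC. It computes a valid execution order and works out which stages would need to rerun after one stage changes.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "prepare"},
}


def execution_order(deps: dict) -> list:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def downstream(deps: dict, changed: str) -> set:
    """Return the changed stage plus every stage that depends on it, transitively."""
    affected = {changed}
    grew = True
    while grew:
        grew = False
        for stage, parents in deps.items():
            if stage not in affected and parents & affected:
                affected.add(stage)
                grew = True
    return affected


order = execution_order(stages)
to_rerun = downstream(stages, "featurize")
```

Here a change to `featurize` marks `train` and `evaluate` for rerunning while `prepare` is left alone, which is the kind of selective reproduction a DAG makes possible.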


DVC can mark certain stage outputs as metrics that help users compare models and data sets across versions. The plots feature displays the metrics in visual form.
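Comparing metrics across versions can be sketched in a few lines of Python. The `metrics_diff` function and the metric values below are hypothetical; the sketch assumes each version's metrics are available as a flat dictionary of numbers, for example loaded from a JSON file produced by a training stage.

```python
def metrics_diff(old: dict, new: dict) -> dict:
    """Return old value, new value and change for every metric in either version."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name)
        after = new.get(name)
        if before is None or after is None:
            change = None  # metric exists in only one of the two versions
        else:
            change = after - before
        diff[name] = {"old": before, "new": after, "change": change}
    return diff


# Made-up metrics for two versions of a model.
v1 = {"accuracy": 0.5, "loss": 0.75}
v2 = {"accuracy": 0.75, "loss": 0.5, "auc": 0.9}

diff = metrics_diff(v1, v2)
```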

Just as DVC is an extension of Git-LFS, CML is an extension of GitLab CI/CD.

While some data engineering tools are “more focused on reliability and distributed data processing, our scenario is way more lightweight. … This is this scenario and navigation around models, when you build like 20 versions of your models, how you can find the best one? What does it mean to have a best model? Sometimes it’s not failure, sometimes it doesn’t mean like the best score or something, you need to have kind of a picture of what’s going on and how to find the best model in your repository. So this is another functionality that we need on top of Git,” Petrov said.

CML is a library to automate machine learning workflows, including model training and evaluation. With CML, you can run reports that compare the current model with the production model, spot differences from the master model or from any stage of your project history, and monitor changing datasets. It auto-generates reports with metrics and plots in each Git pull request.
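The kind of report a CI job might attach to a pull request can be sketched as a function that renders a Markdown table comparing two sets of metrics. This is only an illustration: `render_report` is a hypothetical helper, not part of CML, and the metric values are invented.

```python
def render_report(current: dict, baseline: dict, title: str = "Model report") -> str:
    """Render a Markdown table comparing current metrics against a baseline."""

    def fmt(value) -> str:
        return "n/a" if value is None else f"{value:.4f}"

    lines = [
        f"## {title}",
        "",
        "| Metric | Baseline | Current | Change |",
        "|---|---|---|---|",
    ]
    for name in sorted(set(current) | set(baseline)):
        cur = current.get(name)
        base = baseline.get(name)
        if cur is None or base is None:
            change = "n/a"  # metric is missing on one side
        else:
            change = f"{cur - base:+.4f}"
        lines.append(f"| {name} | {fmt(base)} | {fmt(cur)} | {change} |")
    return "\n".join(lines)


# Made-up numbers: a candidate model against the one in production.
report = render_report(
    current={"accuracy": 0.75, "loss": 0.5},
    baseline={"accuracy": 0.5, "loss": 0.75},
)
```

In a CI workflow, a string like this would be written to a file and posted as a comment on the pull request, so reviewers see the metric changes next to the code changes.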

In addition to its open source projects, Iterative.ai has built enterprise features, such as enhanced security, and will be unveiling a SaaS product combining collaboration and visualization on top of DVC and CML in the next month or so, Petrov said.

Feature image by Gerd Altmann from Pixabay.

Susan Hall is the Sponsor Editor for The New Stack. Her job is to help sponsors attain the widest readership possible for their contributed content. She has written for The New Stack since its early days, as well as sites...