VOOZH about

URL: https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/

⇱ Data Version Control | Tracking ML Experiments With DVC


India's Most Futuristic AI Conference Is Back – Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

Tracking ML Experiments With Data Version Control

Amit Last Updated : 24 Jun, 2021
9 min read

This article was published as a part of the Data Science Blogathon

” DVC brings agility, reproducibility, and collaboration into your existing data science workflow “

Introduction:

In a typical data science project, the scientists and machine learning experts deal with a large volume of data of various kinds. There are multiple models built with varying configurations, features, with multiple iterations of parameters tuning to get the best performing model. In such a scenario, all the changes we make to data and in the model building process need to be tracked and measured for us to understand what has worked and what has not. It also is necessary to have the option to go back to a specific version and investigate past results. One such tool which lets us track all of this is Data Version Control (DVC) which helps in governing the data, the underlying model and run reproducible results.

What is Data Version Control (DVC)?

Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that we are familiar with (Git, CI/CD, etc.). DVC is meant to be run alongside Git. The git and DVC commands will often be used in tandem, one after the other. While Git is used to storing and version code, DVC does the same for data and model files.

The .dvc file is lightweight and meant to be stored with code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.
The entire process of loading the data to measuring model metrics to tracking the model experiments is called Machine learning operations aka MLOps. If you need an example of MLOps implementation with Github then please read my previous blog – Bring DevOps To Data Science With MLOps

In this blog, we will explore and learn the following:

  • Install Data Version Control and learn about its project setup
  • Build ML model and create experiment pipelines that can be reproduced with DVC
  • Compare the metrics with previous iterations/results.
  • Generate ML metrics plots
  • Use Github to add, commit, and push commands

Any Prerequisites?

To understand MLOps & DVC, you will need very basic knowledge of model building in python and a GitHub account. I will be using Visual Studio Code as editor; you can use any editor of your choice as long as you are comfortable with it.

Getting Started:

We will be using Kaggle’s South Africa Heart Disease dataset. Here is the data dictionary for reference. Our objective is to predict chd i.e., coronary heart disease (yes=1 or no=0). A simple binary classification model.
  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low-density lipoprotein cholesterol
  • famhist: family history of heart disease (Present=1, Absent=0)
  • typea: type-A behavior
  • alcohol: current alcohol consumption
  • age: age at onset
  • chd: coronary heart disease (yes=1 or no=0)

Here is a quick snap of basic stats from our data:

Project Setup: 

To start with, our project folder structure is as below. As we process the files in the later sections, each of these folders will have respective files in them.

+—data
| +—processed –> This will have 2 files train and test after split
| —raw –> This will have the raw data file
+—data_given –> The data from the source either from the local machine or external source
+—report –> A place holder for JSON files.
+—saved_models –> The model pickle file will be saved
—src –> Four python files: get data, load data, splitting data, and Model building

 

Configurations:

The configuration plays a key role in setting up a workflow and also to keep the code clean and improve readability. We will use the YAML file format to define our configurations in stages. To know more about configs, please read my previous blog Reproducible ML Reports Using YAML Configs

Create a file by name param.yaml and have the below configuration defined:

base:
 project: SAHeart-project
 random_state: 999
 target_col: chd
.........
split_data:
 train_path: data/processed/train_SAHeart.csv
 test_path: data/processed/test_SAHeart.csv
 test_size: 0.25
model_dir: saved_models

Please refer to the complete code on GitHub

Fetching Data & Pre-processing:

Let’s create a python file by the name get_data.py where we will load the data and carry out basic data processing. The processed data will be saved in the data_given folder which we had created in an earlier section

...........
def read_params(config_path):
 with open(config_path) as yaml_file:
 config = yaml.safe_load(yaml_file)
 return config
def get_data(config_path):
 config = read_params(config_path)
 data_path = config["data_source"]["s3_source"]
 df = pd.read_csv(data_path, sep=",", encoding='utf-8')
 df = pd.get_dummies(df, columns = ['famhist'], drop_first=True)
 df.drop("sbp",axis=1, inplace=True)
 return df
..........

Please refer to the complete code on GitHub

Loading Data:

We will create a python file load_data.py and load the data from our previous stage and save it in ./raw folder which we had created earlier

.......
def load_and_save(config_path):
 config = read_params(config_path)
 df = get_data(config_path)
 df = df.dropna()
 new_cols = [col.replace(" ", "_") for col in df.columns]
 raw_data_path = config["load_data"]["raw_dataset_csv"]
 df.to_csv(raw_data_path, sep=",", index=False, header=new_cols)
.......

Please refer to the complete code on GitHub

Splitting Data:

Our next stage would be to split the data into train and test which is an essential step before model building. There will be two files created train_SAHeart.csv and test_SAHeart.csv. These two files will be saved under data/processed folder

.............
 train, test = train_test_split(
 df, 
 test_size=split_ratio, 
 random_state=random_state
 )
 train.to_csv(train_data_path, sep=",", index=False, encoding="utf-8")
 test.to_csv(test_data_path, sep=",", index=False, encoding="utf-8")
.............

Please refer to the complete code on GitHub

Model Building:

This is the last of four stages – the model building where we will build a binary classification model using Logistic Regression. Lets create a python file by name train_and_evaluate.py with the below code. In the end, the model in the pickle format is saved under the/saved_models folder.

...........
def train_and_evaluate(config_path):
 config = read_params(config_path)
 test_data_path = config["split_data"]["test_path"]
 train_data_path = config["split_data"]["train_path"]
 random_state = config["base"]["random_state"]
 model_dir = config["model_dir"]
 target = [config["base"]["target_col"]]
 train = pd.read_csv(train_data_path, sep=",")
 test = pd.read_csv(test_data_path, sep=",")
 train_y = train[target]
 test_y = test[target]
 train_x = train.drop(target, axis=1)
 test_x = test.drop(target, axis=1)
 # Build logistic regression model
 model = LogisticRegression(solver='saga', random_state=0).fit(train_x, train_y)
 # Report training set score
 train_score = model.score(train_x, train_y) * 100
 print(train_score)
 # Report test set score
 test_score = model.score(test_x, test_y) * 100
 print(test_score)
.............

Please refer to the complete code on GitHub

At this stage, our project structure is as below. The files highlighted are the ones that we generated from our previous four stages.

+---data
| +---processed
| | test_SAHeart.csv
| | train_SAHeart.csv
| ---raw
| SAHeart.csv
+---data_given
| SAHeart.csv
+---report
+---saved_models
| model.joblib
---src
 | get_data.py
 | load_data.py
 | split_data.py
 | train_and_evaluate.py

DVC Setup:

If you don’t have DVC Install then install as below. For more information refer install DVC

pip install dvc

Getting started with DVC:

Assuming you have already installed DVC, initiate DVC as below. This is very similar to initiating git.

dvc init

Mapping remote storage: We need to tell DVC where our remote storage is on local machine.

dvc add . /data_given/SAHeart.csv

Note: DVC supports many cloud-based storage systems, such as AWS S3 buckets, Google Cloud Storage, and Microsoft Azure Blob Storage. You can find out more in the official DVC documentation for the dvc remote add command. 

We have successfully installed DVC and mapped our dataset to bed tracked & versioned by DVC.

Create metrics file: Let’s create JSON file under-report folder by name scores.json where we will write some of the key results we need to track. Remember, in our last section, we had created a report folder but didn’t have any file under it.

DVC pipeline:

DVC pipelines are effectively version-controlled steps in a typical machine learning workflow (e.g. data loading, pre-processing, data splitting, model building, metrics measurement, etc.). we will define the dependencies and outputs. Under the hood, DVC is building up a Directed Acyclic Graph (DAG) which is a fancy way of saying a series of steps that can only be executed in one direction.

dvc dag

Let’s create a YAML file by the name dvc.yaml where we define all the stages of the pipeline. In the below sample code we are defining the command that needs to execute with all the dependent files and also the output which in this case is SAHeart.csv file to be placed under /data/raw folder

stages:
 load_data:
 cmd: python src/load_data.py --config=params.yaml
 deps:
 - src/get_data.py
 - src/load_data.py
 - data_given/SAHeart.csv
 outs:
 - data/raw/SAHeart.csv

We will define the remaining stages on similar lines as below:

.......
 split_data:
 cmd: python src/split_data.py --config=params.yaml
 deps:
 - src/split_data.py
 - data/raw/SAHeart.csv
 outs:
 - data/processed/train_SAHeart.csv
 - data/processed/test_SAHeart.csv 
 train_and_evaluate:
 cmd: [python src/train_and_evaluate.py --config=params.yaml, dvc plots show]
 deps:
 - data/processed/train_SAHeart.csv
 - data/processed/test_SAHeart.csv 
 - src/train_and_evaluate.py
 metrics:
 - report/scores.json:
 cache: false
.......

Please refer to the complete code on GitHub

At this stage, we have our model built, DVC installed and setup, defined pipeline with dvc.yaml file. Lets run the pipeline to see if all the stages are executed sequentially as defined.

dvc repro
## Here is the output
'data_givenSAHeart.csv.dvc' didn't change, skipping
Running stage 'load_data':
> python src/load_data.py --config=params.yaml
Updating lock file 'dvc.lock'
Running stage 'split_data':
> python src/split_data.py --config=params.yaml
Updating lock file 'dvc.lock'
Running stage 'train_and_evaluate':
> python src/train_and_evaluate.py --config=params.yaml
.....
.....
To track the changes with git, run:
 git add dvc.lock
Use `dvc push` to send your updates to remote storage.

Ahh !!. All our stages are executed successfully. I have trimmed the output to keep it clean and precise. The dvc also automatically creates a file by name dvc.lock which tracks all the changes. Lets take a look at the content of this file.

schema: '2.0'
stages:
 load_data:
 cmd: python src/load_data.py --config=params.yaml
 deps:
 - path: data_given/SAHeart.csv
 md5: d75f04589cb6fb8dce5772ed762a5621
 size: 22761
 - path: src/get_data.py
 md5: cd5029054879f4cdebb4f105ef891912
 size: 774
 - path: src/load_data.py
 md5: fd8ed8a4e1d3a894523c33f274a42ba1db
 size: 693
 outs:
 - path: data/raw/SAHeart.csv
 md5: 9bcb98547eea864e13d075f2f0e17cea9
 size: 19013

If you observe, the structure of the file is similar to dvc.yaml but the content is different. It has created an identifier to track every single change that goes in the pipeline.

Model Metrics:

As our pipeline has completed its task successfully, lets look at some of the model metrics that are of our interest. Remember, we had created report/scores.json file in the earlier section?. This file would have the metrics data and it was populated with data during the model building stage. Let’s take a look.

dvc metrics show
## Here is the output
Path Logistic Accuracy roc_auc test_score train_score
reportscores.json 0.62069 0.65093 62.06897 71.96532

Note: Our focus is not on model metrics or on improving it further but on building a reproducible pipeline with DVC to help us track our experiment.

Let’s push the code to github before proceeding further.

git add . && git commit -m "experiment with dvc" && git push origin main

Metric Plots:

As we have built a binary classification model, it will be good to see the confusion matrix. DVC provides plot templates and in this case, we will use the confusion matrix template. We will also set the x (Actual value from test data i.e. target (chd)) and y axis (predicted value from our model).

dvc plots show cm.csv --template confusion -x chd -y Predicted
## Here is the output. The confusion matrix is generated in HTML and saved in current directory
file://C:/Desktop/Versioning-ML-Models-datasets-With-DVC/plots.html

Let’s now experiment by changing some of the parameters and re-running dvc pipeline.

Here are the changes:
param.json change the test size from 0.25 to 0.3
train_and_evaluate.py change solver from ‘saga’ to ‘lbfgs’

Let’s run the DVC pipeline and also check metrics.

dvc repro
dvc metrics show
## Here is the output
Path Logistic Accuracy roc_auc test_score train_score
reportscores.json 0.65517 0.72764 65.51724 74.27746

Well, we have some improvement compared to the last iteration. Let’s compare it with DVC.

dvc metrics diff
## Here is the comparision of both the iterations
Path Metric Old New Change
reportscores.json Logistic Accuracy 0.62069 0.65517 0.03448
reportscores.json roc_auc 0.65093 0.72764 0.07671
reportscores.json test_score 62.06897 65.51724 3.44828
reportscores.json train_score 71.96532 74.27746 2.31214

Let’s check confusion matrix as well for this iteration:

dvc plots show cm.csv --template confusion -x chd -y Predicted

We just wrote one line of code to reproduce the entire pipeline, that’s the simplicity DVC brings to ML experiments. 

Closing Note:

You have learned how to use Data Version Control to create experiments!. For every experiment you run, you can version the data you use and the model you train. The best part is you can place your high-volume dataset on remote(cloud) and use DVC to point to the data source saving disk space on your local and on Github. And moreover, the experiments are reproducible, and anyone can repeat what you’ve done.

I hope this article was useful. Good luck with all your experiments !!!!

You can connect with me Linkedin

You can find the code for reference – Github

References

https://dvc.org/

https://unsplash.com/

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

I am a Data Science enthusiast with experience in building predictive models, data processing, and data mining algorithms to solve challenging business problems. Involved in open source community and passionate about building data apps.

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
👁 Av Logo White

Continue your learning for FREE

Forgot your password?
👁 Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

👁 Popup Banner
👁 AI Popup Banner