In the rapidly evolving landscape of data science and machine learning, data versioning has become a crucial practice. As datasets grow in size and complexity, keeping track of changes, ensuring reproducibility, and maintaining data integrity are essential tasks. This article delves into the concept of data versioning, its importance, use cases, and applications, focusing on how it can be efficiently managed using Python.
Data versioning is the process of tracking and managing changes to datasets over time, similar to how version control systems manage source code. It involves creating snapshots of datasets at different points, including metadata like timestamps and change logs, to maintain a historical record and allow users to revert, compare, and understand data evolution.
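To make the idea of snapshots with metadata concrete, here is a minimal sketch in plain Python. It is not how DVC works internally. It only illustrates the concept by recording a content hash, a timestamp, and a change message for a file each time you take a snapshot. The function names and the JSON log layout are assumptions made for this illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_history(log_path: Path) -> list:
    """Load the snapshot log, or return an empty list if none exists yet."""
    return json.loads(log_path.read_text()) if log_path.exists() else []


def record_snapshot(data_path: Path, log_path: Path, message: str) -> dict:
    """Append one snapshot entry (hash, size, timestamp, message) to the log."""
    entry = {
        "file": data_path.name,
        "sha256": file_sha256(data_path),
        "size_bytes": data_path.stat().st_size,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    }
    history = load_history(log_path)
    history.append(entry)
    log_path.write_text(json.dumps(history, indent=2))
    return entry


def has_changed(data_path: Path, log_path: Path) -> bool:
    """Return True if the file differs from the most recent snapshot."""
    history = load_history(log_path)
    if not history:
        return True  # nothing recorded yet, so treat the file as new
    return history[-1]["sha256"] != file_sha256(data_path)
```

Comparing hashes rather than file names or modification times means a snapshot identifies the exact bytes of the dataset, which is the property a tool like DVC relies on as well.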
Reproducibility: Reproducibility is a cornerstone of scientific research and data analysis. Data versioning ensures that datasets used in experiments, analyses, or machine learning models can be precisely replicated. By maintaining versions of datasets, researchers can trace back to the exact data that produced specific results, enabling others to validate findings and build upon previous work.
Open your terminal and run:
pip install dvc
Then initialize Git and DVC, in that order:
git init
dvc init
You need a folder that holds all the data used by the project. Here we assume it is named data.
To add a data file to DVC, run the command below, where .file stands for the file's extension:
dvc add <path/to/data.file>
As soon as you run this command, DVC creates a matching .dvc file next to the data file.
git add <path/to/data.file>.dvc <path/to/data/folder>/.gitignore
This command stages the .dvc file and the .gitignore in Git.
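The .dvc file is a small text pointer that Git tracks in place of the data itself. As a rough sketch, the Python below checks whether a local data file still matches the hash recorded in its pointer. It assumes the pointer contains a line of the form md5: followed by 32 hex characters, and that this value is the plain MD5 of the file's bytes. That matches single-file outputs in recent DVC versions as far as I know, but it may not hold for directories or older releases. The function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path
from typing import Optional

# Assumed pointer layout: a line such as "- md5: <32 hex characters>".
# Directory outputs (hash ending in ".dir") deliberately do not match.
MD5_LINE = re.compile(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", re.MULTILINE)


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5(pointer_path: Path) -> Optional[str]:
    """Return the first MD5 recorded in a .dvc pointer, or None if absent."""
    match = MD5_LINE.search(pointer_path.read_text())
    return match.group(1) if match else None


def matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Return True if the data file's MD5 equals the one in the pointer."""
    expected = recorded_md5(pointer_path)
    return expected is not None and expected == file_md5(data_path)
```

A call such as matches_pointer(Path("data/da.txt"), Path("data/da.txt.dvc")) returning False would suggest the local file was edited after the last dvc add.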
Next, push the data to separate remote storage, known as a DVC remote.
We need to set up a remote for DVC, just as we do for Git.
Note: The key of the Google Drive folder must be kept secret.
To add a DVC remote named origin, run the command below:
dvc remote add --default <origin> gdrive://<Key of Folder in Gdrive where you want to store>
This syntax applies only to Google Drive storage. For other kinds of storage, check the reference below.
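The folder key is the last part of the folder's address when you open it in Google Drive in a browser. As a small convenience, the sketch below pulls that key out of a folder URL and builds the gdrive:// address used in the command above. It assumes the URL contains a /folders/ segment followed by the key, and both function names are made up for this example.

```python
from urllib.parse import urlparse


def drive_folder_id(folder_url: str) -> str:
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    # urlparse separates the query string (e.g. ?usp=sharing) from the path.
    parts = [part for part in urlparse(folder_url).path.split("/") if part]
    if "folders" not in parts:
        raise ValueError(f"No folder key found in {folder_url!r}")
    position = parts.index("folders") + 1
    if position >= len(parts):
        raise ValueError(f"No folder key found in {folder_url!r}")
    return parts[position]


def dvc_gdrive_url(folder_url: str) -> str:
    """Build the gdrive:// address expected by 'dvc remote add'."""
    return f"gdrive://{drive_folder_id(folder_url)}"
```

Because the key grants a path to your data, avoid printing it in shared logs or committing it to a public repository.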
After the remote is set up, run the command below as is:
git commit .dvc/config -m "Configured Remote"
The next step is to push the data to the remote. To authenticate with Google Drive, first install the following dependency:
pip install dvc-gdrive
Then push the data with the following command:
dvc push
The first time you push, DVC opens your default browser and asks you to authenticate with Google. Make sure to choose the same account that created the folder.
After authentication, the data is pushed. You can verify this by looking inside the Google Drive folder.
Now suppose you want to pull the data into a new project named Data Versioning Clone.
Initialize Git and DVC, in that order:
git init
dvc init
To pull data from the Git repository named project, where data is the folder containing the data files we need, run the following command. The -o flag sets where the data is stored in the local repository after it is downloaded.
dvc get git@github.com:username/project.git data/da.txt -o data/
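If you want to run the same download from a Python script, one option is to call the dvc command through subprocess. The sketch below builds the argument list for dvc get and optionally pins a Git revision with --rev, which helps when you need the exact data version behind an earlier result. The function names are hypothetical, and fetch assumes the dvc executable is installed and on your PATH.

```python
import subprocess
from typing import List, Optional


def build_dvc_get(
    repo_url: str, path: str, out: str, rev: Optional[str] = None
) -> List[str]:
    """Build the argument list for 'dvc get', optionally pinned to a revision."""
    command = ["dvc", "get", repo_url, path, "-o", out]
    if rev is not None:
        # --rev accepts a Git commit, branch, or tag of the source repository.
        command += ["--rev", rev]
    return command


def fetch(repo_url: str, path: str, out: str, rev: Optional[str] = None) -> None:
    """Run 'dvc get' and raise CalledProcessError if the command fails."""
    subprocess.run(build_dvc_get(repo_url, path, out, rev), check=True)
```

For example, fetch("git@github.com:username/project.git", "data/da.txt", "data/") would run the same command as above. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.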