URL: https://towardsdatascience.com/a-real-world-case-study-of-using-git-commands-as-a-data-scientist-e7775cccb4ba/

A Real-World Case Study of Using Git Commands as a Data Scientist

Complete with Branch Illustration

10 min read

Data Science

Photo by Praveen Thirumurugan on Unsplash

You’re a data scientist. As data science matures, software engineering practices begin creeping in. You are forced to venture out of your local Jupyter notebooks and meet other data scientists in the wild to build a great product.

To help you out with this grand mission, you can rely on Git, a free and open-source distributed version control system to keep track of what everyone is coding.

Table of Contents

1. Git commands for setting up a remote repository
2. Git commands for working on a different branch
3. Git commands for joining in collaboration
4. Git commands for coworking
5. Resolving merge conflicts
Wrapping Up

To be more concrete, let’s work with an actual project (see the end product here). And to minimize the hassle of creating one, we’ll use the famous Cookiecutter Data Science. Install cookiecutter and create a project template locally.

Fill in the prompt accordingly. In our case, it’s as follows.

project_name [project_name]: Data Science Project Example
repo_name [example_project_name_here]: ds-project-example
author_name [Your name (or your organization/company/team)]: Albers Uzila
description [A short description of the project.]: A simple data science project, template by cookiecutter
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 (1, 2, 3) [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 (1, 2) [1]: 1

Change your working directory to the ds-project-example folder by running cd ds-project-example.

1. Git commands for setting up a remote repository

You now have a local project in ds-project-example. You need to push your local project to GitHub to collaborate with other data scientists.

To do that, initialize an empty Git repo using git init. You can confirm the repo is ready by observing that there is a hidden folder named .git in your working directory or by running git status.
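As a minimal sketch of this step, the commands below run in a throwaway folder that stands in for ds-project-example. The placeholder file is an assumption made only so that git status has something to report.

```shell
set -eu
# Throwaway stand-in for the ds-project-example folder (assumption for this sketch)
project=$(mktemp -d)/ds-project-example
mkdir -p "$project/src/data"
echo "# placeholder" > "$project/src/data/make_dataset.py"
cd "$project"

git init -q           # creates the hidden .git folder
ls -d .git            # confirms that the folder exists
git status --short    # untracked entries are prefixed with ??
status_output=$(git status --short)
```

In your real project you would only need the git init and git status lines.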

Your local:

⬤ main*

As you can see, you’re working on a branch called main and have many files that are untracked by Git. You can use git add . to add all of these files to the index, also known as the "staging area", which sits between the files in your working directory and your commit history.

To record changes in the index to the local repo, use git commit. Add a message like "Set up repo with cookiecutter".
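The staging and commit steps can be sketched as follows. The sandbox folder, the placeholder file, and the user name and email are assumptions so the example runs anywhere; only git add and git commit are the commands described above.

```shell
set -eu
# Throwaway stand-in for the ds-project-example folder (assumption for this sketch)
project=$(mktemp -d)/ds-project-example
mkdir -p "$project/src/data"
echo "# placeholder" > "$project/src/data/make_dataset.py"
cd "$project"
git init -q
# Identity for this sandbox only (placeholder values)
git config user.name "Example User"
git config user.email "user@example.com"

git add .                                          # stage every untracked file
git commit -q -m "Set up repo with cookiecutter"   # record the staged snapshot
git log --oneline                                  # shows the single new commit
```

After the commit, git status reports a clean working tree.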

Your local:

⬤───⬤ main*

Now, create a remote repo at https://github.com/new and name it ds-project-example. Before you can push the local repo to remote, you need to register the remote repo in your local repo using the git remote add command.

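The git remote add command takes two arguments:

  • A remote name, for example, origin
  • A remote URL, which you can copy from your new repo’s page on GitHub

After running git remote add, the remote named origin is recorded in .git/config. Once you push, the .git/refs folder holds both your local branches (heads) and the remote-tracking branches (remotes/origin).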

Files and folders inside the .git folder | Image by author

Now, to push commits made on your local branch to the remote repo, use git push. This command takes two arguments:

  • A remote name, for example, origin
  • A branch name, for example, main

To summarize:

Your local:

⬤───⬤ main*
      origin/main

Remote:

⬤───⬤ main

The -u flag in git push sets the branch you are pushing to (origin/main) as the remote-tracking branch of the branch you are pushing (main), so Git knows what you want to do when you push/pull branches in the future.
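Here is a sketch of the whole setup sequence. To keep it runnable offline, a local bare repository stands in for the GitHub repo, so the path passed to git remote add is an assumption; with GitHub you would pass the repo URL instead.

```shell
set -eu
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"    # local stand-in for the GitHub repo
git init -q "$sandbox/ds-project-example"
cd "$sandbox/ds-project-example"
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main                          # make sure the branch is called main

git remote add origin "$sandbox/remote.git" # <remote name> <remote URL>
git push -q -u origin main                  # <remote name> <branch name>, -u sets the upstream
git branch -vv                              # shows main tracking origin/main
```

Because of -u, a plain git push or git pull on main now knows which remote branch to use.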

After doing all this, your project is now set up on GitHub:

Our remote repository on GitHub | Image by author
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

2. Git commands for working on a different branch

Your main branch should represent the stable history of your code. Create other branches to experiment with new things, implement them, and when they have matured enough you can merge them back to main.

Now, to create a new branch from local main, use git checkout with the -b flag. You can use git branch to see all available branches and which branch you are currently on.
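A minimal sketch of creating and inspecting the branch, run in a throwaway repo with a single commit (the repo contents and identity are placeholders):

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"     # placeholder identity
git config user.email "user@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main

git checkout -q -b make_dataset   # create the branch and switch to it in one step
git branch                        # the current branch is marked with *
current_branch=$(git rev-parse --abbrev-ref HEAD)
```

The new branch starts at the same commit as main, so nothing diverges until you commit on it.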

Your local:

⬤───⬤ main
      origin/main
      make_dataset*

Remote:

⬤───⬤ main

You’ve made a new local branch named make_dataset and checked it out. After adding some code on make_dataset, you’re ready to add, commit, and push the changes to a new remote branch, also called make_dataset, with remote-tracking branch origin/make_dataset. The only change you want to push is in the src/data/make_dataset.py file.
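The sketch below shows one way to push only that file. A local bare repository stands in for GitHub, and the scratch file notes.txt and the commit message are hypothetical, added to show that unstaged files stay out of the commit.

```shell
set -eu
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"    # local stand-in for the GitHub repo
git init -q "$sandbox/ds-project-example"
cd "$sandbox/ds-project-example"
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"
mkdir -p src/data
echo "# placeholder" > src/data/make_dataset.py
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git remote add origin "$sandbox/remote.git"
git push -q -u origin main
git checkout -q -b make_dataset

# Edit the file you care about, plus a scratch file you do not want to push
echo "print('making dataset')" >> src/data/make_dataset.py
echo "scratch" > notes.txt

git add src/data/make_dataset.py            # stage only this path
git commit -q -m "Update make_dataset.py"   # hypothetical commit message
git push -q -u origin make_dataset          # creates the remote branch and sets the upstream
git status --short                          # notes.txt is still untracked
```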

Your local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ make_dataset*
         origin/make_dataset

Remote:

⬤───⬤ main
    │
    └──⬤ make_dataset

You can now merge remote make_dataset to remote main by first clicking the "Compare & pull request" button on your GitHub, then following the steps.

Compare and pull request a branch | Image by author

After successfully merging, you will see something like this.

Pull request successfully merged and closed | Image by author
Your local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ make_dataset*
         origin/make_dataset

Remote:

⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

3. Git commands for joining in collaboration

You have another contributor for your project. Let’s say his name is Hiro. To get started, Hiro cloned your remote repo using git clone just before you merged remote make_dataset into remote main. He also checked out his own local branch, called train_model, from the cloned repo.
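Hiro's two steps can be sketched like this. The first half only seeds a local bare repository that plays the role of GitHub (an assumption for this sketch); with the real project he would pass the GitHub URL to git clone.

```shell
set -eu
sandbox=$(mktemp -d)

# Local bare repo standing in for the GitHub remote (assumption for this sketch)
git init -q --bare "$sandbox/remote.git"
git --git-dir="$sandbox/remote.git" symbolic-ref HEAD refs/heads/main
git init -q "$sandbox/seed"
cd "$sandbox/seed"
git config user.name "Example User"     # placeholder identity
git config user.email "user@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git push -q "$sandbox/remote.git" main

# Hiro's side: clone the remote, then branch off the cloned main
cd "$sandbox"
git clone -q "$sandbox/remote.git" hiro
cd hiro
git checkout -q -b train_model
git branch -a    # lists local branches and remote-tracking branches
```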

Your local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
      origin/main
      train_model*

Remote:

⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

After adding src/configs/config.py and editing it along with src/models/train_model.py, Hiro generates:

  1. four trained models in models directory, and
  2. a JSON file containing the performance of the ensembled model on train and validation split in reports directory.

Just to make sure, Hiro runs git status.

Just as you did before, Hiro adds, commits, and pushes the changes in his local branch to remote. However, the models directory is not included since the trained models take up a lot of space.
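One way to leave the models directory out is to name the paths you do want when staging, as in this sketch. The file names under reports and models, the file contents, and the commit message are hypothetical, and a local bare repository stands in for GitHub.

```shell
set -eu
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"    # local stand-in for the GitHub repo
git init -q "$sandbox/hiro"
cd "$sandbox/hiro"
git config user.name "Hiro"                 # placeholder identity
git config user.email "hiro@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git remote add origin "$sandbox/remote.git"
git push -q -u origin main
git checkout -q -b train_model

# Hiro's work: code, a report, and large model artifacts (placeholder contents)
mkdir -p src/configs src/models reports models
echo "n_splits = 5" > src/configs/config.py
echo "# training code" > src/models/train_model.py
echo '{"validation_score": null}' > reports/metrics.json
echo "binary blob" > models/model_1.pkl

git add src reports                        # name the paths explicitly, models/ is left out
git commit -q -m "Train ensemble model"    # hypothetical commit message
git push -q -u origin train_model
git status --short                         # models/ still shows up as untracked
```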

Your local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ train_model*
         origin/train_model

Remote:

    ┌──⬤ train_model
    │
⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

4. Git commands for coworking

You want to add something to Hiro’s work. However, you have been busy with another task for a while: moving some parts of the code in src/data/make_dataset.py into src/features/build_features.py. So, let’s talk about that first.

You started by pulling all changes from remote main into local main using git pull, so that the new branch build_features is checked out from the most recent version of main.
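A sketch of pulling first and branching second. The setup simulates the merged pull request by pushing one extra commit to a local bare repository that stands in for GitHub; that simulation and the commit messages are assumptions.

```shell
set -eu
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"    # local stand-in for the GitHub repo
git --git-dir="$sandbox/remote.git" symbolic-ref HEAD refs/heads/main

# Seed the remote with one commit on main
git init -q "$sandbox/seed"
cd "$sandbox/seed"
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git push -q "$sandbox/remote.git" main

# Your clone, taken before main moves on
git clone -q "$sandbox/remote.git" "$sandbox/you"

# Stand-in for the merged pull request: main advances on the remote
echo "dataset notes" >> README.md
git commit -q -am "Merge make_dataset"
git push -q "$sandbox/remote.git" main

cd "$sandbox/you"
git checkout -q main
git pull -q origin main              # bring local main up to date
git checkout -q -b build_features    # branch from the refreshed main
git log --oneline -n 2
```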

Your local:

⬤───⬤──────⬤ main
    │        origin/main
    │        build_features*
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ train_model
         origin/train_model

Remote:

    ┌──⬤ train_model
    │
⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

In the middle of editing the build_features branch, you want to see Hiro’s progress. But you still have 2 files in the branch that haven’t been staged for commit.

So, you use git stash to set these uncommitted changes aside and get a clean working directory. Then you can:

  1. create a local train_model branch checked out from local main,
  2. set the upstream of local train_model to origin/train_model so it can track remote train_model, and
  3. pull from the remote train_model that Hiro has created.

It’s all well and good until a problem appears in step 3 above. Since:

  1. Hiro checked out his local train_model from local main before you merged your remote make_dataset to remote main (see Section 3), and
  2. you pulled from remote main to your local main so you have the most recent version of main (see at the beginning of Section 4),

your local main is more up to date than the main Hiro branched from (it is several commits ahead). Hence you need a more elaborate way to pull the remote train_model (hint: git pull is just git fetch followed by git merge).
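One possible way to carry out the stash and the three steps, with the pull split into an explicit fetch and merge, is sketched below. The setup recreates the situation with a local bare repository in place of GitHub; file contents, identities, and commit messages are placeholders, and this is not necessarily the exact sequence the author ran.

```shell
set -eu
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"    # local stand-in for the GitHub repo
git --git-dir="$sandbox/remote.git" symbolic-ref HEAD refs/heads/main

# Setup: main, Hiro's train_model (from the older main), then a newer main
git init -q "$sandbox/seed"
cd "$sandbox/seed"
git config user.name "Hiro"
git config user.email "hiro@example.com"
mkdir -p src/data src/features
echo "# placeholder" > src/data/make_dataset.py
echo "# placeholder" > src/features/build_features.py
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git remote add origin "$sandbox/remote.git"
git push -q origin main
git checkout -q -b train_model
mkdir -p src/models
echo "# training code" > src/models/train_model.py
git add .
git commit -q -m "Train ensemble model"
git push -q origin train_model
git checkout -q main
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Merge make_dataset"
git push -q origin main

# Your clone: up-to-date main, unfinished edits on build_features
git clone -q "$sandbox/remote.git" "$sandbox/you"
cd "$sandbox/you"
git config user.name "Example User"
git config user.email "user@example.com"
git checkout -q -b build_features
echo "# moved out" >> src/data/make_dataset.py
echo "# moved in" >> src/features/build_features.py

git stash -q                                     # shelve the two modified files
git checkout -q main
git checkout -q -b train_model                   # 1. branch from local main
git fetch -q origin                              # update origin/train_model
git branch --set-upstream-to=origin/train_model  # 2. track Hiro's branch
git merge -q --no-edit origin/train_model        # 3. the merge half of git pull
git log --oneline --graph -n 4
```

The result is a merge commit on your local train_model that contains both Hiro's commit and the latest main.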

Your local:

    ┌──────⬤ origin/train_model
    │       ╲
    │        ⬤ train_model*
    │       ╱
⬤───⬤──────⬤ main
    │        origin/main
    │        build_features --> stash
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ train_model
         origin/train_model

Remote:

    ┌──⬤ train_model
    │
⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

Now, after merging the latest local main with your local train_model, you’re ready to push the changes to remote and restore your stashed changes on build_features.
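The stash half of this step can be sketched on its own. The example uses a throwaway repo and a hypothetical notes.txt file; the push works the same way as the earlier git push commands, so it is left out here.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"     # placeholder identity
git config user.email "user@example.com"
echo "v1" > notes.txt
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main
git checkout -q -b build_features
echo "work in progress" >> notes.txt    # uncommitted edit on build_features

git stash -q                    # set the edit aside
git checkout -q main            # the working tree is clean, so switching is safe
git checkout -q build_features  # come back once you are done elsewhere
git stash pop -q                # reapply the edit and drop it from the stash
git status --short              # the file shows as modified again
```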

Your local:

    ┌──────⬤
    │       ╲
    │        ╲
    │         ⬤ train_model
    │        ╱  origin/train_model
    │       ╱
⬤───⬤──────⬤ main
    │        origin/main
    │        build_features*
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ train_model
         origin/train_model

Remote:

    ┌──────⬤
    │       ╲
    │        ⬤ train_model
    │       ╱
⬤───⬤──────⬤ main
    │     │
    └──⬤──┘

You create and edit another file src/configs/config.py, stage all 3 files, commit, and push to remote.

Your local:

    ┌──────⬤
    │       ╲
    │        ╲
    │         ⬤ train_model
    │        ╱  origin/train_model
    │       ╱
⬤───⬤──────⬤ main
    │      │ origin/main
    │      │
    │      └──⬤ build_features*
    │
    └──⬤ make_dataset
         origin/make_dataset

Hiro's local:

⬤───⬤ main
    │ origin/main
    │
    └──⬤ train_model
         origin/train_model

Remote:

    ┌──────⬤
    │       ╲
    │        ⬤ train_model
    │       ╱
⬤───⬤──────⬤ main
    │     ││
    └──⬤──┘└──⬤ build_features

5. Resolving merge conflicts

After everything has been pushed to remote, we won’t use local repos anymore. So let’s focus on the remote repo. Merge train_model into main.

Merge train_model to main | Image by author

After opening a pull request and merging train_model into main, here’s what we have so far.

Remote:

    ┌──────⬤
    │       ╲
    │        ⬤ main
    │       ╱
⬤───⬤──────⬤
    │     ││
    └──⬤──┘└──⬤ build_features

Now, merge build_features into main. This time, the two can’t be merged automatically. But don’t worry, you can still create the pull request.

Merge build_features to main | Image by author

It turns out build_features has conflicts that must be resolved, and the culprit is src/configs/config.py.

A conflict between build_features and main that must be resolved | Image by author

You see the problem? Hiro added n_splits and max_features in this file for the train_model branch, which has been merged into main. However, you also added loss and learning_rate in the same file for the build_features branch. Git cannot decide on its own which changes to keep.

Resolving conflicts between build_features and main | Image by author

We want to keep all the variables since they are all useful in our project pipeline. Let’s do so and delete all unnecessary lines, such as the conflict markers.
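The story resolves the conflict in GitHub's web editor. As a rough local equivalent, the sketch below reproduces the same kind of conflict in a throwaway repo and resolves it by keeping all four variables. The variable values and commit messages are hypothetical.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"     # placeholder identity
git config user.email "user@example.com"
echo "# Data Science Project Example" > README.md
git add .
git commit -q -m "Set up repo with cookiecutter"
git branch -M main

# Hiro's branch adds config.py with his variables (hypothetical values)
git checkout -q -b train_model
mkdir -p src/configs
printf 'n_splits = 5\nmax_features = 0.8\n' > src/configs/config.py
git add .
git commit -q -m "Add training config"

# Your branch adds the same file with different variables
git checkout -q main
git checkout -q -b build_features
mkdir -p src/configs
printf 'loss = "squared_error"\nlearning_rate = 0.1\n' > src/configs/config.py
git add .
git commit -q -m "Add feature config"

git checkout -q main
git merge -q --no-edit train_model           # merges cleanly
git merge --no-edit build_features || true   # stops: both branches added config.py
git status --short                           # the conflicted file is flagged

# Resolve by keeping every variable and no conflict markers
printf 'n_splits = 5\nmax_features = 0.8\nloss = "squared_error"\nlearning_rate = 0.1\n' > src/configs/config.py
git add src/configs/config.py                # mark the conflict as resolved
git commit -q --no-edit                      # conclude the merge
```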

Committing changes after resolving conflicts | Image by author

After merging build_features into main, here’s the commit history that we have on the remote repo.

Remote:

    ┌──────⬤
    │       ╲
    │        ⬤───┐
    │       ╱    │
⬤───⬤──────⬤     ├──⬤ main
    │     ││     │
    └──⬤──┘└──⬤──┘

We are done 🙂

Wrapping Up

Photo by Mel Poole on Unsplash

I hope you learned a lot from this story. You’ve been introduced to several essential Git commands and used them in a real-world scenario of building a data science project. Here are some of the most common ones (in no particular order):

$ git add
$ git branch
$ git checkout
$ git clone
$ git commit
$ git fetch
$ git init
$ git merge
$ git pull
$ git push
$ git remote
$ git stash
$ git status

With these git commands, you can create/clone new repos, navigate through them or their branches, and collaborate with anyone on the opposite side of the world.


Written By

Albers Uzila
