Everybody involved in the computer science, software, or programming world has undoubtedly heard of Git. But do we really know what Git is, and why we should use it? What is its underlying philosophy? What is the difference between Git, GitHub, GitLab, and other platforms like Bitbucket?
If you want to learn all these things, please keep on reading. We will talk about the history of Git, its purpose and philosophy, and then get a bit more technical and go through its main functionalities and characteristics while building a real project.
Don’t feel overwhelmed: by the end of this post you will understand what Git is, why it is so important in software development, and how you can use it to your advantage. Let’s get on with it!
Introduction
Git is the world’s most popular Version Control System (VCS).
What is a VCS? Very simple: a VCS is software that helps manage the changes made to code over time.
Say you have a text document called CatImageDetector.py and you are constantly making changes to it. If used properly, the VCS will keep track of these changes, so that if you want to go back to a version older than the last one you edited (because the code isn’t working properly anymore, for example) you can easily do so.
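As a rough sketch of that scenario, the snippet below commits two versions of the file and then brings back the older one. The commands themselves are explained later in the post; the temporary folder, the file contents, the commit messages, and the Example Dev identity are all made up for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"          # throwaway folder for the experiment
git init -q
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"

echo 'print("cat detector v1")' > CatImageDetector.py
git add CatImageDetector.py
git commit -q -m "Working version"

echo 'print("cat detector v2, broken")' > CatImageDetector.py
git commit -q -am "Experimental change"

git log --oneline                        # recorded versions, newest first

# Bring the file back to how it looked one commit ago
git checkout HEAD~1 -- CatImageDetector.py
cat CatImageDetector.py
```

Both versions stay in the history, so restoring the old file does not erase the experimental one.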
There is something worth highlighting from the previous paragraph: VCSs like Git are designed for tracking changes to TEXT documents. They are not meant for storing data; for that, we have many other platforms. Because of this, they sometimes have trouble keeping up with changes in formats like Jupyter Notebooks.
The philosophy behind Git contrasts with the usual local document overwriting that we do all the time, where we edit a document and only get to see its latest state, so if we want to go back to a previous version we might not be able to.
These kinds of systems (VCSs) are also very valuable when working in a group. Imagine you are working on a project with two colleagues: you are all modifying the same documents, and want this collaboration to be as effective as possible. As we will see later, Git allows you to do this very effectively (if certain guidelines are followed).
Those two, I think, are the main ideas behind Git: allowing users to manage their software projects so that the code is backed up after every desired change, and protecting the project from some of the inconsistencies that can appear when collaborating in a team.
Git also has some particularities that make it even more notable, like its distributed nature, its way of keeping track of the different versions of a project (each commit records a snapshot of the tracked files, while deltas Δ, the changes between similar versions, are used behind the scenes to store them compactly), its performance, or its security, but those are issues that will either be covered later or are out of the scope of this post.
Let’s carry on and briefly see how Git was born.
The history of Git
The history of Git is tightly linked to the history of Linux. The development of the Linux kernel, to which developers from all over the world were contributing, demanded an efficient version control system that would allow everyone to work in an ordered and systematic manner.
In 2005, the VCS that these developers were using for their work, BitKeeper, went from being free of charge to charging for its use, so the race to find a new VCS, and preferably one that was distributed, started.
The lead developer for this project, Linus Torvalds, decided after some time that if he couldn’t find the perfect VCS, fit for the task ahead, he would build his own. And that is how git was born.
You might be wondering what Git actually stands for. Strictly speaking, it is not an acronym: Linus Torvalds has offered several tongue-in-cheek explanations for the name, and one of them, which fits nicely with the task Git was created for, is ‘global information tracker’.
Cool, right? Okay, let’s get a bit more technical.
Git basics – Working with Groups
As mentioned earlier, one of the virtues of Git is that it efficiently allows us to work together with people from all over the world. For this collaboration to succeed, however, we cannot just use Git however we see fit and hope that our colleagues will understand what we have done and act accordingly; some guidelines have to be followed.
But first, before defining these guidelines, let’s describe some of the basic elements of git: repositories and branches.
Repositories and Branches
A repository, despite its lengthy name, is best understood as a folder, just like the ones on our PCs, where all the elements that we are working with are stored.
People working on the same project will be making changes, uploading documents, and downloading documents from the same repository.
Branches are a tool that facilitates self-organisation and collaborative work. Projects usually start from an initial branch called the master branch (newer setups often call it main), and from there build multiple other branches, forming a tree-like structure.
Why is this done? Imagine you are working on a project with 4 or 5 different people who are modifying the same scripts as you, but implementing different functionalities. If you all constantly upload changes to the same content on the master branch, everything is going to start looking messy pretty soon. Instead, everybody should be working on their own branch, implementing changes on that branch, and when everything is working perfectly on its own branch, these changes should carefully be merged.
Guidelines and Git-Flow
Now that we know what repos (short for repositories) and branches are, let’s look at the guidelines for their proper use. The description in the previous paragraph was just a quick example to illustrate what branches are, but using them is not as easy as creating one for each developer or feature and then whacking them all together.
We’ve already spoken about the master branch, saying that it is the root or main branch of any project, but we have not given it the importance it deserves. The master branch is where the latest working or production code of the project is stored, and it should only be touched to upload thoroughly tested code that is going to be put to actual use. Anything that is uploaded here should be ready to go into production.
How do we build code then, if we can’t touch the main branch? Easy: we make another branch, usually called develop, which is where we actually DEVELOP our code, and where we constantly make changes.
Wait, what about the situation described above, i.e., many people working on different functionalities of the same code base? You can probably guess the answer: we make branches that bifurcate from develop and work there, uploading the changes into develop once we have finished our work and checked that there are no conflicts with the code from the other developers.
However, when there are only a few developers touching the same code on the develop branch, there is probably no need to create multiple branches from it. Also, sometimes we might want to create a branch starting from a certain point in a repository’s history, for example to develop new features for a different version or release of a product.
When the code from develop has been tested and tested and tested and yes, tested again, then we can upload the changes to our master branch.
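A minimal command-line sketch of this branch layout might look like the following. Only master and develop come from the description above; the feature/login branch, the file names, and the commit messages are hypothetical, and the commands themselves are covered in the next section.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial production code"
git branch -M master                     # make sure the first branch is named master

git checkout -q -b develop               # day-to-day work starts from develop
git checkout -q -b feature/login         # one branch per functionality, off develop
echo "login" > login.txt
git add login.txt
git commit -q -m "Add login feature"

git checkout -q develop
git merge -q feature/login               # finished work goes back into develop

# ...test develop thoroughly, then release it...
git checkout -q master
git merge -q develop                     # only tested code reaches master
git log --oneline
```

Until the last merge, master still contains only app.txt, which is exactly the protection the guidelines are after.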
The most commonly accepted way to do this is what is called Git-Flow.
Now that we know how to efficiently work with branches, let’s briefly see the main git commands and then showcase an actual example.
Git commands
I am not going to describe all of the git commands with their outputs, just the ones that you 100% need to know and be comfortable with in order to be able to use git to manage the code of a project.
- status:
git status
This command displays the current status of your repository: which files have been modified, which changes have been added (staged) for the next commit, and which files Git is not tracking yet.
- branch:
git branch
Returns the branches of our repository, highlighting the one we are currently working on.
- add:
git add document.py
Tells Git that you want to include the changes made in document.py in the next commit.
- commit:
git commit -m "message"
The commit command records the changes you have added (using the previous command) to your local repository. The message usually describes the changes included in the commit.
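Put together, these four commands could be tried out like this. It is only a sketch in a throwaway folder: the file content, the commit message, and the Example Dev identity are placeholders, and the --short flag is added here just to keep the status output compact.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"

echo 'print("hello")' > document.py
git status --short        # "?? document.py": Git sees the file but is not tracking it
git add document.py
git status --short        # "A  document.py": staged for the next commit
git commit -q -m "Add document.py"
git status --short        # prints nothing: there are no pending changes
git branch                # lists the branches, marking the current one with *
```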
We have mentioned throughout the post the distributed nature of Git, which is one of its most precious features, but we have not really described what this means. It is called distributed because every Git working directory contains a full-fledged repository with the complete history of the project.
You usually have a remote repository (which you can configure) that you upload code to and download code from, but Git doesn’t oblige you to have one.
- push:
git push
This command sends the changes you have made in your local repo (via commits) to the remote repository you have configured.
- pull:
git pull
Pull is the sibling command of push. It downloads the new changes from the remote repository and merges them into the branch you are on in your local repo.
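To see push and pull working together without needing an account anywhere, here is a small sketch in which a bare repository in a temporary folder plays the role of the remote. The alice and bob folders, the identities, and the commit messages are hypothetical; with a real hosted remote only the address would change.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"          # stand-in for a hosted remote

git clone -q "$work/remote.git" "$work/alice"  # first developer's local repo
cd "$work/alice"
git config user.name "Alice"                   # placeholder identity
git config user.email "alice@example.com"
echo 'print("hello")' > document.py
git add document.py
git commit -q -m "Add document.py"
git push -q -u origin HEAD                     # first push: also links the branch to the remote

git clone -q "$work/remote.git" "$work/bob"    # a colleague gets their own copy

echo 'print("hello again")' > document.py
git commit -q -am "Update greeting"
git push -q                                    # later pushes need no extra arguments

cd "$work/bob"
git pull -q                                    # download and merge Alice's new commit
cat document.py
```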
Lastly, we will see the commands that allow us to work with branches:
- checkout:
git checkout branch
This is how we switch from the branch that we are on to another one. The branch in this command is the branch that we want to switch to. The checkout command can also be used to create new branches: using the command
git checkout -b newly_created_branch
we create a new branch and switch to it.
- merge:
git merge branchToMergeFrom
I’ve been saying upload, but the actual word for joining two branches is merge. This means bringing the changes you have made in another branch into the branch that you are on. The usual way to do this is the following: imagine we want to merge our changes from develop into master. First, we must checkout master (if we are on develop), and then run
git merge develop
to pass the changes made in our develop branch to our master branch. Remember this must only be done once the code is truly ready to be used and has been tested.
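A short sketch of these two commands in action is shown below. The notes.txt file, its contents, and the commit messages are made up for the example; newly_created_branch is the name used above.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
echo "base" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
git branch -M master                     # make sure the first branch is named master

git checkout -q -b newly_created_branch  # create a branch and switch to it
echo "extra" >> notes.txt
git commit -q -am "Add a line on the new branch"

git checkout -q master                   # switch back: notes.txt has one line again
git merge -q newly_created_branch        # bring the new commit into master
git branch --merged                      # branches whose work is already in master
```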
Those are the main commands you need to know to be chill when using git. Now that you know them, let’s explore the differences between git and the various services that use it!
Git vs GitHub and other platforms
What are GitHub, GitLab, Bitbucket, Azure Repos, and all those then? What is their relationship with Git?
We have mentioned a couple of times throughout this post that one of the key characteristics of Git is its decentralised nature. This means that a group of users who have Git on their local machines, know the addresses of the other users’ machines, and have set up HTTP or SSH access with the corresponding credentials can push to or pull from the repos on their buddies’ local machines.
However, for convenience, what we would normally do if we only used Git to manage collaboration is set up a remote server and have everybody push to and pull from there. Wouldn’t this make Git centralised? Well, technically no, as this remote server wouldn’t constitute a single point of failure: if the code is on the remote server, it means it has also been on one of the users’ local machines, and therefore we can recover it from there.
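The recovery argument can be checked with a small experiment. In this sketch the "server" is just a bare repository in a temporary folder; we delete it and rebuild it from a local clone. The folder names, the file, and the commit messages are hypothetical.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/server.git"            # the shared remote server
git clone -q "$work/server.git" "$work/laptop"   # one user's local machine
cd "$work/laptop"
git config user.name "Example Dev"               # placeholder identity
git config user.email "dev@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "First"
echo "two" >> file.txt
git commit -q -am "Second"
git push -q -u origin HEAD

rm -rf "$work/server.git"                        # simulate losing the server
git init -q --bare "$work/new-server.git"        # set up an empty replacement
git remote set-url origin "$work/new-server.git"
git push -q origin HEAD                          # the laptop holds the full history

git -C "$work/new-server.git" log --oneline --all
```

Both commits show up on the new server, because the local clone was a complete repository all along.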
What services like GitHub, GitLab, or Bitbucket provide is just an efficient, organised, and easy way to do this, while also adding functionalities like pull requests and a shiny web-based user interface. In this case, Git (a VCS) is the tool that is being used by various cloud-based services like GitHub. Git is an open-source tool to manage source code, while GitHub is an online service that Git users can connect to in order to upload or download resources. Git is the core, and GitHub and the other services provide layers around it.
To get your code to GitHub, have a look here. Finally, let’s look at a quick, simple example of how to use Git to manage some code and its workflow on GitHub.
Git example using Github
Okay, now that we have covered all the theory, let’s get practical and see a very simple example of the use of GitHub. Many modern editors and IDEs like Sublime Text, Visual Studio, or PyCharm have built-in Git integration, so that we can easily upload our code to one of the platforms we mentioned earlier. In our case, we will be using PyCharm.
First, we will create a new project and give it the name we want for our repo: Git_forNoobs. Then, in a script called Git_formedium.py, we will write the following code:
print("Hello Medium!")
Save it, then add it to our local Git repo and make the first commit with the message "First commit of the Git_forNoobs repo", as shown in the following video.
However, if we now try to push to a remote repo, we will get an error, as we have not configured one. First we need to go into the VCS option of our project and create a GitHub repository. Once we have done this, the tool will probably push this first commit for us, and on GitHub we will see that a new repository has been created, as shown in the following figure.
Okay, now that we have created our repo and uploaded our first file, let’s create a new branch and make some changes. First, let’s be a bit more extroverted and change the code to the following:
print("Hello Internet!")
After this, let’s create a new branch called develop (by default, when creating a new repo, the first branch will be called master) using git checkout -b develop and add our file Git_formedium.py to it.
Lastly, let’s commit and then push these changes to GitHub.
As you can see, the push command gave us an error, as we did not have a branch called develop in our remote repo (it had only been created locally), and Git did not know where to put our code. Using git push --set-upstream origin develop we tell it to push the changes to a new branch in the GitHub repo called develop. Let’s check out the changes!
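For readers following along in a terminal rather than in PyCharm, the steps so far could be reproduced roughly as follows. This is a sketch under a few assumptions: a bare repository in a temporary folder stands in for the GitHub repo, and the second commit message and the Example Dev identity are invented.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/github.git"       # local stand-in for the GitHub repo
git init -q "$work/Git_forNoobs"
cd "$work/Git_forNoobs"
git config user.name "Example Dev"          # placeholder identity
git config user.email "dev@example.com"
git remote add origin "$work/github.git"

echo 'print("Hello Medium!")' > Git_formedium.py
git add Git_formedium.py
git commit -q -m "First commit of the Git_forNoobs repo"
git branch -M master                        # make sure the first branch is named master
git push -q --set-upstream origin master

git checkout -q -b develop
echo 'print("Hello Internet!")' > Git_formedium.py
git add Git_formedium.py
git commit -q -m "Say hello to the Internet"

# A plain push fails: the new local branch is not linked to any remote branch
git push 2>/dev/null || echo "push failed: develop has no upstream branch"

# Create develop on the remote and link the local branch to it
git push -q --set-upstream origin develop
git branch -r                               # remote branches: origin/develop, origin/master
```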
Now we have two branches, master and develop, and they contain different code, as our second commit was only pushed to develop and not to master.
Let’s make one final change to our script, and push it to develop too. Now we have gone full extrovert mode, and want to say hi to everybody. The code will now be one of the most famous lines in the programming community:
print("Hello world!")
Now, as always, we add, commit, and push, in that order.
In the following figure we can see the final state of our code in the develop branch (top) and all the commits we have made to our repo, regardless of the branch they were made on (bottom).
As a final step, let’s let the world know that we are a more extroverted person now, and bring the changes made in the develop branch into master. This is done via the checkout command (to move to master) and then merging develop, as shown in the following image.
A merge only changes our local repo, just like any other commit, so after it we still have to push to our remote repo using git push. Now, if we look at the master branch, we have the same code that we had in develop! Cool, right?
Conclusion
Git is an insanely useful tool for software development. We have seen what it is for, what it should not be used for, its philosophy, its origin, and its differences with the services that use it.
For more posts like this one follow me on Medium, and stay tuned!
That is all, I hope you liked the post. Feel free to connect with me on LinkedIn or follow me on Twitter at @jaimezorno. Also, you can take a look at my posts on Data Science and Machine Learning here. Have a good read!
Additional resources
In case you want to learn a little bit more, clarify your learning from this post, or go deeper into the subject, I have left some information here which I think could be of great use.
What is Git: become a pro at Git with this guide | Atlassian Git Tutorial
- Post on how to get started with git
Enjoy and feel free to contact me with any doubts!