This article is part of the Big Data in 30 Hours lecture series and is intended to serve as reference material for students. However, I hope others can also benefit.
Why do we need version control in Data Science?
Working with data is similar to working with software. A Data Scientist developing source code and data for models needs the same basic tooling that regular software developers use.
The essential programmer tools are just four: an editor or IDE, a version control system, a unit test framework, and a deployment method. For instance, a Python programmer is likely to use, respectively: PyCharm, Git, the built-in unittest library, and Docker. Of course there are more options.
(In bigger projects you will additionally need project management software; a frequent choice here is the Atlassian product family, including Jira, Trello and Confluence.)
During the course of a project, one needs to create successive versions of files (data, software, or metadata such as parameters for experiments). Beginners will overwrite old files with new ones. That way the project folder remains clean, with just the newest version of the work available. However, the previous versions are then lost, and not being able to get back to an earlier state of your work is not acceptable.
To cope with this problem, you might find yourself filling up the project folder with files looking perhaps like this: model, model0, model-v1, model-v1-corrected, model-v1-corrected2, model-v1.02-clean etc. This quickly becomes an unbearable mess. A version control system (such as Git) helps you manage this file versioning hell. It allows you to safely overwrite the older versions of the files, so only the newest versions remain in your folder. To get back to an older version of your work, you retrieve it from the Git repository. In that sense, a version control system works much like a differential backup: it is a specialized backup system, tailor-made for the needs of software and data professionals.
There are many version control systems out there. In the past, programmers used CVS and Subversion (svn). Today, Git seems the most popular choice. While learning Git, keep in mind that other similar systems exist.
The first Git session
Git works almost the same under Windows, Linux or Mac.
The instructions below are for Windows 10 WSL, but should work just the same elsewhere. After downloading and installing Git locally on your computer, perform the basic config. At minimum, you need to tell Git who you are so the file revisions can be authored properly:
$ git config --global user.email "firstname.lastname@example.org"
$ git config --global user.name "Pawel Plaszczak"
Now enter the folder that you want to keep under version control. As an example, here is my repository for training purposes. It includes one Python script and a subdirectory with four text files. I want to keep all these files under version control:
$ ls -lR
drwxrwxrwx 1 pawel pawel 512 Dec 30 09:51 books
-rwxrwxrwx 1 pawel pawel 707 Dec 30 10:00 write-slow.py

./books:
-rwxrwxrwx 1 pawel pawel 169655 Dec 30 09:51 debello.txt
-rwxrwxrwx 1 pawel pawel 2198927 Dec 30 09:51 donquijote.txt
-rwxrwxrwx 1 pawel pawel 256865 Dec 30 09:51 tadeusz.txt
-rwxrwxrwx 1 pawel pawel 481101 Dec 30 09:51 trial.txt
To initialize the repository, use init. To add files to the repository, use add. To commit files, use commit -m "message". Always provide a descriptive message that will be stored with the commit. Here is what I did to add all files in the folder to the repository:
$ git init
Initialized empty Git repository in /mnt/d/swriteslow/.git/
$ git add *
$ git commit -m "version zero"
[master (root-commit) 08c30fd] version zero
 5 files changed, 53719 insertions(+)
 create mode 100644 books/debellogallico.txt
 create mode 100644 books/donquijote.txt
 create mode 100644 books/tadeusz.txt
 create mode 100644 books/trial.txt
 create mode 100644 write-slow.py
That is really all there is to it. Your files are now under version control. You can safely continue your work, simply overwriting the old work with the new. There is no need to keep separate copies of each version. Just remember to run git add on every new or changed file, followed by git commit to record the changes. Git will keep the history of your work for you.
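To make the edit-and-commit cycle concrete, here is a minimal sketch that runs in a throwaway directory. The file name notes.txt, its contents and the user name are hypothetical, and the identity is configured locally so your global settings are not touched. It overwrites a file in place, commits twice, and then reads the older version back from the repository:

```shell
# Throwaway repository; all names and contents below are for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "firstname.lastname@example.org"   # local to this repository
git config user.name "Demo User"                         # hypothetical name

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add first draft of notes"

# Overwrite the file in place; no notes-v2.txt is needed.
echo "second draft" > notes.txt
git add notes.txt
git commit -q -m "rewrite notes"

git log --oneline            # lists both commits, newest first
git show HEAD~1:notes.txt    # prints the file as it was one commit ago
cat notes.txt                # prints the current content
```

Here git show HEAD~1:notes.txt is only one of several ways to look at an older version; it prints the old content without changing anything in the working directory.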
Below I will discuss a few additional terms, commands and explanations to help you get comfortable with what you are doing.
What files should you keep under version control?
The files you just added to Git are called tracked files. The remaining files, which Git does not know about, are untracked. We have added all the files, so everything is tracked. Good practice: unless you have a good reason, keep your local directory clean, with all the files tracked and committed. What is a good reason not to commit a file? For instance, I would not commit a temporary output or log file that adds no value to my work. If a file gets regenerated in the directory every time you run your software, you probably don’t want it in the repository. Another reason is specific to Data Science: big, often binary files that will not change, such as raw source data of 30+ megabytes. Putting this sort of data under version control might take time and add little benefit. If possible, avoid those situations: get rid of the trash, and if you have raw data, keep it in a separate folder. Commit everything that is left in the project folder.
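When such files cannot be moved out of the project folder, Git's standard mechanism for leaving them untracked is a .gitignore file, which this article does not otherwise cover. The sketch below is only an illustration; the names run.log, raw-data/ and model.py are hypothetical:

```shell
# Throwaway repository; file and folder names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Patterns for files that Git should leave alone.
cat > .gitignore <<'EOF'
*.log
raw-data/
EOF

mkdir raw-data
echo "big raw file"   > raw-data/source.csv
echo "temporary log"  > run.log
echo "print('hello')" > model.py

# Lists .gitignore and model.py as untracked (??),
# but neither run.log nor anything under raw-data/.
git status -s
```

The .gitignore file itself is normally committed, so the same rules apply to everyone who works with the repository.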
Other useful concepts and commands
Some users find it frustrating that committing a file always takes two commands: add and commit. Why is that?
The local directory which has been put under Git version control is now called the working directory. The location where Git stores the repository that includes all the history and metadata is called the Git directory. In between the two there is also the Staging Area that holds temporary data.
A file can be in one of three states: modified (the file has only been changed locally), staged (the file has been marked for commit) and committed (the latest version of the file is safely stored in the local repository). If you add without a commit, the files end up in the temporary staged state: scheduled for commit but not yet committed.
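The states can be watched one by one with the short form of git status. This is a minimal sketch in a throwaway directory; the file name model.txt and the user name are hypothetical:

```shell
# Throwaway repository; names are for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "firstname.lastname@example.org"   # local to this repository
git config user.name "Demo User"                         # hypothetical name

echo "v1" > model.txt
git status -s            # ?? model.txt   (untracked)

git add model.txt
git status -s            # A  model.txt   (staged, not yet committed)

git commit -q -m "add model"
git status -s            # no output      (committed, nothing left to do)

echo "v2" > model.txt
git status -s            #  M model.txt   (modified, not staged)

git add model.txt
git status -s            # M  model.txt   (modified and staged)
```

The position of the letter matters: the first column describes the staging area, the second column describes the working directory.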
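If this is confusing, use the short form commit -a, which performs add and commit in one step. Note that it only covers files Git already tracks; a brand-new file still needs an explicit git add first. The short form looks like this: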
git commit -a -m "just a few changes"
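One caveat is easy to miss: commit -a stages changes only to files that Git already tracks. The following sketch, with hypothetical file names, shows a new file being left behind:

```shell
# Throwaway repository; names are for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "firstname.lastname@example.org"   # local to this repository
git config user.name "Demo User"                         # hypothetical name

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "add tracked file"

echo "v2"  > tracked.txt       # change to a file Git already tracks
echo "new" > brand-new.txt     # a file Git has never seen

git commit -q -a -m "just a few changes"

git status -s                  # ?? brand-new.txt  (still untracked)
git show HEAD:tracked.txt      # the committed content is the changed one
```

To include brand-new.txt as well, run git add on it before committing.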
At this point you may notice that Git works in a similar way to synchronization services such as Dropbox or Google Drive. Indeed it does. The difference is that Git does not synchronize files automatically but only after an explicit commit. So to speak, you have control over the version control.
If you are not sure whether your files are in sync with the repository, use status:
$ git status
On branch master
nothing to commit, working directory clean
The status above tells us that the working directory is in sync with the repository. If it were not, Git would list the differences. That output can get lengthy, so you will probably soon prefer the short form of git status:
$ git status -s
M write-slow.py
That’s the short-output version. The symbols mean: ?? for untracked files, M for modified files and A for files newly added to the staging area. You don’t really need to know much more to start with Git. Some other commands that may be useful at this stage are:
git log file                   # show the commit history of a file
git rm file                    # delete the file and stage its removal from the repository
git mv fromOldName toNewName   # change a file's name or location in the repository
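A short sketch of these three commands in a throwaway repository follows; the file names are hypothetical. Both git mv and git rm stage their change, so a commit is still needed to record it:

```shell
# Throwaway repository; names are for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "firstname.lastname@example.org"   # local to this repository
git config user.name "Demo User"                         # hypothetical name

echo "draft"   > old-name.txt
echo "scratch" > obsolete.txt
git add old-name.txt obsolete.txt
git commit -q -m "add two files"

git mv old-name.txt new-name.txt     # rename on disk and stage the rename
git rm -q obsolete.txt               # delete from disk and stage the removal
git status -s                        # shows the staged rename (R) and deletion (D)
git commit -q -m "rename one file, remove another"

git log --oneline -- new-name.txt    # history limited to this path
ls                                   # only new-name.txt is left on disk
```

Older versions of removed or renamed files stay in the history. Adding --follow to git log traces a file back across a rename.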
At this point you have learned to save your work locally. That is good enough if you are the only person in the project. If there are more of you, you will naturally want to share your code in a central, remote repository, and this is where Git becomes really useful. We will cover this in the next article, which I will publish in a few days. Meanwhile, why don’t you get started with Git by putting your existing work under version control and getting fluent with the commands above? Pretty much all you need at this point is: config, init, add, commit, status, log, rm, mv. Good luck!