With the help of part one of this article, you took your first steps with Git, the version control software. You learned to commit your files so that they became version-controlled. You need just two more skills: working with remote repositories, and checking out a particular version of your files (not necessarily the newest one). Learn those two things, and you’re good to go.

This text, by the way, is part of the Big Data in 30 hours lecture series. We do things from the perspective of a Data Scientist here, which, however, isn’t that different from a regular programmer’s perspective.

Working with a local Git repository only, as you learned in the previous article, is really a temporary hack. In the long run, you always want to work with a remote repository. Why? First, it is not great to store your backups locally. Second, you want to share your work with others. So we will first learn to use other people’s remote repositories, and then to create our own repo.

Checking out from a remote Git repository

The remote Git repository lives on a server. Ask your sysadmin to install one, or use a public one. The most common choice is, of course, GitHub.
For training purposes, I have made my writeslow repository, described in part one of this article, available on GitHub. Have a look at https://github.com/altanova/writeslow to explore this simple repository in a browser. This is how you get a copy on your local machine:

git clone https://github.com/altanova/writeslow [local-directory-name]

You should end up with a local folder that is an exact copy of the data on the server, including the Python source code and text files. You won’t be able to push your changes back, because you have no write permission. However, you can use it and play with it. In the same way, you can clone, play with, and use the source code of thousands of developers who maintain their code publicly on GitHub.
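If you want to try git clone without touching the network, here is a minimal sketch. It assumes a throwaway local repository in place of GitHub; the names upstream, my-copy and notes.txt are made up for illustration. git clone accepts a local path instead of a URL, so the mechanics are the same.

```shell
set -e                           # stop at the first failing command
work=$(mktemp -d)                # scratch area, safe to delete afterwards
cd "$work"

# Stand-in for the repository on the server (hypothetical content).
git init -q upstream
cd upstream
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add notes"
cd ..

# The optional second argument names the local directory.
git clone -q upstream my-copy

cd my-copy
ls -a                            # the working files plus the .git directory
git log --oneline                # the history came along with the files
```

The clone is a full repository of its own, not just a snapshot of the files, which is why git log works in it right away.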

Here’s an exercise. You might be aware of Peter Harrington’s great book Machine Learning in Action. However, the source code of the examples in the edition I have works only with Python 2. We need Python 3 source code. As a matter of fact, it is available here: https://github.com/pbharrin/machinelearninginaction3x.git

Now, as an exercise, clone this repository locally using git clone, and start playing with the Machine Learning examples. Here’s another exercise: get a copy of the games repository, used for training purposes, available here: https://github.com/githubtraining/github-games

Creating a remote Git / GitHub repository

Now learn to push your own source code to GitHub:

  1. Get yourself an account at github.com (pick the free account, get through email verification… the usual stuff).
  2. Continue in the browser, following the instructions to create your first repository.
  3. It is a good idea to add a README file and a license! Popular open source licenses are Apache and GPL.

Next, back in the shell on your own computer, get a copy of the remote repository with git clone, just like we did before:

git clone https://github.com/[your-user-name]/[your-repo-name]

At this point you should end up with …

$ cd [your-repo-name]
$ ls -la
total 384
drwxrwxrwx 1 pawel pawel   512 Dec 30 09:48 .git
-rwxrwxrwx 1 pawel pawel 11558 Dec 30 09:49 LICENSE
-rwxrwxrwx 1 pawel pawel   412 Dec 30 09:49 README.md

Note: this is where people get confused. You now work with not one Git repository but two: a local one in the .git directory, and a remote one on GitHub. So to properly save your work both locally and remotely, you need two steps: (1) git commit -a, to commit the changes locally, and (2) git push, to synchronize the local repository with the remote one. Keep in mind that git commit -a only picks up files Git already tracks, so brand-new files need git add first. So, let’s add some files to the directory and do just this:

$ git add .
$ git commit -a -m "just a space"
$ git push

This saves your work to the original remote repository (called origin), on the default branch (called master here; repositories created on GitHub since late 2020 call it main). We used the short syntax of git push. The full syntax would read:

git push origin master
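To see the two steps at work without a GitHub account, here is a small sketch under one big assumption: a bare repository on the local disk (remote.git, a made-up name) plays the role of GitHub. Everything else is the same commit-then-push sequence as above.

```shell
set -e
work=$(mktemp -d)
cd "$work"

git init -q --bare remote.git        # stand-in for the GitHub repository
git clone -q remote.git local        # Git warns the repo is empty; that is expected
cd local
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > hello.txt
git add hello.txt                    # new files must be added before committing
git commit -q -m "add hello.txt"     # step 1: saved in the local repository only
git branch -M master                 # make sure the branch is named master, as in this article

git ls-remote origin                 # prints nothing: the remote has no commits yet
git push -q origin master            # step 2: copy the commit to the remote
git ls-remote origin                 # now lists refs/heads/master
```

The two git ls-remote calls make the point of the note above visible: after the commit the remote is still empty, and only the push changes that.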

By the way, if you ever forget where you originally cloned the repository from, just ask Git:

git remote show origin
* remote origin
  Fetch URL: https://github.com/altanova/writeslow
  Push  URL: https://github.com/altanova/writeslow
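Two shorter variants may come in handy. git remote show origin contacts the server, while git remote -v and git remote get-url origin only read the local configuration. The sketch below assumes a local bare repository (remote.git, a made-up name) as the thing we cloned from.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git        # hypothetical stand-in for the server
git clone -q remote.git local        # the empty-repository warning is expected
cd local

git remote -v                        # every remote with its fetch and push URL
origin_url=$(git remote get-url origin)   # just the URL, handy in scripts
echo "cloned from: $origin_url"
```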

And how do you push to a different repository, or to a different branch? Here is how:

git push https://github.com/[another-repo-name] [another-branch]
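If you push to the second repository more than once, typing the URL every time gets tedious. A possible alternative is to register it under a short name with git remote add. In this sketch the remote name backup, the branch name experiment and both bare repositories are assumptions made for the demo.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare origin.git        # stand-in for the original remote
git init -q --bare backup.git        # hypothetical second repository
git clone -q origin.git local        # the empty-repository warning is expected
cd local
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "v1" > data.txt
git add data.txt
git commit -q -m "first version"
git branch -M master                 # branch name used throughout this article

git remote add backup "$work/backup.git"   # short name for the second repository
git push -q origin master
git push -q backup master                  # same commit, different destination
git push -q backup master:experiment       # local master, published as branch experiment
git remote -v                              # lists both origin and backup
```

The local:remote form in the last push is what lets the branch carry a different name on the other side.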

I will not go into branching and merging in this basic class. More on this here, for those interested. In summary, we learned two commands: git clone, to copy a remote repository to your machine, and git push, to send your local commits to the remote repository.

Tagging and versioning

It is nice to have some control over the version control. At a minimum, we want to assign names to versions. Here is how. To tag using an annotated tag with a message:

git tag -a v1.0 -m "it works."

To tag using a lightweight tag without a message:

git tag v1.01

No need to commit after tagging. To list the existing tags, run git tag with no arguments; to see the details of a particular tag, use git show, like this:

>git tag

>git show v1.0
tag v1.0
Tagger: Pawel Plaszczak <pp@altanova.pl>
Date:   Sun Dec 30 19:04:24 2018 +0100
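One thing worth knowing once you combine tags with a remote repository: a plain git push of a branch does not send your tags along, unless you have configured Git to do so. The sketch below shows this, again assuming a local bare repository (remote.git) in place of GitHub and a made-up file app.py.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git        # stand-in for the GitHub repository
git clone -q remote.git local        # the empty-repository warning is expected
cd local
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "print('hello')" > app.py
git add app.py
git commit -q -m "working version"
git branch -M master

git tag -a v1.0 -m "it works."       # annotated tag, as above
git push -q origin master
git ls-remote --tags origin          # prints nothing: the tag stayed local
git push -q origin v1.0              # push the tag explicitly, by name
git ls-remote --tags origin          # now shows refs/tags/v1.0
```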

Tagging is useful when we want to check out a historical version of the software, rather than the most recent one. This is how:

>git checkout v1.01
Note: checking out 'v1.01'.

You are in 'detached HEAD' state. You can look around, make experimental

The message is a warning. Do not commit anything unless you know what you are doing. Have a look around, copy away the files you want, and then return to the latest version to leave the detached HEAD state. Here’s how:

>git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
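The whole round trip can be rehearsed in a scratch repository. This is only a sketch: the file story.txt and its contents are invented, and the repository has no remote, so the "up to date with origin/master" line will not appear.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q project
cd project
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "old text" > story.txt
git add story.txt
git commit -q -m "first version"
git branch -M master
git tag v1.01                        # lightweight tag on the first version

echo "new text" > story.txt
git commit -q -a -m "second version"

git checkout -q v1.01                # detached HEAD: files are as they were at the tag
old=$(cat story.txt)
cp story.txt ../story-v1.01.txt      # copy away what you need
git checkout -q master               # back to the latest version
new=$(cat story.txt)
echo "at v1.01: $old / at master: $new"
```

Nothing is committed while HEAD is detached, so returning to master leaves the history exactly as it was.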

A final word for the Data Scientists

At this point you can start using Git. The commands you learned are: config, init, add, commit, status, log, rm, mv, clone, push, show, checkout, tag. For anything else, refer to the documentation.

Just two more points to consider. In part one of this article, I mentioned that as a Data Scientist, you might decide not to store some project files, such as raw data, in Git. Whether to keep data in version control is an important decision made per project. There is no silver bullet here. Some people create a separate data directory which is kept out of version control. Others recommend a particular structure for the project folders, such as the one here. You may also get acquainted with Git Large File Storage, as one of the options to secure your data files.

That said, raw data files aside, it is generally good practice to version-control everything that is, or can potentially be, subject to modification: not just source code, but everything human-generated, including parameters, metadata, build files, and documentation.
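For the option of a separate data directory kept out of version control, one common mechanism is a .gitignore file. Here is a minimal sketch; the layout (a data directory with raw.csv next to analyze.py) is an assumption made for the demo, not a recommended project structure.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q project
cd project
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir data
echo "1,2,3" > data/raw.csv          # hypothetical raw data file
echo "print('analysis')" > analyze.py

echo "data/" > .gitignore            # everything under data/ stays out of Git
git add .
git commit -q -m "code and ignore rules, no raw data"

git ls-files                         # lists .gitignore and analyze.py only
git check-ignore -v data/raw.csv     # shows which rule excludes the file
```

Note that the .gitignore file itself is committed, so everyone who clones the project gets the same rule.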

Now, that’s really it. You are good to go. Good luck!

Here’s Big Data in 30 hours main page again, and here’s my contact
