Having fun with Git

I recently read The Git Book. As I went through the Git Internals parts, it struck me how simple and elegant the structure of Git really is. I decided that I just had to create my own little library to work with Git repositories (as you do). I call the result Silly Jgit. In this article, I will be walking through the code.

This article is for you if you want to understand Git a bit deeper or perhaps even want to work directly with a Git repository in your favorite programming language. I will be walking through four topics: 1) Reading a raw commit from a repository, 2) Reading the tree hash of the root of a commit, 3) parsing the file list of a directory tree, and 4) Reading the file contents from a subdirectory of a commit root.

Reading the head commit from a repository

The first thing we need to do in order to read the head commit is to find out which commit is the head of the repository. The .git/HEAD file is a plain text file that contains the name of a file in the .git/refs/heads directory. If you’ve checked out master, this will be .git/refs/heads/master. This file is a plain text file which contains a hash, that is: a 40 digit hexadecimal number. The hash can be converted to a filename of a Git Object under .git/objects. This file is a compressed file containing the commit information. Here’s the code to read it:

Running this code produces the following output (notice that some of the spaces in the output are actually null bytes in the file):

Finding the directory tree of a commit

When we have the commit information, we can parse it to find the tree hash. The tree hash references another file under .git/objects which contains the index of the root directory of the files in the commit. In the example above, the tree hash is “c03265971361724e18e31cc83e5c60cd0e0f5754″. But before we read the tree hash, we have to read the object type (in this case a “commit”) and size (in this case 237).

Looking at the tree hash file is not as straight forward, however:

The next part of this article will show how to deal with this.

Parsing a directory tree

The tree file has what looks like a lot of garbage. But don’t panic. Just like with the commit object, the tree object starts with the type (“tree”) and the size (130). After this, it will list each file or directory. Each tree entry consists of permissions (which also tells us whether this is a file or a directory), the file name and the hash of the entry, but this time as a binary number. We can read through the entries and find the file we want. We can then just print out the contents of this file:

Here’s an example of a parsed directory listing. I have not showed the octalMode for each file, but this can be extremely useful to separate between directories (which octalMode starts with 0) and files:

Reading a file

This leads us to the end of our journey – how to read the contents of a file. Once we have the entries of a tree, it’s a simple matter of looking up the hash for a filename and parsing that file. As before, the file contents will start with the type (“blob” – which means “data”, I guess) and file size:

This prints the contents of our file. Obviously, if you want to find a file a subdirectory, you’ll have to do a bit more work: Parse another tree object and look and an entry in that object, etc.

Conclusions

This blog post shows how in less than 50 lines of code, with no dependencies (but a small utility helper class), we can find the head commit of a git repository, parse the file listing of the root of the file tree for that commit and print out the contents of a file. The most difficult part was to discover that it was the InflaterInputStream and not Zip or Gzip that was needed to unpack a git object.

My silly-jgit project supports reading and writing commits, trees and hashes from .git/objects. This is just the core subset of the Git plumbing commands. Furthermore, just as I wrote the article, I noticed that git often packs objects into .git/objects/pack. This adds a totally new dimension that I haven’t dealt with before.

I hope that nobody is crazy enough to actually use my silly Git library for Java. But I do hope that this article gave you some feeling of Git mastery.

Creative Commons License
Having fun with Git by Johannes Brodwall, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 3.0 Unported License.

About Johannes Brodwall

Johannes is the Oslo based Chief Scientist for the Sri Lanka based company Exilesoft. He trains distributed teams and contributes to projects halfway across the world. He is an active contributor to the programmer community in Oslo and Sri Lanka and a veteran speaker in Norway and abroad.
This entry was posted in Code, English, Java, Technology. Bookmark the permalink.
  • Magnus Bondesson

    I remember you being skeptic to Git in December 2012. Great to hear that you have required the knowledge necessary to understand how smart and beautiful Git is!

    Linus Torvalds is a true genius.

  • http://johannesbrodwall.com/ Johannes Brodwall

    I’ve never really been skeptical to the technology behind Git, although I’ve enjoyed learning more about it. I find that the learning curve is hard for many developers and this is something we need to take serious. But it’s worth it.