Having fun with Git
I recently read The Git Book. As I went through the Git Internals parts, it struck me how simple and elegant the structure of Git really is. I decided that I just had to create my own little library to work with Git repositories (as you do). I call the result Silly Jgit. In this article, I will be walking through the code.
This article is for you if you want to understand Git a bit deeper or perhaps even want to work directly with a Git repository in your favorite programming language. I will be walking through four topics: 1) reading a raw commit from a repository, 2) reading the tree hash of the root of a commit, 3) parsing the file list of a directory tree, and 4) reading the contents of a file from the root of a commit.
Reading the head commit from a repository
The first thing we need to do in order to read the head commit is to find out which commit is the head of the repository. The .git/HEAD file is a plain text file that names the current branch as a path relative to .git. If you’ve checked out master, it contains “ref: refs/heads/master”, which points to the file .git/refs/heads/master. That file is also plain text and contains a hash, that is: a 40-digit hexadecimal number. The hash maps to the filename of a Git object under .git/objects: the first two digits name a directory and the remaining 38 name the file inside it. This file is compressed and contains the commit information. Here’s the code to read it:
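// Requires java.io.File, java.io.FileInputStream, java.io.InputStream
// and java.util.zip.InflaterInputStream. Util is the small helper class mentioned in the conclusions.
File repository = new File(".git");
// HEAD contains "ref: refs/heads/<branch>"; the part after the space is the ref file
File headFile = new File(repository,
        Util.asString(new File(repository, "HEAD")).split(" ")[1].trim());
String commitHash = Util.asString(headFile).trim();
// Objects live in objects/<first 2 hex digits>/<remaining 38 hex digits>
File commitFile = new File(repository,
        "objects/" + commitHash.substring(0,2) + "/" + commitHash.substring(2));
// The object file is compressed, so inflate it while reading
try (final InputStream inputStream = new InflaterInputStream(new FileInputStream(commitFile))) {
    System.out.println(Util.asString(inputStream));
}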
Running this code produces the following output (notice that some of the spaces in the output are actually null bytes in the file):
commit 237 tree c03265971361724e18e31cc83e5c60cd0e0f5754
parent 141f5d5a2cc0c268e7b05be17a49c1c0dc61efad
author Johannes Brodwall 1379445359 +0200
committer Johannes Brodwall 1379445359 +0200
This is the commit comment
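The two lookups above (following HEAD to a ref file, then turning a hash into a path under .git/objects) can also be sketched without the Util helper. This is only a minimal sketch, not the code from Silly Jgit: the class GitPaths and its method names are made up for illustration, and it assumes the ref exists as a plain file under .git. It also covers a detached HEAD, where .git/HEAD holds a hash directly instead of a “ref: ” line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Silly Jgit.
final class GitPaths {

    /** Returns the ref named by HEAD (e.g. "refs/heads/master"), or null if HEAD is detached. */
    static String headRef(String headContents) {
        String trimmed = headContents.trim();
        return trimmed.startsWith("ref: ") ? trimmed.substring(5).trim() : null;
    }

    /** Maps a 40-digit hash to the path of its object file, relative to .git. */
    static String objectPath(String hash) {
        if (!hash.matches("[0-9a-f]{40}")) {
            throw new IllegalArgumentException("Not a 40-digit hex hash: " + hash);
        }
        return "objects/" + hash.substring(0, 2) + "/" + hash.substring(2);
    }

    /** Resolves HEAD to a commit hash. Assumes the ref is stored as a plain file under gitDir. */
    static String headCommit(Path gitDir) throws IOException {
        String head = new String(Files.readAllBytes(gitDir.resolve("HEAD")), StandardCharsets.UTF_8);
        String ref = headRef(head);
        if (ref == null) {
            return head.trim(); // detached HEAD: the file holds the hash itself
        }
        return new String(Files.readAllBytes(gitDir.resolve(ref)), StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) {
        // Uses the tree hash from the commit output above as sample input
        System.out.println(headRef("ref: refs/heads/master\n"));
        System.out.println(objectPath("c03265971361724e18e31cc83e5c60cd0e0f5754"));
    }
}
```

Keeping headRef and objectPath free of file access makes them easy to try out on strings copied from a real repository.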
Finding the directory tree of a commit
When we have the commit information, we can parse it to find the tree hash. The tree hash references another file under .git/objects which contains the index of the root directory of the files in the commit. In the example above, the tree hash is “c03265971361724e18e31cc83e5c60cd0e0f5754”. But before we read the tree hash, we have to read the object type (in this case a “commit”) and size (in this case 237).
String treeHash;
try (final InputStream inputStream = new InflaterInputStream(new FileInputStream(commitFile))) {
    // Header: "<type> <size>\0"
    String type = Util.stringUntil(inputStream, ' ');
    long length = Long.parseLong(Util.stringUntil(inputStream, (char)0));
    // Skip the word "tree", then read the hash up to the end of the line
    Util.stringUntil(inputStream, ' ');
    treeHash = Util.stringUntil(inputStream, '\n');
    System.out.println("Tree hash: " + treeHash);
}
File rootTreeFile = new File(repository,
        "objects/" + treeHash.substring(0,2) + "/" + treeHash.substring(2));
try (final InputStream inputStream = new InflaterInputStream(new FileInputStream(rootTreeFile))) {
    System.out.println(Util.asString(inputStream));
}
The contents of the tree object are not as straightforward to read, however:
tree 130 100644 FOO æ?â?²ÑØCK?)®wZØÂä?S?
100644 FOO.txt ýc?Õô¹ìmìªGAk?X?ï'&
100644 README Wýs?ºyâx+@îR°X040000 lib ?ñG»Ñ?¼>&8´. ?úË¢i[o
The next part of this article will show how to deal with this.
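The Util.stringUntil helper used above is not shown in this article. Here is a minimal sketch of what such a helper might look like, together with a parser for the “type size” header that every object starts with. The class name ObjectHeader is made up, and the choice to return null at the end of the stream is an assumption; the real helper may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the article's Util.stringUntil; the real helper may differ.
final class ObjectHeader {
    final String type;
    final long size;

    ObjectHeader(String type, long size) {
        this.type = type;
        this.size = size;
    }

    /**
     * Reads bytes up to (but not including) the terminator.
     * Returns null if the stream is already at its end (an assumption of this sketch).
     * Each byte becomes one char, which is only correct for ASCII data.
     */
    static String stringUntil(InputStream in, char terminator) throws IOException {
        StringBuilder result = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == terminator) {
                return result.toString();
            }
            result.append((char) b);
        }
        return result.length() == 0 ? null : result.toString();
    }

    /** Parses the "<type> <size>\0" prefix shared by commits, trees and blobs. */
    static ObjectHeader read(InputStream in) throws IOException {
        String type = stringUntil(in, ' ');
        String size = stringUntil(in, (char) 0);
        if (type == null || size == null) {
            throw new IOException("Truncated object header");
        }
        return new ObjectHeader(type, Long.parseLong(size));
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes shaped like the start of the commit shown earlier
        byte[] raw = "commit 237\0tree c03265971361724e18e31cc83e5c60cd0e0f5754\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(raw);
        ObjectHeader header = read(in);
        stringUntil(in, ' '); // skip the word "tree"
        System.out.println(header.type + " " + header.size + " " + stringUntil(in, '\n'));
    }
}
```

Reading the header once in a shared method avoids repeating the same two stringUntil calls for every object type.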
Parsing a directory tree
The tree file has what looks like a lot of garbage. But don’t panic. Just like the commit object, the tree object starts with the type (“tree”) and the size (130). After this, it lists each file or directory. Each tree entry consists of the permissions (which also tell us whether this is a file or a directory), the file name, and the hash of the entry, this time as 20 raw bytes rather than 40 hexadecimal digits. We can read through the entries, collect the hash for each name, and print out the result:
File rootTreeFile = new File(repository,
        "objects/" + treeHash.substring(0,2) + "/" + treeHash.substring(2));
Map<String, String> entries = new HashMap<>();
try (final InputStream inputStream = new InflaterInputStream(new FileInputStream(rootTreeFile))) {
    String type = Util.stringUntil(inputStream, ' ');
    long length = Long.parseLong(Util.stringUntil(inputStream, (char)0));
    while (true) {
        // Assumes Util.stringUntil returns null at the end of the stream
        String rawMode = Util.stringUntil(inputStream, ' ');
        if (rawMode == null) break;
        String octalMode = Util.leftPad(rawMode, 6, '0');
        String path = Util.stringUntil(inputStream, (char)0);
        // The hash is stored as 20 raw bytes; convert it to 40 hex digits
        StringBuilder hash = new StringBuilder();
        for (int i=0; i<20; i++) {
            hash.append(Util.leftPad(Integer.toHexString(inputStream.read()), 2, '0'));
        }
        entries.put(path, hash.toString());
    }
}
System.out.println(entries);
Here’s an example of a parsed directory listing. I have not shown the octalMode for each file, but it is very useful for telling directories (whose padded octalMode starts with 0) apart from files:
{FOO.txt=fd6385d5f4b9ec6decaa47416b7f96588aef2726,
lib=8ff147bbd18fbc3e2638b42ea09cfacba2695b6f,
README=57fd19a7738eba1e79e2782b161a40ee52b05801,
FOO=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391}
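If you want to keep the mode as well, each entry can be parsed into a small value object instead of a map entry. The following is a minimal sketch under a few assumptions: the class TreeEntry and its members are made up for illustration, the input is the already inflated tree body with the “tree size” header removed, and a directory is recognized by the file-type bits of the octal mode.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value object for illustration; not part of Silly Jgit.
final class TreeEntry {
    final String mode; // octal digits as stored in the tree, e.g. "100644"
    final String name;
    final String hash; // 40 hex digits

    TreeEntry(String mode, String name, String hash) {
        this.mode = mode;
        this.name = name;
        this.hash = hash;
    }

    /** True if the file-type bits of the mode say "directory" (octal 040000). */
    boolean isDirectory() {
        return (Integer.parseInt(mode, 8) & 0170000) == 0040000;
    }

    /** Parses entries of the form "<mode> <name>\0<20 raw hash bytes>" from an inflated tree body. */
    static List<TreeEntry> parseAll(byte[] body) {
        List<TreeEntry> entries = new ArrayList<>();
        int pos = 0;
        while (pos < body.length) {
            int space = indexOf(body, ' ', pos);
            int nul = space < 0 ? -1 : indexOf(body, 0, space + 1);
            if (nul < 0 || nul + 1 + 20 > body.length) {
                throw new IllegalArgumentException("Truncated tree entry at offset " + pos);
            }
            String mode = new String(body, pos, space - pos, StandardCharsets.US_ASCII);
            String name = new String(body, space + 1, nul - space - 1, StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder(40);
            for (int i = nul + 1; i <= nul + 20; i++) {
                hex.append(String.format("%02x", body[i] & 0xff));
            }
            entries.add(new TreeEntry(mode, name, hex.toString()));
            pos = nul + 21; // continue right after the 20 hash bytes
        }
        return entries;
    }

    private static int indexOf(byte[] data, int value, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == (byte) value) {
                return i;
            }
        }
        return -1;
    }
}
```

Working on a byte array instead of a stream keeps the truncation checks simple, at the cost of holding the whole tree in memory.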
Reading a file
This leads us to the end of our journey - how to read the contents of a file. Once we have the entries of a tree, it’s a simple matter of looking up the hash for a filename and parsing that file. As before, the file contents will start with the type (“blob” - which means “data”, I guess) and file size:
String blobHash = entries.get("README");
File blobFile = new File(repository,
        "objects/" + blobHash.substring(0,2) + "/" + blobHash.substring(2));
try (final InputStream inputStream = new InflaterInputStream(new FileInputStream(blobFile))) {
    // Skip the "blob <size>\0" header; the rest is the file contents
    String type = Util.stringUntil(inputStream, ' ');
    long length = Long.parseLong(Util.stringUntil(inputStream, (char)0));
    System.out.println(Util.asString(inputStream));
}
This prints the contents of our file. Obviously, if you want to find a file in a subdirectory, you’ll have to do a bit more work: parse another tree object, look up an entry in that object, and so on.
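The extra work for subdirectories is a loop over the path segments: look up the first segment in the root tree, read the tree it points to, look up the next segment there, and so on. Here is a minimal sketch of that loop. The class PathResolver is made up, and the readTree function is a placeholder for “inflate the tree object with this hash and parse it into a name-to-hash map”; in this sketch it is assumed to return null for anything that is not a tree.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; readTree stands in for the tree parsing shown earlier.
final class PathResolver {

    /**
     * Walks a path like "lib/Foo.java" one segment at a time, starting at the root tree.
     * Returns the hash of the final entry, or null if any segment is missing.
     */
    static String resolve(String rootTreeHash, String path,
                          Function<String, Map<String, String>> readTree) {
        String currentHash = rootTreeHash;
        for (String segment : path.split("/")) {
            Map<String, String> entries = readTree.apply(currentHash);
            if (entries == null) {
                return null; // the previous segment was not a directory
            }
            currentHash = entries.get(segment);
            if (currentHash == null) {
                return null; // no entry with this name
            }
        }
        return currentHash;
    }

    public static void main(String[] args) {
        // In-memory stand-in for .git/objects; the ids and file names are invented
        Map<String, Map<String, String>> trees = new HashMap<>();
        Map<String, String> root = new HashMap<>();
        root.put("README", "blob-1");
        root.put("lib", "tree-2");
        Map<String, String> lib = new HashMap<>();
        lib.put("Foo.java", "blob-3");
        trees.put("tree-1", root);
        trees.put("tree-2", lib);

        System.out.println(resolve("tree-1", "lib/Foo.java", trees::get)); // prints blob-3
    }
}
```

Passing the tree reader in as a function keeps the walk itself independent of how objects are stored, so the same loop can be tried against in-memory maps or real files.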
Conclusions
This blog post shows how in fewer than 50 lines of code, with no dependencies (but a small utility helper class), we can find the head commit of a Git repository, parse the file listing of the root of the file tree for that commit and print out the contents of a file. The most difficult part was to discover that it was the InflaterInputStream and not Zip or Gzip that was needed to unpack a Git object.
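The reason is that Git compresses each object as a plain zlib stream, which is the format InflaterInputStream and DeflaterOutputStream handle by default, while GZIPInputStream expects gzip framing around the data. Here is a minimal sketch of the round trip, using a made-up class called LooseObject. It also computes the object’s name, which is the SHA-1 of the uncompressed bytes including the header; for an empty blob this gives e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the hash listed for FOO in the directory listing earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not part of Silly Jgit.
final class LooseObject {

    /** Builds the uncompressed form: "<type> <size>\0" followed by the content. */
    static byte[] encode(String type, byte[] content) {
        byte[] header = (type + " " + content.length + "\0").getBytes(StandardCharsets.US_ASCII);
        byte[] result = new byte[header.length + content.length];
        System.arraycopy(header, 0, result, 0, header.length);
        System.arraycopy(content, 0, result, header.length, content.length);
        return result;
    }

    /** The object name is the SHA-1 of the uncompressed bytes, header included. */
    static String hash(byte[] encoded) {
        try {
            StringBuilder hex = new StringBuilder(40);
            for (byte b : MessageDigest.getInstance("SHA-1").digest(encoded)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Compresses to a zlib stream, the format InflaterInputStream reads. */
    static byte[] deflate(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream zlib = new DeflaterOutputStream(out)) {
            zlib.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses a zlib stream back to the original bytes. */
    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream zlib = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zlib.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] emptyBlob = encode("blob", new byte[0]);
        System.out.println(hash(emptyBlob)); // prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
        System.out.println(inflate(deflate(emptyBlob)).length); // prints 7, the length of "blob 0\0"
    }
}
```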
My silly-jgit project supports reading and writing commits, trees and hashes from .git/objects. This is just the core subset of the Git plumbing commands. Furthermore, just as I was writing this article, I noticed that Git often packs objects into .git/objects/pack. This adds a totally new dimension that I haven’t dealt with yet.
I hope that nobody is crazy enough to actually use my silly Git library for Java. But I do hope that this article gave you some feeling of Git mastery.
Comments:
[Magnus Bondesson] - Oct 18, 2013
I remember you being skeptical of Git in December 2012. Great to hear that you have acquired the knowledge necessary to understand how smart and beautiful Git is!
Linus Torvalds is a true genius.
Johannes Brodwall - Nov 7, 2013
I’ve never really been skeptical of the technology behind Git, although I’ve enjoyed learning more about it. I find that the learning curve is hard for many developers and this is something we need to take seriously. But it’s worth it.