Converting git commit history to a Solr full-text index

I built a 4-million-document archive from GitHub commits, which lets you search for open source experts, ranked by commit count. Click here to try the demo.

Solr is a relatively recent addition to the world of Lucene (2007); it adds a web-app UI, scaling (highly available reads), and configuration on top of Lucene. For those unfamiliar, full-text indexing products build databases of the words used in documents, allowing fast searching of text within a document. They handle language features such as synonyms (ran/run) and stemmed words (sear, seared, sears). Unlike typical database indices, they perform very well at finding similar words in close proximity.

I took Solr training at a recent conference, Lucene Revolution, which is sponsored by Lucid Imagination, a Solr consulting company. I decided to test it out in a few small projects, having previously done a FAST proof-of-concept. A colleague and I brainstormed ideas, and I wrote ETL code to convert several company git repositories into full-text indexes. This could easily be expanded to link relevant JIRA tickets, SharePoint documents, or other source control systems; in fact, one of the repositories I used for testing was a converted CVS repository.

Interest is growing in Solr, at least in part because a competitor, FAST, was acquired by Microsoft, which now uses it in SharePoint. A Solr index is conceptually similar to a single database table, and thus could also be used as a key-value store. Each column has detailed configuration, linking to various Java classes, so you can control whether data is hashed, compressed, or stored. Configuration files also control language features such as stemming, heuristics, and synonyms. Each feature is stored in a separate set of files in the index, so you can easily figure out which features to turn off to save space.
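As a rough illustration, per-field configuration lives in schema.xml; the field names and types below are hypothetical, not this project's actual schema, but they show the kind of knobs involved (whether a field is indexed, stored, copied into a catch-all, and which analysis chain it uses):

```xml
<!-- Hypothetical schema.xml excerpt; field names and types are illustrative.
     The text_en type in the stock example schema wires in tokenization,
     stemming, stop words, and synonym filters. -->
<field name="id"      type="string"  indexed="true" stored="true" required="true"/>
<field name="author"  type="string"  indexed="true" stored="true"/>
<field name="message" type="text_en" indexed="true" stored="true"/>

<!-- catch-all field: searchable, but never returned in results -->
<field name="text"    type="text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="author"  dest="text"/>
<copyField source="message" dest="text"/>

<uniqueKey>id</uniqueKey>
```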

I settled on JGit to read git repositories, after testing a couple of Java-based libraries. JGit is a re-implementation of git in Java, unlike other libraries, which wrap the command-line interface. It is also the basis for the Eclipse plugin EGit. EGit does not handle some line-ending settings correctly, which concerned me, but JGit seems to work fine for this project. Reading repositories is by far the slowest part of this script. JGit only reads what you ask for: it is much faster to read just the commit history, without file diffs.
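As a minimal sketch of what that looks like (the repository path below is a placeholder), walking just the commit log with JGit is only a few lines, and it never touches file contents:

```java
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

import java.io.File;

public class CommitReader {
    public static void main(String[] args) throws Exception {
        try (Git git = Git.open(new File("/path/to/repo"))) {
            // log() walks only the commit graph; no file diffs are computed,
            // which is what keeps this pass relatively fast.
            for (RevCommit commit : git.log().call()) {
                System.out.printf("%s %s %s%n",
                        commit.getName(),                  // 40-character SHA-1
                        commit.getAuthorIdent().getName(), // author as recorded in the commit
                        commit.getShortMessage());
            }
        }
    }
}
```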

Solr provides a simple Java interface called SolrJ, which lets you push a list of rows (“documents”) to the index. Because of the single-table structure, it is quite common to have denormalized data, as well as data repeated in multiple columns. For example, you might concatenate all the fields you wish to search on into a single field that is indexed but not stored, so its contents can never be viewed directly; however, the same data can still be referenced through the individual fields.
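A minimal sketch of that push is below. The URL, core name, and field names are assumptions rather than this project's actual setup, and SolrJ class names have changed across Solr versions; this assumes a recent client with HttpSolrClient:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class CommitIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/commits").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "<commit-sha>");               // unique key
            doc.addField("author", "Gary Sieling");
            doc.addField("message", "Fix line-ending handling");
            // Denormalized catch-all: concatenated text that is indexed for search
            // but (per the schema) not stored for display.
            doc.addField("text", "Gary Sieling Fix line-ending handling");
            batch.add(doc);

            solr.add(batch);   // push the whole batch of documents
            solr.commit();     // make them visible to searches
        }
    }
}
```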

One of the important design considerations is what to use for a unique ID; each document needs an ID for later updates or deletions. When you push a document whose ID already exists in a Solr index, the old document is deleted and the replacement is added (a soft delete). If you change the schema, the whole index must be rebuilt. I used the commit ID, because I only indexed commit metadata; however, if I added entries for each file in the diff, I would need to use a different value for the ID. Database-style sequence-based IDs would be a poor choice, because they would make updates very difficult.
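A small sketch of that overwrite-by-ID behavior, again with an assumed URL and field names:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexCommit {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/commits").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "<commit-sha>");         // the 40-character commit SHA-1
            doc.addField("message", "original message");
            solr.add(doc);

            // Adding a document with the same id again: Solr deletes the old copy
            // and adds this one in its place.
            doc.setField("message", "corrected message");
            solr.add(doc);

            // If there were one document per file in the diff, a composite key
            // (e.g. sha + ":" + path, hypothetical) would be needed instead.
            solr.commit();
        }
    }
}
```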

Some transformations are best done in Java, such as entity name normalization or extracting data from external systems. Because I combined several git repositories into one full-text index, I needed to normalize author names (“gsieling” and “Gary Sieling”). IntelliJ adds a line to commits identifying itself, so I added an artificial field identifying the commit author’s IDE of choice.
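The sketch below shows the shape of those two transformations; the alias map and the IntelliJ marker string are hypothetical examples, since the exact line the IDE appends is not shown here:

```java
import java.util.Map;

public class CommitEnrichment {
    // Hypothetical alias map built up while combining repositories.
    private static final Map<String, String> AUTHOR_ALIASES = Map.of(
            "gsieling", "Gary Sieling");

    static String normalizeAuthor(String name) {
        return AUTHOR_ALIASES.getOrDefault(name, name);
    }

    static String guessIde(String commitMessage) {
        // Hypothetical check: look for an IDE-added signature line in the message.
        if (commitMessage.contains("IntelliJ IDEA")) {
            return "IntelliJ";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(normalizeAuthor("gsieling"));                          // Gary Sieling
        System.out.println(guessIde("Fix bug\n\n(Committed via IntelliJ IDEA)")); // IntelliJ
    }
}
```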

In a very crude and unfair comparison, the unoptimized Solr index is 90 MB of commits versus 2,000 MB of git history. The trade-offs that make Solr indexes small also make them fast, even without any kind of scaling.

This is the first post of many; in future posts I will present interesting practical uses of this project, as well as options for parsing code. You can see the full source of this project on GitHub here.