nosql - Gary Sieling

In a software company that does consulting, it’s often valuable for engineers to look at a change and know if it was done for a particular client – for instance, if an API feature does not appear to be used, it may actually be in use by a specific client. That said, the person who made the change may be a junior developer, and you’d prefer to know who the lead for a project was.

Similarly, when someone calls the company, it’s also helpful to know who to route phone calls to – this is often the case when a client doesn’t have a problem for a year or two after project was built, and since the person who picks up the phone could be a marketing or administrative person receiving the call and not someone with knowledge of the project.

I built a small application to do this, with the architecture pictured here:

I found that typically you can find who is the lead on a project, merely based on the number of commits they’ve made referencing various topics, as long as you make good use of the available data. To do these searches we want to use the full-text indexing capacity of Solr to filter down commits. However, this is different from a typical search, where you’re looking for the commits individually, as we’re displaying facet results.

The backend schema for this is fairly simple:

To populate this database, we need to do some work using JGit, a Java library maintained as part of Eclipse. To traverse source code history, you need to pick a starting point and work backwards – if you were to parse every branch, you would need to de-duplicate older commits.

FileRepositoryBuilder builder = new FileRepositoryBuilder();
Repository repository = 
  builder.setGitDir(new File(path))
    .build();

RevWalk walk = new RevWalk(repository);

for (Ref ref : repository.getAllRefs().values()) {
  if ("HEAD".equals(ref.getName())) {
    walk.markStart(walk.parseCommit(ref.getObjectId()));
    break;
  }
}

for (RevCommit commit : walk) {
  ...
}

The final loop steps over each file in the commit – you can construct patches here if you wish. For this application, it’s helpful to remove very large commits, where someone might have made mass refactorings, and to hide file deletions. (That said, there may be some value in reporting on who does mass changes – this might also identify your senior developers).

As we loop over the commits, we can construct “documents” to send to Solr- these are just maps containing all the attributes we want to save.

// Connections happen over HTTP:
HttpSolrServer server = new HttpSolrServer(
     "http://localhost:8080/solr");

Collection docs = new ArrayList();

SolrInputDocument doc = new SolrInputDocument();

// ID for the document should have enough information to 
// find it in the source data:
doc.addField("id", remoteUrl + "." + commit.getId());

doc.addField("author", commit.getAuthorIdent().getName());
doc.addField("email", commit.getAuthorIdent().getEmailAddress());
doc.addField("message", commit.getFullMessage());

// Any data we let the user search against is included in this value:
doc.addField("search", search);

The final search results look like this:

Notably, all these people are knowledgeable about SQL, but from different angles: a committer on Postgres, a DBA, a principal on MySQL, and people who built PHP drivers for Oracle and SQL Server. Thus, through fairly simple means, and a few hours of work, you can quickly built a reporting system which identifies key players in an organization- If you’re interested in this project, check out slides from my talk, “Mining Source Code Repositories with Solr“, or the code for the project on github.

Tag: nosql

Discovering Senior Developers from Source Code History