Discovering Senior Developers from Source Code History

In a software company that does consulting, it’s often valuable for engineers to look at a change and know if it was done for a particular client – for instance, if an API feature does not appear to be used, it may actually be in use by a specific client. That said, the person who made the change may be a junior developer, and you’d prefer to know who the lead for a project was.

demo1

Similarly, when someone calls the company, it helps to know who to route the call to. This often comes up when a client has no problems for a year or two after a project was built, since the person who picks up the phone may be a marketing or administrative person rather than someone with knowledge of the project.

I built a small application to do this, with the architecture pictured here:

uml

I found that you can typically identify the lead on a project purely from the number of commits they’ve made referencing various topics, as long as you make good use of the available data. To do these searches we use the full-text indexing capacity of Solr to filter down commits. This differs from a typical search, where you’re looking for individual commits: here we display facet results, that is, counts of matching commits grouped by a field such as author.
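
As a rough sketch of what such a facet request might look like, the snippet below assembles a Solr select URL that asks for per-author counts and no individual commits. The base URL, the `search:` query syntax and the choice of `author` as the facet field are assumptions drawn from the schema in this post; the class and method names are purely illustrative. It uses only the JDK (Java 10 or later for this `URLEncoder.encode` overload).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FacetQueryUrl {

    // Builds a select URL that returns only per-author facet counts.
    // baseUrl and the "search"/"author" field names are assumptions.
    static String build(String baseUrl, String term) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "search:" + term);   // filter commits by the search term
        params.put("rows", "0");             // do not return the commits themselves
        params.put("facet", "true");         // ask for facet counts instead
        params.put("facet.field", "author"); // one bucket per author
        params.put("facet.mincount", "1");   // hide authors with no matching commits
        params.put("wt", "json");            // response format

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            String value = URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8);
            query.add(p.getKey() + "=" + value);
        }
        return baseUrl + "/select?" + query;
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8080/solr", "sql"));
    }
}
```

Setting `rows` to 0 is what makes this a reporting query rather than a search: the response carries only the facet counts, which Solr sorts by count by default, so the most active committers on a topic come first.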

The backend schema for this is fairly simple:

<field name="id" type="string" indexed="true" stored="false"/>
<field name="author" type="string" indexed="true" stored="true"/> 
<field name="company" type="string" indexed="true" stored="true"/> 
<field name="year" type="string" indexed="true" stored="true"/> 
<field name="email" type="string" indexed="true" stored="true"/> 
<field name="message" type="string" indexed="true" stored="true"/> 
<field name="search" type="string" indexed="true" stored="true" />

To populate this database, we need to do some work using JGit, a Java library maintained as part of Eclipse. To traverse source code history, you need to pick a starting point and work backwards – if you were to parse every branch, you would need to de-duplicate older commits.

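import java.io.File;
import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.revwalk.RevWalk;
import org.eclipse.jgit.storage.file.FileRepositoryBuilder;

// path points at the repository's .git directory
Repository repository = new FileRepositoryBuilder()
    .setGitDir(new File(path))
    .build();

RevWalk walk = new RevWalk(repository);

// Start from HEAD and walk backwards through its ancestors
ObjectId head = repository.resolve(Constants.HEAD);
walk.markStart(walk.parseCommit(head));

for (RevCommit commit : walk) {
  ...
}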

The final loop steps over each commit, and inside it you can step over each file the commit touched; you can construct patches here if you wish. For this application, it’s helpful to remove very large commits, where someone might have made mass refactorings, and to hide file deletions. (That said, there may be some value in reporting on who does mass changes, since this might also identify your senior developers.)
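
The filtering described here could be sketched roughly as follows. This is not the project’s actual code: `FileChange` and `ChangeType` are hypothetical stand-ins loosely modelled on the per-file diff entries JGit exposes, and the cut-off of 50 files is an arbitrary assumption you would tune to your own repositories.

```java
import java.util.ArrayList;
import java.util.List;

final class CommitFilter {

    // Hypothetical stand-ins for the per-file data a diff would give you.
    enum ChangeType { ADD, MODIFY, DELETE }

    static final class FileChange {
        final String path;
        final ChangeType type;

        FileChange(String path, ChangeType type) {
            this.path = path;
            this.type = type;
        }
    }

    // Assumed cut-off for what counts as a mass refactoring.
    static final int MAX_FILES_PER_COMMIT = 50;

    // Returns the changes worth indexing: nothing for oversized commits,
    // otherwise every change except file deletions.
    static List<FileChange> indexable(List<FileChange> changes) {
        List<FileChange> kept = new ArrayList<>();
        if (changes.size() > MAX_FILES_PER_COMMIT) {
            return kept;
        }
        for (FileChange change : changes) {
            if (change.type != ChangeType.DELETE) {
                kept.add(change);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileChange> changes = new ArrayList<>();
        changes.add(new FileChange("src/Api.java", ChangeType.MODIFY));
        changes.add(new FileChange("src/Old.java", ChangeType.DELETE));
        for (FileChange kept : indexable(changes)) {
            System.out.println(kept.path);
        }
    }
}
```

If you did want to report on who makes mass changes, the oversized commits could be routed to a separate report instead of being dropped.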

As we loop over the commits, we can construct “documents” to send to Solr. These are just maps containing all the attributes we want to save.

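import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Connections happen over HTTP:
HttpSolrServer server = new HttpSolrServer(
     "http://localhost:8080/solr");

Collection<SolrInputDocument> docs = new ArrayList<>();

SolrInputDocument doc = new SolrInputDocument();

// ID for the document should have enough information to
// find it in the source data. getName() returns the bare SHA-1 hash:
doc.addField("id", remoteUrl + "." + commit.getName());

doc.addField("author", commit.getAuthorIdent().getName());
doc.addField("email", commit.getAuthorIdent().getEmailAddress());
doc.addField("message", commit.getFullMessage());

// Any data we let the user search against is included in this value:
doc.addField("search", search);

docs.add(doc);

// Once every commit has been processed, send the batch and commit it:
server.add(docs);
server.commit();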
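
The schema also lists `year` and `company` fields, and the snippet uses a `search` value without showing where it comes from. One plausible way to derive all three is sketched below. The rules here are my assumptions for illustration, not necessarily what the project does: the year is read from the commit timestamp in UTC, the company is guessed from the e-mail domain, and the search value is simply the searchable attributes joined together.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Locale;

final class DerivedFields {

    // Year of the commit, from a timestamp in seconds since the epoch
    // (the unit JGit's RevCommit.getCommitTime() returns), read as UTC.
    static String year(int commitTimeSeconds) {
        return String.valueOf(
            Instant.ofEpochSecond(commitTimeSeconds)
                   .atZone(ZoneOffset.UTC)
                   .getYear());
    }

    // Assumption: the e-mail domain is a usable proxy for the company.
    static String company(String email) {
        int at = email.lastIndexOf('@');
        return at < 0 ? "" : email.substring(at + 1).toLowerCase(Locale.ROOT);
    }

    // Assumption: the "search" value is every searchable attribute,
    // joined with spaces.
    static String search(String author, String email, String message) {
        return String.join(" ", author, email, message);
    }

    public static void main(String[] args) {
        System.out.println(year(1383264000));
        System.out.println(company("Jane.Doe@Example.com"));
        System.out.println(search("Jane Doe", "jane@example.com", "Fix SQL join"));
    }
}
```

Each result would then be added to the document with `doc.addField(...)` alongside the author, e-mail and message.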

The final search results look like this:

demo2

Notably, all these people are knowledgeable about SQL, but from different angles: a committer on Postgres, a DBA, a principal on MySQL, and people who built PHP drivers for Oracle and SQL Server. Thus, through fairly simple means and a few hours of work, you can quickly build a reporting system that identifies key players in an organization. If you’re interested in this project, check out the slides from my talk, “Mining Source Code Repositories with Solr”, or the code for the project on GitHub.


3 comments ↓

#1 Keith Casey on 11.01.13 at 5:46 pm

Twilio evangelist here..

Very nifty. I use something similar to find hotspots in a codebase. As in the files/modules that change the most are probably the most unstable requirements, are the most problematic, riskiest, etc, etc.

On another note, you could add a little automation on the call routing side of things too. When a call comes in, before ringing the account or admin person, run through your checks and route it directly to the developers/team in question.

If you wanted to get really creative, you could check your CRM to see if they have a bill overdue.. and if so, route it to your accounting person first. ;)

#2 Robert Elwell on 11.01.13 at 6:37 pm

This is a pretty cool idea! Here are some thoughts I’ve got on it:

A lot of experts don’t actually mention the core technology that much in their commits because it’s implicit. You may need to develop a taxonomy of sub-topics and then derive their parent concept in order to best aggregate subject matter expertise.

Common commit messages that include the name of the core technology involve activation of features in a stack (ops). Another putative example would be integration of a feature between two layers within a system, which may not require subject matter expertise if the edges of those layers have a well-documented API.

So for instance, let’s say a Solr expert sets up a RequestHandler to only require the ‘qq’ parameter for an eDisMax query. His commit message probably says, “Configured request handler to receive query parameters from app.”

It’s the job of a mid-level application engineer to write the form and controller logic to receive Solr results, so his commit says, “Send search request from form input to Solr.”

So who’s the expert? :) I spent a solid year and a half working on Solr at Wikia and only 2% of my 1700 commit messages in our application code include the term “solr”.

I think that this is one of those problems where information retrieval just gets us to the boundaries of a deeper problem.

Anyhow, this is a very interesting concept, with a lot of future work, and certainly some value to offer the engineering/hr world. Good work!

#3 Poopenstein on 11.06.13 at 3:08 am

git log --grep='solr' --format='format:%cn' | sort | uniq -c | sort -n
