Parsing Code in a Search Engine with Node.js

In my last article I presented a simple tool for discovering senior developers from source code history, which imports Git history into a search engine. This lets you do searches like “who is our SharePoint expert,” or “who fixed all our Oracle problems.” There are a few code search products that handle similar problems, e.g. Krugle, GitHub, and Fisheye; however, they tend to focus on who made specific changes, rather than on finding out who ran a project.

Version control history typically shows file changes as text diffs. In theory we could instead reconstruct each file before and after a change and compare the parse trees, but this is impractical in the general case; read on and I’ll show a practical way to handle it.

[Screenshot: demo search results]

These results can be obtained from commit history, which includes metadata for commits (e.g. the author and commit comments), as well as the actual changes.

Typically full-text search engines (e.g. Solr) use natural language techniques to make text searchable, assuming that the text in question is English. NLP as a discipline is largely a list of language-specific tricks. This is a lossy process, where words are changed or removed, so that “similar” words are actually treated the same:

The dogs ran home => dog run home
The dog runs home => dog run home
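
You can see the same normalization with any Porter-style stemmer. Here is a minimal sketch using the natural package (my choice for illustration, not something the project depends on):

var natural = require('natural');

var tokenizer = new natural.WordTokenizer();

// break the sentence into words, lower-case them, and stem each one
var stems = tokenizer.tokenize('The dog runs home').map(function (word) {
  return natural.PorterStemmer.stem(word.toLowerCase());
});

console.log(stems); // e.g. [ 'the', 'dog', 'run', 'home' ]

A real search pipeline would also drop stop words like “the”, which is how the examples above lose their articles.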

This treatment suits commit messages, which are ordinary English, but not so much code. In this article I’m going to show how to extract comments (a common request is searching Javadocs), but you could easily use the same technique to pull out keywords and other code features.

The following diagram shows the indexing process. Our ETL process receives a sequence of file diffs and handles each one in turn. The API allows you to pull either just the lines that changed or the entire file, so you need to decide up front which you want.

[Diagram: indexing pipeline, from file diffs through the ETL process into the search index]
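
As a rough sketch of the extract step, here is one way to produce that stream of patches with plain `git log -p` through `child_process` (my own simplification; the actual project may read history through a different API):

var execFile = require('child_process').execFile;

// -p emits a patch per commit; -U99999 asks for enough context lines that the
// whole file usually appears, not just the lines that changed.
execFile('git', ['log', '-p', '-U99999', '--no-color'],
  { maxBuffer: 100 * 1024 * 1024 },
  function (err, stdout) {
    if (err) {
      throw err;
    }

    // every commit in the output starts with a line like "commit <sha>"
    var commits = stdout.split(/\ncommit /).filter(Boolean);
    console.log('commits to process: ' + commits.length);
  });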

Here are a few test cases for comments. If comments were all you wanted, these would be relatively easy to find with a regular expression:

@@ -1,9 +1,9 @@
/* 
* Here is a multi-line comment.
*/
String a = "x"; // here is a comment at the end of a line

-String b /* a comment in a line */ = "y"; 
+String b /* a long comment in a line */ = "y"; 

// Commented out code:
// String c = "d";

This sample is output from git, and it shows a few interesting problems. One is determining whether a line is really English or just commented-out code. When someone adds a single word to a comment, do we credit them with the entire comment (remember, I’m trying to find out who ran a project)? Even then, we need enough lines of context to find the beginning and end of the comment.
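
As a minimal sketch of the regular-expression approach (my own illustration, not the project’s actual ETL code), this pulls the added lines out of a unified diff and keeps the ones that look like C-style comments:

// matches lines added by the diff, but not the '+++' file header
var addedLine = /^\+(?!\+\+)(.*)$/;
// matches lines that start with //, /* or a continuation *
var commentish = /^\s*(\/\/|\/\*|\*)/;

function addedComments(diffText) {
  return diffText.split('\n')
    .map(function (line) {
      var match = addedLine.exec(line);
      return match && match[1];
    })
    .filter(function (line) {
      return line !== null && commentish.test(line);
    });
}

console.log(addedComments('+// a new comment\n+String b = "y";'));
// [ '// a new comment' ]

Even this small sketch misses inline comments and happily keeps commented-out code.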

These questions can be resolved by agreeing on business rules for the product being built; true code parsing is much harder in the general case. The following is a sample adapted from the documentation for JSX, one of many new languages that compile to Javascript.

var MarkdownEditor = React.createClass({
  render: function() {
    return (
      <div className="MarkdownEditor">
        <h3>Input</h3>
        <textarea ref="textarea" />
      </div>
    );
  }
});

This has XML inside Javascript (!), which the compiler rewrites into something sane. With all the new languages like this, you’d be hard pressed to build a generalized solution to code parsing; and anyway, in my experience, large software projects all eventually acquire their own domain-specific languages.

Having established that accurate code parsing is out of the question, we might look at how syntax highlighters work. Most programming blogs have tools to highlight snippets, which is essentially the same problem, since the highlighter can’t resolve API references and has to deal with typos. There are numerous IDEs and plugins that support highlighting (e.g. ctags in Vim), so clearly this can’t be as hard a problem as I’ve made it out to be. With some quick searching you’ll find at least a half dozen tools for this; many are Javascript libraries that like to consume and produce HTML.

Shown here is an example in Node.js, using highlight.js to tokenize and highlight code:

var hljs = require('highlight.js');

var fs = require('fs');
fs.readFile('./reveal.js', 'utf8', function read(err, code) {
  if (err) {
    throw err;
  }

  // highlight.js returns HTML, with each token wrapped in <span class="...">
  var output = hljs.highlightAuto(code).value;
  output = output.replace(/<\/span>/g, '\n');

  // capture the token type (the span's class name) and the token text
  var spans = new RegExp('<span class="([^"]+)">(.+)', 'g');
  output.replace(spans, function(match) {
    var span = new RegExp('<span class="([^"]+)">(.+)');
    var values = span.exec(match);

    var type = values[1];
    var string = values[2];

    var document = {
      type: type,
      string: string
    };

    console.log(JSON.stringify(document));
  });
});

Since the library only gives you HTML, you have to do a little regex trickery to get useful output. As you can see below, this gives pretty good results (anything it doesn’t recognize, it skips). When you’re populating a search engine’s index, it typically doesn’t matter what order the words go in, unless you search against the same field you display.

{"type":"comment","string":"// Checks if reveal.js has been loaded"}
{"type":"function","string":"function"}
{"type":"params","string":"()"}
{"type":"keyword","string":"return"}
{"type":"comment","string":"// Forward event binding to the reveal DOM element"}
{"type":"function","string":"function"}
{"type":"params","string":"( type, listener, useCapture )"}

In the highlight.js code, we can see this works by applying regexes, as I suggested above:

  this.C_LINE_COMMENT_MODE = {
    className: 'comment',
    begin: '//', end: '$'
  };
  this.C_BLOCK_COMMENT_MODE = {
    className: 'comment',
    begin: '/\\*', end: '\\*/'
  };
  this.HASH_COMMENT_MODE = {
    className: 'comment',
    begin: '#', end: '$'
  };

With this information in hand, you could easily filter to just comments, and push that data into a full-text index (the code here is for Lunr.js, a Javascript library that resembles Lucene).

var lunr = require('lunr');

var index = lunr(function () {
  this.field('comments')
  this.ref('id')
});

index.add({
  id: 1,
  comments: '// Checks if reveal.js has been loaded'
});

index.add({
  id: 2,
  comments: '// Forward event binding to the reveal DOM element'
});
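
Searching that index then returns the matching document refs (this assumes the pre-2.x lunr API used above, where documents are added after the index is built):

var results = index.search('binding');
console.log(results); // e.g. [ { ref: 2, score: 0.7 } ]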

This works well for building a search engine for developers, but if this were a customer-facing product and it was 95% correct, you would have to decide whether it’s worth the cost of the call volume from the remaining 5%. In a search engine built on scraped or scanned data, support issues of this kind are inevitable and must be planned for: any project that migrates data must decide which specific cases it intends to handle.

If you’re interested in this project, check out the slides from my talk, “Mining Source Code Repositories with Solr”, or the code for the project on GitHub.