Identifying important keywords using Lunr.js and the Blekko API

Lunr.js is a simple full-text search engine written in JavaScript. Full-text search ranks the documents returned for a query by how closely they resemble it, based on word frequency and grammatical considerations: frequently occurring words have minimal effect, whereas a rare word that occurs several times in a document boosts its ranking significantly. This hearkens back to the days of ’90s search engines, when keyword stuffing was a valuable SEO tactic.

The ranking formula used in full-text search is called tf-idf, which stands for term frequency / inverse document frequency, a hint at how relevance is computed. This requires the indexing software to measure how often words occur within a document, within a query, and across the entire corpus. Lunr.js has a series of internal functions and objects to track word frequency, and is easy to customize:

lunr.Index.prototype.idf = function (term) {
  if (this._idfCache[term]) return this._idfCache[term]

  // Original Lunr behavior: count how many local documents contain the term
  // var documentFrequency = this.tokenStore.count(term)
  // Instead, ask Blekko how common the term is across the web:
  var documentFrequency = blekko(term),
      idf = 1

  if (term === "") documentFrequency = 1

  if (documentFrequency > 0) {
    idf = 1 + Math.log(this.tokenStore.length / documentFrequency)
  }

  return this._idfCache[term] = idf
};
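
For context, the full tf-idf score is roughly a term's frequency within a document multiplied by that idf value. Here is a minimal illustrative sketch of the formula, with made-up argument names, not Lunr's internal code:

// Illustrative tf-idf sketch; the names are assumptions, not Lunr internals.
// tf: how much of the document is this term; idf: how rare the term is overall.
function tfidf(termCountInDoc, docLength, docsContainingTerm, totalDocs) {
  var tf = termCountInDoc / docLength;
  var idf = 1 + Math.log(totalDocs / docsContainingTerm);
  return tf * idf;
}

// A rare word mentioned a few times outweighs a common word mentioned just as often:
tfidf(3, 100, 5, 1000);   // rare word  -> high score
tfidf(3, 100, 900, 1000); // common word -> low score

The override above swaps out only the idf half: instead of counting how many documents in the local index contain the word, it asks Blekko how common the word is.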

I thought it’d be interesting to pull word frequencies from a search engine instead, since it’s hard to get good numbers from a small collection of documents. The aim is to show “relevant keywords” for website content, and this technique has the nice property of tending to ignore both very common words and phrases that have been spammed to death. The code below retrieves the counts from Blekko’s API; to avoid cross-domain AJAX issues, I run the queries through a proxy.

var blekko_cache = {};

function blekko(query) {
  var result = blekko_cache[query];
  if (result !== undefined) return result;

  // Synchronous request, so the idf() override above can return a value immediately.
  $.ajax({
    url: 'http://www.garysieling.com/poc/lunrkw/proxy.php?query=' + encodeURIComponent(query),
    async: false
  }).done(function (data) {
    if (data === "") {
      // No data: treat the term as astronomically common so it scores near zero.
      result = 1e21;
    } else {
      var json = JSON.parse(data);

      result = json.universal_total_results;
      if (result) {
        // Blekko reports counts like "5M" or "120K"; expand the suffixes before parsing.
        result = result.replace('M', '000000');
        result = result.replace('K', '000');
        result = parseInt(result, 10);
      } else {
        result = 1e21;
      }
    }
  });

  blekko_cache[query] = result;

  return result;
}
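
The proxy just needs to fetch the Blekko API server-side and hand the JSON back from the page's own origin, so the browser never makes a cross-domain request. Below is a rough sketch of that idea in Node.js; it is not the proxy.php actually used, and the endpoint URL is a placeholder rather than Blekko's real API:

// Hypothetical stand-in for proxy.php: forward the query to the search API
// and relay the raw JSON back to the page from the same origin.
var http = require('http');
var https = require('https');
var url = require('url');

var API_URL = 'https://example.invalid/api/search?q='; // placeholder, not the real endpoint

http.createServer(function (req, res) {
  var query = url.parse(req.url, true).query.query || '';

  https.get(API_URL + encodeURIComponent(query), function (apiRes) {
    var body = '';
    apiRes.on('data', function (chunk) { body += chunk; });
    apiRes.on('end', function () {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(body);
    });
  }).on('error', function () {
    // An empty body makes the client above fall back to its huge default count.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end('');
  });
}).listen(8080);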

To populate the index, go through these steps:

  • Generate a list of unique words.
  • Collect all uses of each word into one ‘document’.
  • Stick each batch of words into the index.

Note also that I removed the stemmer; otherwise the stems of words get sent to Blekko during the ranking process, which skews the results. This technique also has no concept of context: for instance, “D3” is a Cadillac model, a vitamin, a Nikon SLR, and a JavaScript library.

var index = lunr(function () {
  this.field('word')
  this.ref('id')
});

// Remove the stemmer so whole words, not stems, get looked up against Blekko.
index.pipeline.remove(lunr.stemmer);

// 'text' holds the website content being analyzed.
// Step 1: split it into words and count how often each unique word appears.
var items = text.split(/[ ()'{0123456789}"\[\].:;+$,..-]/);
var words = {};
$.each(items, function (index, word) {
  if (word.length < 4) {
    return; // ignore very short words
  }
  if ("" !== word) {
    var lword = word.toLowerCase();
    words[lword] = (words[lword] ? words[lword] : 0) + 1;
  }
});

// Steps 2 and 3: build one 'document' per unique word, repeating the word
// (count squared) times so its local frequency carries through, then index it.
var docs = [];
var id = 0;
$.each(words, function (k, v) {
  var wordlist = '';
  for (var i = 0; i < v * v; i++) {
    wordlist = wordlist + ' ' + k;
  }

  docs[id] = k;
  index.add({
    id: id++,
    word: wordlist
  });
});

To retrieve the keywords, search the Lunr index with a blank query. Normally a blank query returns nothing, so I modified Lunr to return every document instead.
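
One way to do that, assuming the pre-1.0 Lunr internals used above, is to wrap search() so a blank query falls back to listing every indexed document; the documentStore.store field is an assumption about those internals, and the score here is just a placeholder:

// Sketch only: make a blank query return every document, unranked.
var originalSearch = lunr.Index.prototype.search;
lunr.Index.prototype.search = function (query) {
  if (query && query.trim() !== '') {
    return originalSearch.call(this, query);
  }

  var results = [];
  for (var ref in this.documentStore.store) {
    results.push({ ref: ref, score: 0 }); // placeholder score
  }
  return results;
};

With everything returned, the loop below looks up each original word by its ref and prints it alongside its cached Blekko count: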

var printed = {};
var topcnt = 250;
$.each(index.search(""),
  function (i, d) {
    var ref = parseInt(d.ref, 10);
    var word = docs[ref];
    if (printed[word]) return;                         // skip duplicates
    if (blekko_cache[word] === undefined) return;      // skip words Blekko returned no count for
    if (word.substr(word.length - 2) === 'ly') return; // crude filter for adverbs
    if (topcnt < 0) return;                            // cap the output

    topcnt--;
    console.log(word + " (" + blekko_cache[word] + ")");
    printed[word] = true;
  }
);

Here's what the results look like:

hooks (980)
stumbled (968)
doc_num (963)
splits (957)
paints (948)
parameters (939)
indexed (927)
realm (916)
minifies (915)
python (912)
underscores (907)
unrelated (905)
replacements (903)
irrelevant (900)
closures (888)
unfinished (878)
summaries (877)
algorithms (873)
metrics (870)
painters (869)
manipulation (864)
facet (852)
clone (849)
occurrence (843)
defects (840)
brennan (837)
stains (830)
risen (824)
catenate (823)
richer (821)
packets (815)
commits (804)
mock (802)
sorting (775)
documenting (771)
visualization (768)
twitter (762)
recursed (761)
clicked (760)
lends (757)
hacked (755)
listens (747)
folders (745)
variables (742)
encrypted (736)
differs (736)
litigation (733)
tighter (729)
naive (725)
whipped (718)
smoother (714)
numpages (709)
loser (707)
override (703)
bins (693)
protections (691)
exposes (689)
ceramic (677)
programmer (676)
buttongroup (674)
wrapper (661)
facets (652)
oracle (644)