{"id":1197,"date":"2013-06-20T12:00:08","date_gmt":"2013-06-20T12:00:08","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1197"},"modified":"2013-06-20T12:00:08","modified_gmt":"2013-06-20T12:00:08","slug":"identifying-important-keywords-using-lunr-js-and-the-blekko-api","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/identifying-important-keywords-using-lunr-js-and-the-blekko-api\/","title":{"rendered":"Identifying important keywords using Lunr.js and the Blekko API"},"content":{"rendered":"<p>Lunr.js is a simple full-text search engine written in JavaScript. Full-text search ranks the documents returned from a query by how closely they resemble the query, based on word frequency and grammatical considerations &#8211; frequently occurring words have minimal effect, whereas a rare word that occurs several times in a document boosts its ranking significantly. This hearkens back to the days of &#8217;90s search engines, where keyword stuffing was a valuable SEO tactic.<\/p>\n<p>The ranking formula used in full-text search is called tf-idf, which stands for term frequency \/ inverse document frequency &#8211; an indication of how the relevance is computed. This requires the indexing software to measure the frequencies of words within a document, within a query, and across the entire corpus. 
Lunr.js has a series of internal functions and objects that track word frequency, and it is easy to customize. Here the idf function is overridden so that the document frequency comes from Blekko instead of the local token store:<\/p>\n<pre lang=\"Javascript\">\nlunr.Index.prototype.idf = function (term) {\n  if (this._idfCache[term]) return this._idfCache[term]\n\n  \/\/ Original lunr lookup, replaced by the Blekko count below:\n  \/\/ var documentFrequency = this.tokenStore.count(term)\n  \/\/ A blank term skips the remote lookup and counts as frequency 1.\n  var documentFrequency = (term === \"\") ? 1 : blekko(term),\n      idf = 1\n\n  if (documentFrequency > 0) {\n    idf = 1 + Math.log(this.tokenStore.length \/ documentFrequency)\n  }\n\n  return this._idfCache[term] = idf\n};\n<\/pre>\n<p>With only a small number of documents it&#8217;s hard to get good frequency numbers, so I thought it&#8217;d be interesting to take word frequencies from a search engine instead. The aim is to show &#8220;relevant keywords&#8221; for website content &#8211; this technique has the nice property of tending to ignore very common words, as well as phrases that have been spammed to death. The code below retrieves the counts from Blekko&#8217;s API &#8211; to avoid cross-domain AJAX issues, I run the queries through a proxy.<\/p>\n<pre lang=\"Javascript\">\n\/\/ Cache of result counts keyed by query, so each word is only requested once.\nvar blekko_cache = {};\n\nfunction blekko(query) {\n  var result = blekko_cache[query];\n  if (result !== undefined) return result;\n\n  \/\/ Synchronous request, so that idf() can use the count immediately.\n  $.ajax({\n    url: 'http:\/\/www.garysieling.com\/poc\/lunrkw\/proxy.php?query=' + encodeURIComponent(query),\n    async: false\n  }).done(function (data) {\n    if (data === \"\") {\n      \/\/ No response body: treat the word as extremely common.\n      result = 1000000000000000000000;\n    } else {\n      var json = JSON.parse(data);\n\n      result = json.universal_total_results;\n      if (result) {\n        \/\/ Expand the M and K suffixes into zeros before parsing.\n        result = result.replace('M', '000000');\n        result = result.replace('K', '000');\n        result = parseInt(result, 10);\n      } else {\n        result = 1000000000000000000000;\n      }\n    }\n  });\n\n  blekko_cache[query] = result;\n\n  return result;\n}\n<\/pre>\n<p>To populate the index, go through these steps:<\/p>\n<ul>\n<li>Generate a list of unique words.<\/li>\n<li>Collect all uses of each word into one 
&#8216;document&#8217;.<\/li>\n<li>Add each batch of words to the index.<\/li>\n<\/ul>\n<p>Note also that I removed the stemmer; otherwise the stems of words, rather than the words themselves, are sent to Blekko during the ranking process, which skews the results. This technique also has no concept of context &#8211; for instance, &#8220;D3&#8221; is a model of Cadillac, a vitamin, a Nikon SLR model, and a JavaScript library.<\/p>\n<pre lang=\"Javascript\">\nvar index = lunr(function () {\n  this.field('word')\n  this.ref('id')\n});\n\n\/\/ Index whole words rather than stems.\nindex.pipeline.remove(lunr.stemmer);\n\nvar items = text.split(\/[ ()'{0123456789}\"\\[\\].:;+$,..-]\/);\n\n\/\/ Count how often each word occurs, ignoring case.\nvar words = {};\n$.each(items, function (i, word) {\n  \/\/ Skip short words (this also drops the empty strings left by split).\n  if (word.length < 4) {\n    return;\n  }\n  var lword = word.toLowerCase();\n  words[lword] = (words[lword] ? words[lword] : 0) + 1;\n});\n\nvar docs = [];\nvar id = 0;\n$.each(words, function (k, v) {\n  \/\/ Repeat the word count-squared times, which raises its term frequency in the index.\n  var wordlist = '';\n  for (var i = 0; i < v * v; i++) {\n    wordlist = wordlist + ' ' + k;\n  }\n\n  docs[id] = k;\n  index.add({\n    id: id++,\n    word: wordlist\n  });\n});\n<\/pre>\n<p>To retrieve the results, search the lunr index with a blank query. Normally lunr returns nothing for a blank query, so I modified it to return all results.<\/p>\n<pre lang=\"Javascript\">\nvar printed = {};\nvar topcnt = 250;\n$.each(index.search(\"\"),\n  function (i, d) {\n    var ref = parseInt(d.ref, 10);\n    var word = docs[ref];\n    if (printed[word]) return;\n    if (blekko_cache[word] === undefined) return;\n    \/\/ Skip words ending in ly.\n    if (word.substr(word.length - 2) === 'ly') return;\n    \/\/ Stop printing once topcnt words have been shown.\n    if (topcnt <= 0) return;\n\n    topcnt--;\n    console.log(word + \" (\" + blekko_cache[word] + \")\")\n    printed[word] = true;\n  }\n);\n<\/pre>\n<p>Here's what the results look like:<\/p>\n<pre>\nhooks (980)\nstumbled (968)\ndoc_num (963)\nsplits (957)\npaints (948)\nparameters (939)\nindexed (927)\nrealm (916)\nminifies (915)\npython (912)\nunderscores (907)\nunrelated (905)\nreplacements 
(903)\nirrelevant (900)\nclosures (888)\nunfinished (878)\nsummaries (877)\nalgorithms (873)\nmetrics (870)\npainters (869)\nmanipulation (864)\nfacet (852)\nclone (849)\noccurrence (843)\ndefects (840)\nbrennan (837)\nstains (830)\nrisen (824)\ncatenate (823)\nricher (821)\npackets (815)\ncommits (804)\nmock (802)\nsorting (775)\ndocumenting (771)\nvisualization (768)\ntwitter (762)\nrecursed (761)\nclicked (760)\nlends (757)\nhacked (755)\nlistens (747)\nfolders (745)\nvariables (742)\nencrypted (736)\ndiffers (736)\nlitigation (733)\ntighter (729)\nnaive (725)\nwhipped (718)\nsmoother (714)\nnumpages (709)\nloser (707)\noverride (703)\nbins (693)\nprotections (691)\nexposes (689)\nceramic (677)\nprogrammer (676)\nbuttongroup (674)\nwrapper (661)\nfacets (652)\noracle (644)\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Lunr.js is a simple full-text search engine written in JavaScript. Full-text search ranks the documents returned from a query by how closely they resemble the query, based on word frequency and grammatical considerations &#8211; frequently occurring words have minimal effect, whereas a rare word that occurs several times in a document boosts its ranking significantly. 
This &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/identifying-important-keywords-using-lunr-js-and-the-blekko-api\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Identifying important keywords using Lunr.js and the Blekko API&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5,6,11,13],"tags":[89,140,302,359,385,432],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1197"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1197"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1197\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1197"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1197"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1197"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}