{"id":1361,"date":"2013-07-09T11:39:42","date_gmt":"2013-07-09T11:39:42","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1361"},"modified":"2013-07-09T11:39:42","modified_gmt":"2013-07-09T11:39:42","slug":"creating-n-gram-indexes-with-python","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/","title":{"rendered":"Creating N-Gram Indexes with Python"},"content":{"rendered":"<p>&#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0596516495&#038;linkCode=as2&#038;tag=thesecrelifeo-20\">Natural Language Processing with Python<\/a>&#8221; (<a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\">read my review<\/a>) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time &#8211; I have a corpus of legal texts, and build a set of n-gram indices from it.<\/p>\n<p>The first index is a list of just tokenized text, with all text contents combined. You may wish to add a special token that indicates where files end.<\/p>\n<pre lang=\"python\">\nimport nltk;\nimport os;\nimport re;\n\nall_tokens = []\nidx = 0\nfor root, directories, filenames in os.walk('.'):\n  for file in filenames:\n    if file.endswith('.txt'):\n      idx = idx + 1\n      contents = open(file)\n      raw = contents.read()\n      lc_raw = raw.lower()\n      new_tokens = nltk.word_tokenize(lc_raw)\n      new_tokens_filtered = [w for w in new_tokens \n                             if len(w) < 20 and \n                             (re.search('^[a-zA-Z]+$', w) or len(w) == 1)]\n      all_tokens = all_tokens + new_tokens_filtered\n\nfor token in all_tokens:\n    print token\n<\/pre>\n<p>This filters the tokens - depending on the situation you may wish to modify this (e.g. whether you want punctuation or not).<\/p>\n<p>The following turns the token index into an n-gram index. This runs pretty quickly while there is memory - this could easily be extended to work on parts of a document set, and combine with the results with a map-reduce operation by addition, especially if the output was sorted. One surprising result is that the 5-gram index can be quite a bit larger than the original data. <\/p>\n<pre lang=\"python\">\nimport nltk;\nimport os;\n\ncontents = open('all_tokens5', 'rU')\n\nn1_gram_word = ' '\nn2_gram_word = ' ' \nn3_gram_word = ' '\nn4_gram_word = ' '\nn5_gram_word = ' '\n\nn1_counts = {}\nn2_counts = {}\nn3_counts = {}\nn4_counts = {}\nn5_counts = {}\n\nindex = 0\n\ndef incr(c, w):\n  try:\n    c[w] = c[w] + 1\n  except:\n    c[w] = 1\n\nfor word in contents:\n  index = index + 1\n\n  if (index % 10000 == 0):\n    print \"Processed %-d words\" % (index)\n\n  # defects: loses last character, loses EOF, adds bigrams\/trigrams at starts\n  (n5_gram_word, n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word) = \\\n    (n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word, word[:-1])   \n  \n  n1_gram = n1_gram_word\n  n2_gram = n2_gram_word + ' ' + n1_gram\n  n3_gram = n3_gram_word + ' ' + n2_gram\n  n4_gram = n4_gram_word + ' ' + n3_gram\n  n5_gram = n5_gram_word + ' ' + n4_gram\n  \n  incr(n1_counts, n1_gram)\n  incr(n2_counts, n2_gram)\n  incr(n3_counts, n3_gram)\n  incr(n4_counts, n4_gram)\n  incr(n5_counts, n5_gram)\n\ncontents.close()\n\ndef save_ngram(c, f):\n  output = open(f, 'w')\n  ordered = sorted(c.items(), lambda a, b: cmp(c[a[0]], c[b[0]]))\n  for a, b in ordered:\n    output.write('%-s %-d\\n' % (a, b))\n\n  output.close()\n\nsave_ngram(n1_counts, '1gram')\nsave_ngram(n2_counts, '2gram')\nsave_ngram(n3_counts, '3gram')\nsave_ngram(n4_counts, '4gram')\nsave_ngram(n5_counts, '5gram')\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Natural Language Processing with Python&#8221; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time &#8211; I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Creating N-Gram Indexes with Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5],"tags":[385,386,447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1361"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1361"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1361\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}