{"id":1361,"date":"2013-07-09T11:39:42","date_gmt":"2013-07-09T11:39:42","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1361"},"modified":"2013-07-09T11:39:42","modified_gmt":"2013-07-09T11:39:42","slug":"creating-n-gram-indexes-with-python","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/","title":{"rendered":"Creating N-Gram Indexes with Python"},"content":{"rendered":"<p>&#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0596516495&#038;linkCode=as2&#038;tag=thesecrelifeo-20\">Natural Language Processing with Python<\/a>&#8221; (<a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\">read my review<\/a>) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time &#8211; I have a corpus of legal texts, and I built a set of n-gram indices from it.<\/p>\n<p>The first index is just a list of tokens, with the contents of all the text files combined. You may wish to add a special token that indicates where files end.<\/p>\n<pre lang=\"python\">\nimport os\nimport re\n\nimport nltk\n\nall_tokens = []\nidx = 0\nfor root, directories, filenames in os.walk('.'):\n  for file in filenames:\n    if file.endswith('.txt'):\n      idx = idx + 1\n      # os.walk yields bare file names, so join them with their directory\n      with open(os.path.join(root, file)) as contents:\n        raw = contents.read()\n      lc_raw = raw.lower()\n      new_tokens = nltk.word_tokenize(lc_raw)\n      # keep alphabetic words shorter than 20 characters, plus any single-character token\n      new_tokens_filtered = [w for w in new_tokens\n                             if len(w) < 20 and\n                             (re.search('^[a-zA-Z]+$', w) or len(w) == 1)]\n      all_tokens.extend(new_tokens_filtered)\n\n# print one token per line\nfor token in all_tokens:\n    print(token)\n<\/pre>\n<p>The list comprehension filters the tokens; depending on the situation you may wish to modify it (e.g. 
whether you want punctuation or not).<\/p>\n<p>The following turns the token index into n-gram indices for n = 1 to 5. It reads the tokens one per line from a file (here named all_tokens5) and runs pretty quickly as long as the counts fit in memory. It could easily be extended to work on parts of a document set and to combine the partial results with a map-reduce operation that adds the counts, especially if the output were sorted. One surprising result is that the 5-gram index can be quite a bit larger than the original data.<\/p>\n<pre lang=\"python\">\ncontents = open('all_tokens5', 'r')\n\nn1_gram_word = ' '\nn2_gram_word = ' '\nn3_gram_word = ' '\nn4_gram_word = ' '\nn5_gram_word = ' '\n\nn1_counts = {}\nn2_counts = {}\nn3_counts = {}\nn4_counts = {}\nn5_counts = {}\n\nindex = 0\n\ndef incr(c, w):\n  # count one more occurrence of n-gram w\n  c[w] = c.get(w, 0) + 1\n\nfor word in contents:\n  index = index + 1\n\n  if (index % 10000 == 0):\n    print(\"Processed %d words\" % (index))\n\n  # strip only the line ending, so a final line without one keeps its last character\n  # remaining defects: loses EOF, adds bigrams\/trigrams at starts\n  (n5_gram_word, n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word) = \\\n    (n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word, word.rstrip('\\r\\n'))\n\n  n1_gram = n1_gram_word\n  n2_gram = n2_gram_word + ' ' + n1_gram\n  n3_gram = n3_gram_word + ' ' + n2_gram\n  n4_gram = n4_gram_word + ' ' + n3_gram\n  n5_gram = n5_gram_word + ' ' + n4_gram\n\n  incr(n1_counts, n1_gram)\n  incr(n2_counts, n2_gram)\n  incr(n3_counts, n3_gram)\n  incr(n4_counts, n4_gram)\n  incr(n5_counts, n5_gram)\n\ncontents.close()\n\ndef save_ngram(c, f):\n  # write one line per n-gram with its count, least frequent first\n  output = open(f, 'w')\n  ordered = sorted(c.items(), key=lambda item: item[1])\n  for a, b in ordered:\n    output.write('%s %d\\n' % (a, b))\n\n  output.close()\n\nsave_ngram(n1_counts, '1gram')\nsave_ngram(n2_counts, '2gram')\nsave_ngram(n3_counts, '3gram')\nsave_ngram(n4_counts, '4gram')\nsave_ngram(n5_counts, '5gram')\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Natural Language Processing with Python&#8221; (read my 
review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time &#8211; I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Creating N-Gram Indexes with Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5],"tags":[385,386,447],"aioseo_notices":[],"aioseo_head":"\n\t\t<!-- All in One SEO 4.9.9 - aioseo.com -->\n\t<meta name=\"description\" content=\"&quot;Natural Language Processing with Python&quot; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,\" \/>\n\t<meta name=\"robots\" content=\"max-image-preview:large\" \/>\n\t<meta name=\"author\" content=\"gary\"\/>\n\t<link rel=\"canonical\" href=\"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/\" \/>\n\t<meta name=\"generator\" content=\"All in One SEO (AIOSEO) 4.9.9\" \/>\n\t\t<meta property=\"og:locale\" content=\"en_US\" \/>\n\t\t<meta property=\"og:site_name\" content=\"Gary Sieling - Software Engineer\" \/>\n\t\t<meta property=\"og:type\" content=\"article\" \/>\n\t\t<meta property=\"og:title\" content=\"Creating N-Gram Indexes with Python - Gary Sieling\" \/>\n\t\t<meta property=\"og:description\" content=\"&quot;Natural Language Processing with Python&quot; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text,\" \/>\n\t\t<meta property=\"og:url\" content=\"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/\" \/>\n\t\t<meta property=\"article:published_time\" content=\"2013-07-09T11:39:42+00:00\" \/>\n\t\t<meta property=\"article:modified_time\" content=\"2013-07-09T11:39:42+00:00\" \/>\n\t\t<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n\t\t<meta name=\"twitter:title\" content=\"Creating N-Gram Indexes with Python - Gary Sieling\" \/>\n\t\t<meta name=\"twitter:description\" content=\"&quot;Natural Language Processing with Python&quot; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,\" \/>\n\t\t<script type=\"application\/ld+json\" class=\"aioseo-schema\">\n\t\t\t{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"BlogPosting\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#blogposting\",\"name\":\"Creating N-Gram Indexes with Python - Gary Sieling\",\"headline\":\"Creating N-Gram Indexes with Python\",\"author\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"publisher\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\"},\"datePublished\":\"2013-07-09T11:39:42+00:00\",\"dateModified\":\"2013-07-09T11:39:42+00:00\",\"inLanguage\":\"en-US\",\"commentCount\":6,\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#webpage\"},\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#webpage\"},\"articleSection\":\"Data Mining, nlp, nltk, python\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#breadcrumblist\",\"itemListElement\":[{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog#listItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.garysieling.com\\\/blog\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"name\":\"Data Mining\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"position\":2,\"name\":\"Data Mining\",\"item\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#listItem\",\"name\":\"Creating N-Gram Indexes with 
Python\"},\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog#listItem\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#listItem\",\"position\":3,\"name\":\"Creating N-Gram Indexes with Python\",\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"name\":\"Data Mining\"}}]},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\",\"name\":\"Gary Sieling\",\"description\":\"Software Engineer\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/\",\"name\":\"gary\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#authorImage\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0be925276d848ffe98a6a9dc8cf33e67?s=96&d=identicon&r=g\",\"width\":96,\"height\":96,\"caption\":\"gary\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#webpage\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/\",\"name\":\"Creating N-Gram Indexes with Python - Gary Sieling\",\"description\":\"\\\"Natural Language Processing with Python\\\" (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,\",\"inLanguage\":\"en-US\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#website\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/creating-n-gram-indexes-with-python\\\/#breadcrumblist\"},\"author\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"creator\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"datePublished\":\"2013-07-09T11:39:42+00:00\",\"dateModified\":\"2013-07-09T11:39:42+00:00\"},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/\",\"name\":\"Gary Sieling\",\"description\":\"Software Engineer\",\"inLanguage\":\"en-US\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\"}}]}\n\t\t<\/script>\n\t\t<!-- All in One SEO -->\n\n","aioseo_head_json":{"title":"Creating N-Gram Indexes with Python - Gary Sieling","description":"\"Natural Language Processing with Python\" (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,","canonical_url":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/","robots":"max-image-preview:large","keywords":"","webmasterTools":{"miscellaneous":""},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"BlogPosting","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#blogposting","name":"Creating N-Gram Indexes with Python - Gary Sieling","headline":"Creating N-Gram Indexes with Python","author":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"publisher":{"@id":"https:\/\/www.garysieling.com\/blog\/#organization"},"datePublished":"2013-07-09T11:39:42+00:00","dateModified":"2013-07-09T11:39:42+00:00","inLanguage":"en-US","commentCount":6,"mainEntityOfPage":{"@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#webpage"},"isPartOf":{"@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#webpage"},"articleSection":"Data Mining, nlp, nltk, python"},{"@type":"BreadcrumbList","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#breadcrumblist","itemListElement":[{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog#listItem","position":1,"name":"Home","item":"https:\/\/www.garysieling.com\/blog","nextItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","name":"Data Mining"}},{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","position":2,"name":"Data Mining","item":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/","nextItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#listItem","name":"Creating N-Gram Indexes with 
Python"},"previousItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog#listItem","name":"Home"}},{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#listItem","position":3,"name":"Creating N-Gram Indexes with Python","previousItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","name":"Data Mining"}}]},{"@type":"Organization","@id":"https:\/\/www.garysieling.com\/blog\/#organization","name":"Gary Sieling","description":"Software Engineer","url":"https:\/\/www.garysieling.com\/blog\/"},{"@type":"Person","@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author","url":"https:\/\/www.garysieling.com\/blog\/author\/gary\/","name":"gary","image":{"@type":"ImageObject","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#authorImage","url":"https:\/\/secure.gravatar.com\/avatar\/0be925276d848ffe98a6a9dc8cf33e67?s=96&d=identicon&r=g","width":96,"height":96,"caption":"gary"}},{"@type":"WebPage","@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#webpage","url":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/","name":"Creating N-Gram Indexes with Python - Gary Sieling","description":"\"Natural Language Processing with Python\" (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,","inLanguage":"en-US","isPartOf":{"@id":"https:\/\/www.garysieling.com\/blog\/#website"},"breadcrumb":{"@id":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/#breadcrumblist"},"author":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"creator":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"datePublished":"2013-07-09T11:39:42+00:00","dateModified":"2013-07-09T11:39:42+00:00"},{"@type":"WebSite","@id":"https:\/\/www.garysieling.com\/blog\/#website","url":"https:\/\/www.garysieling.com\/blog\/","name":"Gary Sieling","description":"Software Engineer","inLanguage":"en-US","publisher":{"@id":"https:\/\/www.garysieling.com\/blog\/#organization"}}]},"og:locale":"en_US","og:site_name":"Gary Sieling - Software Engineer","og:type":"article","og:title":"Creating N-Gram Indexes with Python - Gary Sieling","og:description":"&quot;Natural Language Processing with Python&quot; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text,","og:url":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/","article:published_time":"2013-07-09T11:39:42+00:00","article:modified_time":"2013-07-09T11:39:42+00:00","twitter:card":"summary_large_image","twitter:title":"Creating N-Gram Indexes with Python - Gary Sieling","twitter:description":"&quot;Natural Language Processing with Python&quot; (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time - I have a corpus of legal texts, and build a set of n-gram indices from it. 
The first index is a list of just tokenized text,"},"aioseo_meta_data":{"post_id":"1361","title":null,"description":null,"keywords":null,"keyphrases":null,"primary_term":null,"canonical_url":null,"og_title":null,"og_description":null,"og_object_type":"default","og_image_type":"default","og_image_url":null,"og_image_width":null,"og_image_height":null,"og_image_custom_url":null,"og_image_custom_fields":null,"og_video":null,"og_custom_url":null,"og_article_section":null,"og_article_tags":null,"twitter_use_og":false,"twitter_card":"default","twitter_image_type":"default","twitter_image_url":null,"twitter_image_custom_url":null,"twitter_image_custom_fields":null,"twitter_title":null,"twitter_description":null,"schema":{"blockGraphs":[],"customGraphs":[],"default":{"data":{"Article":[],"Course":[],"Dataset":[],"FAQPage":[],"Movie":[],"Person":[],"Product":[],"ProductReview":[],"Car":[],"Recipe":[],"Service":[],"SoftwareApplication":[],"WebPage":[]},"graphName":"","isEnabled":true},"graphs":[]},"schema_type":"default","schema_type_options":null,"pillar_content":false,"robots_default":true,"robots_noindex":false,"robots_noarchive":false,"robots_nosnippet":false,"robots_nofollow":false,"robots_noimageindex":false,"robots_noodp":false,"robots_notranslate":false,"robots_max_snippet":null,"robots_max_videopreview":null,"robots_max_imagepreview":"large","priority":null,"frequency":null,"local_seo":null,"limit_modified_date":false,"created":"2023-02-04 16:18:28","updated":"2026-07-06 00:58:49","ai":null,"breadcrumb_settings":null,"seo_analyzer_scan_date":null},"aioseo_breadcrumb":"<div class=\"aioseo-breadcrumbs\"><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.garysieling.com\/blog\" title=\"Home\">Home<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/\" title=\"Data Mining\">Data Mining<\/a>\n\t\t<\/span><span 
class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\tCreating N-Gram Indexes with Python\n\t\t<\/span><\/div>","aioseo_breadcrumb_json":[{"label":"Home","link":"https:\/\/www.garysieling.com\/blog"},{"label":"Data Mining","link":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/"},{"label":"Creating N-Gram Indexes with Python","link":"https:\/\/www.garysieling.com\/blog\/creating-n-gram-indexes-with-python\/"}],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1361"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1361"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1361\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}