Creating N-Gram Indexes with Python

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indexes ahead of time: I have a corpus of legal texts, and I build a set of n-gram indexes from it.

The first index is simply the tokenized text, with the contents of all files combined into one list. You may wish to add a special token that marks where each file ends (a sketch of this follows the script below).

import nltk;
import os;
import re;
 
all_tokens = []
idx = 0
for root, directories, filenames in os.walk('.'):
  for file in filenames:
    if file.endswith('.txt'):
      idx = idx + 1
      # join with root so files in subdirectories open correctly
      contents = open(os.path.join(root, file))
      raw = contents.read()
      contents.close()
      lc_raw = raw.lower()
      new_tokens = nltk.word_tokenize(lc_raw)
      # keep alphabetic words under 20 characters, plus any
      # single-character token (e.g. punctuation)
      new_tokens_filtered = [w for w in new_tokens
                             if len(w) < 20 and
                             (re.search('^[a-zA-Z]+$', w) or len(w) == 1)]
      all_tokens = all_tokens + new_tokens_filtered
 
for token in all_tokens:
    print(token)

The list comprehension filters the tokens; depending on the situation you may wish to modify it (e.g. to keep or drop punctuation).
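
If you do want the end-of-file marker mentioned above, here is a minimal sketch; the '<eof>' sentinel and the add_file_tokens helper are my names for illustration, not part of the original script. Because the filter only passes alphabetic words and single characters, a multi-character token containing angle brackets can never collide with real text.

FILE_BOUNDARY = '<eof>'
 
def add_file_tokens(all_tokens, new_tokens_filtered):
  # append one file's tokens, then the sentinel, so n-grams built
  # later never silently span two documents
  all_tokens.extend(new_tokens_filtered)
  all_tokens.append(FILE_BOUNDARY)
  return all_tokens
 
print(add_file_tokens([], ['the', 'quick', 'fox']))
# ['the', 'quick', 'fox', '<eof>']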

The following script turns the token index into an n-gram index; it reads the first script's output, saved here as a file named all_tokens5 (e.g. by redirecting standard output). It runs quickly as long as the counts fit in memory, and it could easily be extended to work on parts of the document set and then combine the results by adding counts in a map-reduce step, especially if the output were sorted (a merging sketch follows the script). One surprising result is that the 5-gram index can be quite a bit larger than the original data.

# the token list produced by the first script, one token per line
contents = open('all_tokens5')
 
# sliding window holding the five most recent tokens
n1_gram_word = ' '
n2_gram_word = ' '
n3_gram_word = ' '
n4_gram_word = ' '
n5_gram_word = ' '
 
n1_counts = {}
n2_counts = {}
n3_counts = {}
n4_counts = {}
n5_counts = {}
 
index = 0
 
def incr(c, w):
  # bump the count for w, starting from zero if unseen
  c[w] = c.get(w, 0) + 1
 
for word in contents:
  index = index + 1
 
  if index % 10000 == 0:
    print("Processed %d words" % index)
 
  # shift the window; rstrip (rather than word[:-1]) avoids losing the
  # last character when the final line has no trailing newline
  (n5_gram_word, n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word) = \
    (n4_gram_word, n3_gram_word, n2_gram_word, n1_gram_word, word.rstrip('\n'))
 
  n1_gram = n1_gram_word
  n2_gram = n2_gram_word + ' ' + n1_gram
  n3_gram = n3_gram_word + ' ' + n2_gram
  n4_gram = n4_gram_word + ' ' + n3_gram
  n5_gram = n5_gram_word + ' ' + n4_gram
 
  incr(n1_counts, n1_gram)
  # skip the partial n-grams produced before the window has filled
  if index >= 2:
    incr(n2_counts, n2_gram)
  if index >= 3:
    incr(n3_counts, n3_gram)
  if index >= 4:
    incr(n4_counts, n4_gram)
  if index >= 5:
    incr(n5_counts, n5_gram)
 
contents.close()
 
def save_ngram(c, f):
  output = open(f, 'w')
  # sort by ascending count, matching the original cmp-based ordering
  ordered = sorted(c.items(), key=lambda kv: kv[1])
  for gram, count in ordered:
    output.write('%s %d\n' % (gram, count))
  output.close()
 
save_ngram(n1_counts, '1gram')
save_ngram(n2_counts, '2gram')
save_ngram(n3_counts, '3gram')
save_ngram(n4_counts, '4gram')
save_ngram(n5_counts, '5gram')
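
To illustrate the map-reduce extension: if each partition of the document set produces its own count table, the reduce step is just addition of counts, key by key. A minimal sketch using collections.Counter (suggested in the comments below; the partition counts here are made up for illustration):

from collections import Counter
 
# counts from two hypothetical partitions of the document set
part1 = Counter({'the law': 10, 'of the': 7})
part2 = Counter({'the law': 4, 'in the': 2})
 
# the reduce step: Counter addition sums the counts key by key
merged = part1 + part2
print(merged.most_common(3))
# [('the law', 14), ('of the', 7), ('in the', 2)]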


6 comments

#1 Matías on 07.10.13 at 7:27 pm

Isn’t there an n-gram building tool in NLTK?

#2 Gary on 07.11.13 at 11:27 am

Yes, there is, and I used that version first. I wrote my own because I was running into memory issues and wanted to control how it worked, and it’s pretty simple to write anyway.
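
For reference, a minimal use of NLTK's helper (nltk.util.ngrams, counted here with collections.Counter) looks roughly like this:

from collections import Counter
from nltk.util import ngrams
 
tokens = ['the', 'law', 'of', 'the', 'land']
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# [(('the', 'law'), 1)]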

#3 Jeff on 07.11.13 at 1:01 pm

You are aware that the semicolons at the end of your import statements are ugly and useless?

#4 Gary on 07.12.13 at 2:45 am

Habit :)

#5 exhuma on 07.12.13 at 8:14 am

You might want to have a look at this:

http://docs.python.org/2/library/collections.html#collections.Counter

#6 Gary on 07.12.13 at 12:03 pm

Thanks!
