Compute TF-IDF in Python with the Google N-grams dataset

If you import the Google N-grams data into Postgres, you can use it as a background corpus to compute TF-IDF scores for your own documents.
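
The queries later in this post assume one table per leading letter (ngrams_a, ngrams_b, and so on), each with at least ngram, year and volume_count columns. If you need a starting point, here is a rough sketch of that layout (adjust the types to match however you imported the data):

import psycopg2

conn = psycopg2.connect(
  "dbname='postgres' user='postgres' host='localhost' password='postgres'")
cur = conn.cursor()

# One table per leading character, e.g. ngrams_a holds 1-grams starting with "a".
# The raw Google export also includes a match_count column, which isn't needed here.
cur.execute("""
  create table if not exists ngrams_a (
    ngram text,
    year int,
    volume_count bigint
  )
""")

# An index on (ngram, year) keeps the per-token lookups below fast.
cur.execute("create index if not exists ngrams_a_idx on ngrams_a (ngram, year)")
conn.commit()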

In my environment, I have talk transcripts stored in JSON files. In this example, I'll show how to score each transcript against a word list (e.g. "I, me, my, myself, mine", etc.).

import json

def get_transcript(theFile):
  # "path" is the directory holding the transcript JSON files (defined elsewhere).
  try:
    with open(path + theFile, encoding="utf8") as json_data:
      d = json.load(json_data)
      return d["transcript_s"]
  except (OSError, json.JSONDecodeError, KeyError) as e:
    print("Error reading " + theFile + ": " + str(e))

  return None

Once we have a transcript, we need to tokenize the text into words. NLTK is a good fit here, since it offers a lot of choices for how to do this; the RegexpTokenizer below is just one of them.

from nltk.tokenize import RegexpTokenizer
from collections import defaultdict

def get_tokens(text):
  # Match words, dollar amounts (e.g. $9.99), or any other run of non-whitespace.
  tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
  return tokenizer.tokenize(text)

def get_counts(tokens):
  counts = defaultdict(int)

  for curr in tokens:
    counts[curr] += 1

  return counts
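
A quick sanity check of these two helpers on a made-up sentence:

tokens = get_tokens("I told myself I would finish my talk on time.")
counts = get_counts(tokens)

print(counts["I"])   # 2
print(counts["my"])  # 1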

Before we compute TF-IDF, we need to know how often each word occurs in the N-grams dataset. The important thing here is to memoize the results, so that Postgres is only queried once per distinct token.

import psycopg2

seen_tokens = {}

def get_docs_with_token(token):
  # Memoize: only hit Postgres once per distinct token.
  if token in seen_tokens:
    return seen_tokens[token]

  conn = psycopg2.connect(
    "dbname='postgres' "
    "user='postgres' "
    "host='localhost' "
    "password='postgres'")
  cur = conn.cursor()

  # The n-grams are split into one table per leading character. The table name
  # can't be passed as a query parameter, but the ngram value can (and should) be.
  table = token[0].lower()
  cur.execute(
    "select volume_count from ngrams_" + table +
    " where year = 2008 and ngram = %s",
    (token,))

  row = cur.fetchone()
  result = row[0] if row else 0

  cur.close()
  conn.close()

  seen_tokens[token] = result

  return result
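
A quick check that the memoization behaves as expected (the actual count will depend on the data you've imported):

first = get_docs_with_token("myself")   # first call queries Postgres
second = get_docs_with_token("myself")  # answered from the seen_tokens cache
print(first == second)                  # True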

Once we have this, we can define the TF-IDF function for one term in our search. Strangely, Python's math.log is a natural log (there is no "ln" like you might expect). There are some options here – you may wish to dampen the values ("Relevant Search" says that Lucene takes the square root of values).

Note also that we’re using “volumes” reported by Google n-grams as the number of documents in the “full” set. I’ve hard-coded the max # of documents in that set, since there is no point querying for this, but if you wanted to re-execute this computation for every year in the dataset, it would need to be an array or a SQL query.

import math

def tfidf_token(search_token, all_tokens, all_token_counts):
  total_terms = len(all_tokens)
  term_count = all_token_counts[search_token]

  # Hard-coded number of documents ("volumes") in the full set; see the note above.
  total_docs = 206272

  tf = 1.0 * term_count / total_terms

  docs_with_term = get_docs_with_token(search_token)

  # max(..., 1) avoids dividing by zero for tokens that never appear in the n-grams data.
  idf = math.log(1.0 * total_docs / max(docs_with_term, 1))

  return tf * idf
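
If you do want to dampen the term frequency, here is one possible variant to sketch the idea; the tfidf_token_damped name and the choice to take the square root of the normalized term frequency are mine, and Lucene's exact formula differs:

import math

def tfidf_token_damped(search_token, all_tokens, all_token_counts):
  total_terms = len(all_tokens)
  term_count = all_token_counts[search_token]

  total_docs = 206272

  # Square-root damping keeps very frequent terms from dominating the score.
  tf = math.sqrt(1.0 * term_count / total_terms)

  docs_with_term = get_docs_with_token(search_token)
  idf = math.log(1.0 * total_docs / max(docs_with_term, 1))

  return tf * idf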

Once we have this, it's a trivial exercise to get the score for each search term and sum them up:

def tfidf_search(search, file):
  transcript = get_transcript(file)
  all_tokens = get_tokens(transcript)
  all_token_counts = get_counts(all_tokens)

  vals = [tfidf_token(token, all_tokens, all_token_counts) for token in search]

  print(vals)

  score = sum(vals)

  print(score)

  return score

Once we've done this, all sorts of interesting possibilities open up.


personal = ["I", "i", "Me", "me", "My", "my", "myself", "Myself"]
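# "files" is the list of transcript file names, defined elsewhere in my environment.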
for file in files:
  tfidf_search(personal, file)