Compute TF-IDF in Python with Google N-grams dataset

If you import the Google N-Grams data into Postgres, you can use it to compute TF-IDF measures on documents.
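
The queries later in this post assume the 1-gram data has been loaded into one table per leading letter (ngrams_a, ngrams_b, and so on), each holding the n-gram, year, match count, and volume count. Your import will likely look different; as a rough sketch of the assumed layout (the table and column names here only need to match what the later queries expect), loading one of Google's tab-separated files might look like:

import psycopg2

# Sketch only: adjust to however you actually imported the n-grams data.
# Google's 1-gram files are tab-separated: ngram, year, match_count, volume_count.
def load_ngrams(letter, tsv_path):
  conn = psycopg2.connect(
    "dbname='postgres' user='postgres' host='localhost' password='postgres'")
  cur = conn.cursor()

  cur.execute(
    "create table if not exists ngrams_" + letter +
    " (ngram text, year int, match_count bigint, volume_count bigint)")

  # COPY in text format expects tab-separated columns, matching the source files
  with open(tsv_path, encoding="utf8") as f:
    cur.copy_expert("copy ngrams_" + letter + " from stdin", f)

  conn.commit()
  conn.close()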

In my environment, I have talk transcripts stored in JSON files. In this example, I’ll show how to measure the distance between these and a word list (e.g. “I”, “me”, “my”, “myself”, “mine”, etc.).

import json

def get_transcript(theFile):
  try:
    # "path" is the directory that holds the transcript JSON files
    with open(path + theFile, encoding="utf8") as json_data:
      d = json.load(json_data)
      return d["transcript_s"]
  except (OSError, ValueError, KeyError) as e:
    print("Error reading " + theFile + ": " + str(e))

  return None

Once we have a transcript, we need to tokenize the text into words. The easiest way to do this is with NLTK, since it offers several different tokenizers to choose from.

from nltk.tokenize import RegexpTokenizer
from collections import defaultdict

def get_tokens(text):
  # match words, dollar amounts, or any other run of non-whitespace characters
  tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
  return tokenizer.tokenize(text)

def get_counts(tokens):
  counts = defaultdict(int)

  for curr in tokens:
    counts[curr] += 1

  return counts
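
For example, running these two helpers on a sample sentence shows that punctuation comes through as its own token:

# quick check of the tokenizer and counter on a made-up sentence
sample = "I told myself I would finish my talk on time."
tokens = get_tokens(sample)
counts = get_counts(tokens)

print(tokens)       # ['I', 'told', 'myself', 'I', 'would', 'finish', 'my', 'talk', 'on', 'time', '.']
print(counts["I"])  # 2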

Before we compute TF-IDF, we need to know how often each word occurs in the N-Grams dataset. Since the same words come up again and again, the important thing here is to memoize the results.

import psycopg2

seen_tokens = {}

def get_docs_with_token(token):
  if token in seen_tokens:
    return seen_tokens[token]

  conn = psycopg2.connect(
    "dbname='postgres' "
    "user='postgres' "
    "host='localhost' "
    "password='postgres'")
  cur = conn.cursor()

  # the n-grams are split into one table per leading letter
  table = token[0].lower()
  cur.execute(
    "select volume_count from ngrams_" + table +
    " where year = 2008 and ngram = %s",
    (token,))

  rows = cur.fetchall()
  result = 0
  for row in rows:
    result = row[0]

  conn.close()

  seen_tokens[token] = result

  return result

Once we have this, we can define the TF-IDF function for one term in our search. Strangely, the “log” function in Python is a natural log (there is no “ln” like you might expect). There are some options here: you may wish to dampen the values (“Relevant Search” says that Lucene takes the square root of values).

Note also that we’re using “volumes” reported by Google n-grams as the number of documents in the “full” set. I’ve hard-coded the max # of documents in that set, since there is no point querying for this, but if you wanted to re-execute this computation for every year in the dataset, it would need to be an array or a SQL query.

import math

def tfidf_token(search_token, all_tokens, all_token_counts):
  total_terms = len(all_tokens)
  term_count = all_token_counts[search_token]

  # total number of volumes in the n-grams set for 2008 (hard-coded, see above)
  total_docs = 206272

  tf = 1.0 * term_count / total_terms

  # treat tokens the n-grams set has never seen as occurring in one volume,
  # so the division below can't blow up
  docs_with_term = max(get_docs_with_token(search_token), 1)

  idf = math.log(1.0 * total_docs / docs_with_term)

  return tf * idf
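
If you do want the square-root dampening mentioned above, a minimal variant (a sketch only, applied here to the normalized term frequency rather than the raw count Lucene uses) would be:

def tfidf_token_dampened(search_token, all_tokens, all_token_counts):
  # same as tfidf_token, but dampen the term frequency with a square root
  total_terms = len(all_tokens)
  term_count = all_token_counts[search_token]
  total_docs = 206272

  tf = math.sqrt(1.0 * term_count / total_terms)

  docs_with_term = max(get_docs_with_token(search_token), 1)
  idf = math.log(1.0 * total_docs / docs_with_term)

  return tf * idf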

Once we have this, it’s a trivial exercise to get the score for each search term and sum the values up:

def tfidf_search(search, file):
  transcript = get_transcript(file)
  all_tokens = get_tokens(transcript)
  all_token_counts = get_counts(all_tokens)
 
  vals = [tfidf_token(token, all_tokens, all_token_counts) for token in search]
 
  print(vals)
 
  score = sum(vals)
 
  print(score)
 
  return score

Once we’ve done this, all sorts of interesting possibilities are now available.

 
personal = ["I", "i", "Me", "me", "My", "my", "myself", "Myself"]

# "files" is the list of transcript filenames in the directory "path" points at
for file in files:
  tfidf_search(personal, file)