{"id":5011,"date":"2016-08-29T00:39:25","date_gmt":"2016-08-29T00:39:25","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=5011"},"modified":"2016-08-29T00:39:25","modified_gmt":"2016-08-29T00:39:25","slug":"compute-tf-idf-in-python","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/compute-tf-idf-in-python\/","title":{"rendered":"Compute TF-IDF in Python with Google N-grams dataset"},"content":{"rendered":"<p>If you <a href=\"https:\/\/www.garysieling.com\/blog\/import-google-ngrams-data-postgres\">import Google N-Grams data into Postgres<\/a>, you can use this to compute TF-IDF measures on documents. <\/p>\n<p>In my environment, I have talk transcripts stored in JSON files. In this example, I&#8217;ll show how to measure the distance between these and a word list (e.g. &#8220;I, me, my, myself, mine&#8221; etc).<\/p>\n<pre lang=\"python\">\nimport json\n\ndef get_transcript(theFile):\n  try:\n    with open(path + theFile, encoding=\"utf8\") as json_data:\n      d = json.load(json_data)\n      json_data.close()\n      return d[\"transcript_s\"]\n  except:\n    print(\"Found error\")\n  \n  return null\n<\/pre>\n<p>Once we have a transcript we need to tokenize the text into words. The best way to do this is to use NLTK, since it has a lot of choices for how to go about doing this.<\/p>\n<pre lang=\"python\">\nfrom nltk.tokenize import RegexpTokenizer\nfrom collections import defaultdict\n\ndef get_tokens(text):\n  tokenizer = RegexpTokenizer('\\w+|\\$[\\d\\.]+|\\S+')\n  return [t for t in tokenizer.tokenize(text)]\n\ndef get_counts(tokens):\n  counts = defaultdict(int)\n\n  for curr in tokens:\n    counts[curr] += 1\n\n  return counts\n<\/pre>\n<p>Before we comput TF-IDF, we need to know how often each word occurs in the N-Grams dataset. The important thing with this is to memoize the results.<\/p>\n<pre lang=\"python\">\nimport psycopg2\n\nseen_tokens = {}\n\ndef get_docs_with_token(token):\n  if token in seen_tokens:\n    return seen_tokens[token]\n\n  conn = psycopg2.connect( \\\n    \"dbname='postgres' \" + \\\n    \"user='postgres' \" + \\\n    \"host='localhost' \" \\ \n    \"password='postgres'\")\n  cur = conn.cursor()\n\n  table = token[0].lower()\n  cur.execute(\\\n    \"select volume_count from ngrams_\" + \\\n    table + \" where year = 2008 and ngram = '\" + \\\n    token + \"'\")\n\n  rows = cur.fetchall()\n  result = 0\n  for row in rows:\n    result = row[0]\n\n  seen_tokens[token] = result;\n\n  return result\n<\/pre>\n<p>Once we have this, we can define the tf-idf function for one term in our search. Strangely, the &#8220;log&#8221; function in python is a natural log (there is no &#8220;ln&#8221; like you might expect). THere are some options here &#8211; you may wish to dampen the values (<a href=\"http:\/\/amzn.to\/2bK8eK5\">&#8220;Relevant Search&#8221;<\/a> says that Lucene takes the square root of values)<\/p>\n<p>Note also that we&#8217;re using &#8220;volumes&#8221; reported by Google n-grams as the number of documents in the &#8220;full&#8221; set. 
I&#8217;ve hard-coded the max # of documents in that set, since there is no point querying for this, but if you wanted to re-execute this computation for every year in the dataset, it would need to be an array or a SQL query.<\/p>\n<pre lang=\"python\">\ndef tfidf_token(search_token, all_tokens, all_token_counts):\n  total_terms = len(all_tokens)\n  term_count = all_token_counts[search_token]\n\n  total_docs = 206272\n\n  tf = 1.0 * term_count \/ total_terms\n   \n  docs_with_term = get_docs_with_term(search_token)\n\n  idf = math.log(1.0 * total_docs \/ docs_with_term)\n  \n  tfidf = tf * idf\n\n  return tf * idf\n<\/pre>\n<p>Once we have this it&#8217;s a trivial exercise to get the score for each search term, and sum them up:<\/p>\n<pre lang=\"python\">\ndef tfidf_search(search, file):\n  transcript = get_transcript(file)\n  all_tokens = get_tokens(transcript)\n  all_token_counts = get_counts(all_tokens)\n\n  vals = [tfidf_token(token, all_tokens, all_token_counts) for token in search]\n\n  print(vals)\n\n  score = sum(vals)\n\n  print(score)\n\n  return score\n<\/pre>\n<p>Once we&#8217;ve done this, all sorts of interesting possibilities are now available.<\/p>\n<pre lang=\"python\">\n\npersonal = [\"I\", \"i\", \"Me\", \"me\", \"My\", \"my\", \"myself\", \"Myself\"]\nfor file in files:\n  tfidf_search(personal, file)\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Compute TF-IDF scores in Python using the Google N-grams dataset<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[12],"tags":[384,385,447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5011"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=5011"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5011\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=5011"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=5011"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=5011"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}