{"id":1435,"date":"2013-07-19T04:17:34","date_gmt":"2013-07-19T04:17:34","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1435"},"modified":"2020-03-31T00:46:32","modified_gmt":"2020-03-31T00:46:32","slug":"exploring-zipfs-law-with-python-nltk-scipy-and-matplotlib","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/exploring-zipfs-law-with-python-nltk-scipy-and-matplotlib\/","title":{"rendered":"Exploring Zipf&#8217;s Law with Python, NLTK, SciPy, and Matplotlib"},"content":{"rendered":"<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Zipf's_law\">Zipf&#8217;s Law<\/a> states that the frequency of a word in a corpus of text is inversely proportional to its rank &#8211; a pattern first noticed in the 1930s. Unlike a &#8220;law&#8221; in the sense of mathematics or physics, this is purely an observation, without a strong explanation that I can find for its causes.<\/p>\n<p>We can explore this concept fairly simply on a bit of text using <a href=\"http:\/\/nltk.org\/\">NLTK<\/a>, which provides handy APIs for accessing and processing text. There is a good textbook on the subject, &#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0596516495&#038;linkCode=as2&#038;tag=thesecrelifeo-20\">Natural Language Processing with Python<\/a>&#8221; (<a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\">read my review<\/a>), with lots of motivating examples like this.<\/p>\n<pre lang=\"python\">\nimport nltk\nfrom nltk.corpus import reuters\nfrom nltk.corpus import wordnet\n\nreuters_words = [w.lower() for w in reuters.words()]\nwords = set(reuters_words)\ncounts = [(w, reuters_words.count(w)) for w in words]\n<\/pre>\n<p>The first step of counting word frequencies takes quite a while (an hour or two), since <code>list.count<\/code> rescans the entire corpus once for every unique word. 
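The per-word list.count() calls are what make that counting step quadratic in the corpus size; a single pass with collections.Counter from the standard library produces the same counts in seconds. A minimal sketch of that substitution (my suggestion, not the original post's approach, with a toy word list standing in for reuters.words()):

```python
from collections import Counter

# toy stand-in for reuters.words(); the real corpus has ~1.7M tokens
reuters_words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']

# one pass over the corpus instead of one full scan per unique word
counts = list(Counter(reuters_words).items())

print(sorted(counts, key=lambda t: -t[1])[0])  # ('the', 3)
```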
Here are the top tokens (obviously not all of them are words):<\/p>\n<pre lang=\"python\">\n>>> [(w, c) for (w, c) in counts if c > 5000]\n[('.', 94687), ('s', 15680), ('with', 6179), (\"'\", 11272), \n('>', 7449), ('year', 7529), ('000', 10277), ('loss', 5124), \n('u', 6392), ('pct', 9810), ('\"', 6816), ('from', 8217), \n('for', 13782), ('2', 6528), ('at', 7017), ('be', 6357), \n('the', 69277), (';', 8762), ('he', 5215), ('net', 6989), \n('is', 7668), ('it', 11104), ('in', 29253), ('billion', 5829), \n('lt', 8696), ('-', 13705), ('of', 36779), ('&', 8698), \n('to', 36400), ('vs', 14341), ('was', 5816), ('1', 9977),\n ('and', 25648), ('dlrs', 12417), ('by', 7101), ('its', 7402), \n('mln', 18623), ('cts', 8361), ('on', 9244), ('that', 7540), \n('3', 5091), ('a', 25103), (',', 72360), ('said', 25383), \n('will', 5952)]\n<\/pre>\n<p>Next, I joined this to WordNet data &#8211; while exploring Zipf&#8217;s law, I want to test the hypothesis that common words are more likely to be irregular. We also generate the rankings of each word in the set, both by frequency and by number of &#8220;synsets&#8221; in WordNet (synsets are sets of meanings):<\/p>\n<pre lang=\"python\">\nimport scipy.stats as ss\n\namb = [(w, c, len(wordnet.synsets(w)))\n    for (w, c) in counts if len(wordnet.synsets(w)) > 0]\n\namb_p_rank = ss.rankdata([p for (w, c, p) in amb])\namb_c_rank = ss.rankdata([c for (w, c, p) in amb])\n\namb_ranked = [(w, c, p, pr, cr)\n    for ((w, c, p), pr, cr) in zip(amb, amb_p_rank, amb_c_rank)]\n\namb_ranked[100:110]\nOut[37]: \n[('regulator', 2, 3, 8945.0, 5500.0),\n ('friend', 2, 5, 12344.0, 5500.0),\n ('feeling', 19, 19, 16810.0, 12979.0),\n ('sustaining', 7, 7, 14215.0, 10282.5),\n ('spectrum', 8, 2, 6142.0, 10684.5),\n ('consenting', 1, 2, 6142.0, 2218.5),\n ('resignations', 3, 3, 8945.0, 7215.5),\n ('dozen', 11, 2, 6142.0, 11554.0),\n ('affairs', 75, 5, 12344.0, 15397.5),\n ('mostly', 57, 2, 6142.0, 15038.0)]\n\n<\/pre>\n<p>The <code>amb_ranked<\/code> list combines the data values into one big list of flat tuples, for convenience. 
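The repeated rank values in the output above (e.g. 6142.0 for every word with two synsets) come from how scipy.stats.rankdata handles ties: tied items all receive the average of the ranks they would span. A small sketch of that behavior on toy data (not the Reuters counts):

```python
import scipy.stats as ss

# the three items tied at 2 occupy ranks 2, 3, and 4, so each
# receives the average rank (2 + 3 + 4) / 3 = 3.0
synset_counts = [1, 2, 2, 2, 5]
print(ss.rankdata(synset_counts))  # [1. 3. 3. 3. 5.]
```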
To do a quick test of Zipf&#8217;s law, we can check that the frequency rank correlates with the log of the count:<\/p>\n<pre lang=\"python\">\nimport math\nimport numpy\n\nnumpy.corrcoef(amb_c_rank, [math.log(c) for (w, c, p) in amb])\nOut[106]: \narray([[ 1.        ,  0.95322962],\n       [ 0.95322962,  1.        ]])\n<\/pre>\n<p>We can also demonstrate this a different way. First, we sort the records by occurrence:<\/p>\n<pre lang=\"python\"> \namb_ranked_sorted = sorted(amb_ranked, key=lambda t: t[1])  # sort by count\n<\/pre>\n<p>And then we take some samples. Note how moving about 3x farther down the ranking cuts the frequency to roughly 1\/3 (this will vary based on how the rank is counted, especially at the long tail where there are many matching counts in a row):<\/p>\n<pre lang=\"python\">\namb_ranked_sorted[-50]\nOut[121]: ('profit', 2960, 4, 10907.0, 17128.0)\n\namb_ranked_sorted[-150]\nOut[122]: ('major', 1000, 13, 16262.0, 17028.0)\n\namb_ranked_sorted[-450]\nOut[123]: ('goods', 408, 4, 10907.0, 16728.0)\n\namb_ranked_sorted[-1350]\nOut[124]: ('hope', 113, 9, 15215.0, 15827.5)\n<\/pre>\n<p>Finally, we can plot this:<\/p>\n<pre lang=\"python\">\nimport math\nimport matplotlib.pyplot as plt\n\n# reverse the ranks so the most frequent word has rank 1\nrev = [len(amb_c_rank) - r + 1 for r in amb_c_rank]\n\nplt.plot([math.log(r) for r in rev], [math.log(c) for (w, c, p) in amb], 'ro')\nplt.show()\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2013\/07\/zipfs-law-1.png\" alt=\"zipfs-law-1\" width=\"567\" height=\"444\" class=\"aligncenter size-full wp-image-1436\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2013\/07\/zipfs-law-1.png 567w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2013\/07\/zipfs-law-1-300x235.png 300w\" sizes=\"(max-width: 567px) 100vw, 567px\" \/><\/p>\n<p>This isn&#8217;t exactly straight, especially at the end, likely due to how ranks are computed, but it&#8217;s close. 
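For an ideal Zipf distribution the log-log plot is a straight line with slope -1, so another check is to fit that slope directly. A sketch using numpy.polyfit on synthetic, perfectly Zipfian counts (my illustration, not the Reuters data):

```python
import math
import numpy

# synthetic counts following an exact Zipf curve: count = 10000 / rank
ranks = range(1, 1001)
counts = [10000.0 / r for r in ranks]

log_r = [math.log(r) for r in ranks]
log_c = [math.log(c) for c in counts]

# the slope of the fitted log-log line is the negative Zipf exponent
slope, intercept = numpy.polyfit(log_r, log_c, 1)
print(round(slope, 3))  # -1.0
```

Running the same fit on real corpus ranks typically gives a slope near, but not exactly, -1.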
Now we can plot the number of &#8220;meanings&#8221; in WordNet vs. usage in the Reuters corpus:<\/p>\n<pre lang=\"python\">\nplt.plot([c for (w,c,p) in amb], [p for (w,c,p) in amb], 'bs')\nOut[150]: [<matplotlib.lines.Line2D at 0x147ae950>]\n\nplt.show()\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/garysieling.com\/blog\/wp-content\/uploads\/2013\/07\/cluster-578x455.png\" alt=\"cluster\" width=\"578\" height=\"455\" class=\"aligncenter size-large wp-image-1437\" \/><\/p>\n<p>The problem is that there is no notion of context &#8211; for instance, the senses WordNet lists for the common word &#8220;at&#8221; are nouns like the chemical element &#8220;astatine&#8221;, not the preposition:<\/p>\n<pre lang=\"python\">\nwordnet.synsets('at')\nOut[149]: [Synset('astatine.n.01'), Synset('at.n.02')]\n<\/pre>\n<p>Thus, a better technique is to look at verbs only.<\/p>\n<pre lang=\"python\">\ndef wordnet_verbs(w):\n  synsets = wordnet.synsets(w)\n  verbs = [s for s in synsets if s.pos() == 'v']\n  return verbs\n\namb_v = [(w, c, len(wordnet_verbs(w)), len(wordnet.synsets(w)))\n  for (w, c) in counts if len(wordnet_verbs(w)) > 0]\n\namb_v[100:110]\nOut[167]: \n[('shoots', 1, 20, 22),\n ('suffice', 1, 1, 1),\n ('acquainting', 1, 3, 3),\n ('perfumes', 1, 2, 4),\n ('safeguard', 14, 2, 4),\n ('arrays', 4, 2, 6),\n ('crowns', 143, 4, 16),\n ('roll', 21, 18, 33),\n ('intend', 63, 4, 4),\n ('palms', 2, 1, 5)]\n\nplt.plot([c for (w,c,vc,wc) in amb_v], [vc for (w,c,vc,wc) in amb_v], 'bs',\n         [c for (w,c,vc,wc) in amb_v], [wc for (w,c,vc,wc) in amb_v], 'ro')\n<\/pre>\n<p>We can plot these (the y-axis is the number of alternate meanings, the x-axis is frequency; blue is counts of verb meanings, red is counts of all word meanings).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2013\/07\/cluster2.png\" alt=\"cluster2\" width=\"577\" height=\"447\" class=\"aligncenter size-full wp-image-1438\" 
srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2013\/07\/cluster2.png 577w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2013\/07\/cluster2-300x232.png 300w\" sizes=\"(max-width: 577px) 100vw, 577px\" \/><\/p>\n<p>There does seem to be a pattern here, but not the one I expected &#8211; I expected a straight line upwards, with alternate meanings increasing with usage, but the trend is actually the opposite. This likely means that WordNet synset counts are a poor indicator of irregularity. Rather, words which have many meanings may be discouraged in news writing, as they invite ambiguity.<\/p>\n<p>As a final note, let&#8217;s look at what these common words are:<\/p>\n<pre lang=\"python\">\n[(w,c,vc,wc) for (w,c,vc,wc) in amb_v if vc > 40]\nOut[179]: \n[('cut', 905, 41, 70),\n ('makes', 169, 49, 51),\n ('runs', 38, 41, 57),\n ('making', 369, 49, 52),\n ('gave', 208, 44, 44),\n ('breaks', 14, 59, 75),\n ('take', 745, 42, 44),\n ('broke', 40, 59, 60),\n ('run', 125, 41, 57),\n ('give', 303, 44, 45),\n ('cuts', 319, 41, 61),\n ('broken', 36, 59, 72),\n ('giving', 99, 44, 48),\n ('took', 140, 42, 42),\n ('breaking', 19, 59, 60),\n ('takes', 91, 42, 44),\n ('taken', 244, 42, 44),\n ('make', 592, 49, 51),\n ('taking', 197, 42, 44),\n ('given', 370, 44, 47),\n ('gives', 47, 44, 45),\n ('made', 987, 49, 52),\n ('break', 76, 59, 75),\n ('giveing', 1, 44, 44),\n ('cutting', 122, 41, 54),\n ('running', 73, 41, 52),\n ('ran', 29, 41, 41)]\n<\/pre>\n<p>&#8220;Cut&#8221; has a staggering 70 entries in WordNet &#8211; many, but not all, of them verbs. &#8220;Has&#8221;, by contrast, has merely twenty. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Zipf&#8217;s Law states that the frequency of a word in a corpus of text is inversely proportional to its rank &#8211; a pattern first noticed in the 1930s. Unlike a &#8220;law&#8221; in the sense of mathematics or physics, this is purely an observation, without a strong explanation that I can find for its causes. 
We can explore this concept &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/exploring-zipfs-law-with-python-nltk-scipy-and-matplotlib\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Exploring Zipf&#8217;s Law with Python, NLTK, SciPy, and Matplotlib&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,6],"tags":[385,447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1435"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1435"}],"version-history":[{"count":1,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1435\/revisions"}],"predecessor-version":[{"id":6507,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1435\/revisions\/6507"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1435"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}