{"id":1244,"date":"2013-07-08T12:04:50","date_gmt":"2013-07-08T12:04:50","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1244"},"modified":"2013-07-08T12:04:50","modified_gmt":"2013-07-08T12:04:50","slug":"nlp-analysis-using-modal-verbs","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/","title":{"rendered":"NLP Analysis in Python using Modal Verbs"},"content":{"rendered":"<p>Modal verbs are auxiliary verbs which indicate semantic information about an action, such as likelihood (will, should), permission (could, may), or obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything.<\/p>\n<p>&#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0596516495&#038;linkCode=as2&#038;tag=thesecrelifeo-20\">Natural Language Processing with Python<\/a>&#8221; (<a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\">read my review<\/a>) has an example of how to start this process, comparing verb frequencies across various genres of text using the Brown corpus, a well-known collection of texts assembled in the 1960s for language research.<\/p>\n<p>I extended the example to include an additional corpus of court cases, built from the contents of ~15,000 legal documents, and a few extra modal verbs.<\/p>\n<p>We first define a function to retrieve genres of literature, and a second to retrieve words from the genre. For the legal documents, I am reading from an index <a href=\"http:\/\/garysieling.com\/blog\/creating-n-gram-indexes-with-python\">I previously built of n-grams<\/a> (i.e. 
word\/phrase counts).<\/p>\n<pre lang=\"python\">\nimport nltk\nfrom nltk.corpus import brown\n\ndef get_genres():\n  yield 'legal'\n  for genre in brown.categories():\n    yield genre\n\nmodals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']\n\ndef get_words(genre):\n  if genre == 'legal':\n    # each line of the 1-gram index is read as a word and its count, separated by a space\n    with open('1gram', 'r') as grams:\n      for line in grams:\n        vals = line.split(' ')\n        word = vals[0]\n        count = int(vals[1])\n        if word in modals:\n          # repeat a modal once per occurrence so the counts come out right\n          for index in range(0, count):\n            yield word\n        else:\n          # other words are never tabulated, so one copy is enough\n          yield word\n  else:\n    for word in brown.words(categories=genre):\n      yield word\n<\/pre>\n<p>The Natural Language Toolkit provides a class for tracking frequencies of &#8220;experiment&#8221; results &#8211; here each genre is a condition, and we count how often each modal verb appears in it.<\/p>\n<pre lang=\"python\">\ncfd = nltk.ConditionalFreqDist(\n  (genre, word)\n  for genre in get_genres()\n  for word in get_words(genre)\n)\n\ngenres = [g for g in get_genres()]\ncfd.tabulate(conditions=genres, samples=modals)\n<\/pre>\n<p>The tabulate method is provided by NLTK, and makes a nicely formatted chart (in a command line it makes everything line up 
neatly)<\/p>\n<table>\n<th>\n<td>can<\/td>\n<td>could<\/td>\n<td>may<\/td>\n<td>might<\/td>\n<td>must<\/td>\n<td>will<\/td>\n<td>would<\/td>\n<td>should<\/td>\n<\/th>\n<tr>\n<td>legal<\/td>\n<td>13059<\/td>\n<td>7849<\/td>\n<td>26968<\/td>\n<td>1762<\/td>\n<td>15974<\/td>\n<td>20757<\/td>\n<td>19931<\/td>\n<td>13916<\/td>\n<\/tr>\n<tr>\n<td>adventure<\/td>\n<td>46<\/td>\n<td>151<\/td>\n<td>5<\/td>\n<td>58<\/td>\n<td>27<\/td>\n<td>50<\/td>\n<td>191<\/td>\n<td>15<\/td>\n<\/tr>\n<tr>\n<td>belles_lettres<\/td>\n<td>246<\/td>\n<td>213<\/td>\n<td>207<\/td>\n<td>113<\/td>\n<td>170<\/td>\n<td>236<\/td>\n<td>392<\/td>\n<td>102<\/td>\n<\/tr>\n<tr>\n<td>editorial<\/td>\n<td>121<\/td>\n<td>56<\/td>\n<td>74<\/td>\n<td>39<\/td>\n<td>53<\/td>\n<td>233<\/td>\n<td>180<\/td>\n<td>88<\/td>\n<\/tr>\n<tr>\n<td>fiction<\/td>\n<td>37<\/td>\n<td>166<\/td>\n<td>8<\/td>\n<td>44<\/td>\n<td>55<\/td>\n<td>52<\/td>\n<td>287<\/td>\n<td>35<\/td>\n<\/tr>\n<tr>\n<td>government<\/td>\n<td>117<\/td>\n<td>38<\/td>\n<td>153<\/td>\n<td>13<\/td>\n<td>102<\/td>\n<td>244<\/td>\n<td>120<\/td>\n<td>112<\/td>\n<\/tr>\n<tr>\n<td>hobbies<\/td>\n<td>268<\/td>\n<td>58<\/td>\n<td>131<\/td>\n<td>22<\/td>\n<td>83<\/td>\n<td>264<\/td>\n<td>78<\/td>\n<td>73<\/td>\n<\/tr>\n<tr>\n<td>humor<\/td>\n<td>16<\/td>\n<td>30<\/td>\n<td>8<\/td>\n<td>8<\/td>\n<td>9<\/td>\n<td>13<\/td>\n<td>56<\/td>\n<td>7<\/td>\n<\/tr>\n<tr>\n<td>learned<\/td>\n<td>365<\/td>\n<td>159<\/td>\n<td>324<\/td>\n<td>128<\/td>\n<td>202<\/td>\n<td>340<\/td>\n<td>319<\/td>\n<td>171<\/td>\n<\/tr>\n<tr>\n<td>lore<\/td>\n<td>170<\/td>\n<td>141<\/td>\n<td>165<\/td>\n<td>49<\/td>\n<td>96<\/td>\n<td>175<\/td>\n<td>186<\/td>\n<td>76<\/td>\n<\/tr>\n<tr>\n<td>mystery<\/td>\n<td>42<\/td>\n<td>141<\/td>\n<td>13<\/td>\n<td>57<\/td>\n<td>30<\/td>\n<td>20<\/td>\n<td>186<\/td>\n<td>29<\/td>\n<\/tr>\n<tr>\n<td>news<\/td>\n<td>93<\/td>\n<td>86<\/td>\n<td>66<\/td>\n<td>38<\/td>\n<td>50<\/td>\n<td>389<\/td>\n<td>244<\/td>\n<td>59<\/td>\n<\/tr>\n<tr>\n<td>religion<\/td>\n<td>8
2<\/td>\n<td>59<\/td>\n<td>78<\/td>\n<td>12<\/td>\n<td>54<\/td>\n<td>71<\/td>\n<td>68<\/td>\n<td>45<\/td>\n<\/tr>\n<tr>\n<td>reviews<\/td>\n<td>45<\/td>\n<td>40<\/td>\n<td>45<\/td>\n<td>26<\/td>\n<td>19<\/td>\n<td>58<\/td>\n<td>47<\/td>\n<td>18<\/td>\n<\/tr>\n<tr>\n<td>romance<\/td>\n<td>74<\/td>\n<td>193<\/td>\n<td>11<\/td>\n<td>51<\/td>\n<td>45<\/td>\n<td>43<\/td>\n<td>244<\/td>\n<td>32<\/td>\n<\/tr>\n<tr>\n<td>science_fiction<\/td>\n<td>16<\/td>\n<td>49<\/td>\n<td>4<\/td>\n<td>12<\/td>\n<td>8<\/td>\n<td>16<\/td>\n<td>79<\/td>\n<td>3<\/td>\n<\/tr>\n<\/table>\n<p>Looking at these numbers, it is clear that we need to add a concept of normalization. My added corpus has a lot more tokens than the Brown corpus, which makes it hard to compare across genres.<\/p>\n<p>The frequency distribution class exists to count things, and I didn&#8217;t see a good way to normalize the rows. I re-wrote the tabulate function to do this &#8211; it simply sums the modal counts in each row, divides each count by that total, and multiplies by 100, so every cell is a percentage of its row.<\/p>\n<pre lang=\"python\">\nimport sys\n\ndef tabulate(cfd, conditions, samples):\n  max_len = max(len(w) for w in conditions)\n  # header row: one column per modal\n  sys.stdout.write(\" \" * (max_len + 1))\n  for c in samples:\n    sys.stdout.write(\"%-s\\t\" % c)\n  sys.stdout.write(\"\\n\")\n  for c in conditions:\n    sys.stdout.write(\" \" * (max_len - len(c)))\n    sys.stdout.write(\"%-s\" % c)\n    sys.stdout.write(\" \")\n    dist = cfd[c]\n    # total modal count for this genre, used to normalize the row\n    norm = sum(dist[w] for w in samples)\n    for s in samples:\n      value = 100 * dist[s] \/ norm\n      sys.stdout.write(\"%-d\\t\" % value)\n    sys.stdout.write(\"\\n\")\n\ntabulate(cfd, genres, modals)\n<\/pre>\n<p>This makes it easier to scan up and down the 
chart-<\/p>\n<table>\n<th>\n<td>can<\/td>\n<td>could<\/td>\n<td>may<\/td>\n<td>might<\/td>\n<td>must<\/td>\n<td>will<\/td>\n<td>would<\/td>\n<td>should<\/td>\n<\/th>\n<tr>\n<td>legal<\/td>\n<td>10<\/td>\n<td>6<\/td>\n<td>22<\/td>\n<td>1<\/td>\n<td>13<\/td>\n<td>17<\/td>\n<td>16<\/td>\n<td>11<\/td>\n<\/tr>\n<tr>\n<td>adventure<\/td>\n<td>8<\/td>\n<td>27<\/td>\n<td>0<\/td>\n<td>10<\/td>\n<td>4<\/td>\n<td>9<\/td>\n<td>35<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td>belles_lettres<\/td>\n<td>14<\/td>\n<td>12<\/td>\n<td>12<\/td>\n<td>6<\/td>\n<td>10<\/td>\n<td>14<\/td>\n<td>23<\/td>\n<td>6<\/td>\n<\/tr>\n<tr>\n<td>editorial<\/td>\n<td>14<\/td>\n<td>6<\/td>\n<td>8<\/td>\n<td>4<\/td>\n<td>6<\/td>\n<td>27<\/td>\n<td>21<\/td>\n<td>10<\/td>\n<\/tr>\n<tr>\n<td>fiction<\/td>\n<td>5<\/td>\n<td>24<\/td>\n<td>1<\/td>\n<td>6<\/td>\n<td>8<\/td>\n<td>7<\/td>\n<td>41<\/td>\n<td>5<\/td>\n<\/tr>\n<tr>\n<td>government<\/td>\n<td>13<\/td>\n<td>4<\/td>\n<td>17<\/td>\n<td>1<\/td>\n<td>11<\/td>\n<td>27<\/td>\n<td>13<\/td>\n<td>12<\/td>\n<\/tr>\n<tr>\n<td>hobbies<\/td>\n<td>27<\/td>\n<td>5<\/td>\n<td>13<\/td>\n<td>2<\/td>\n<td>8<\/td>\n<td>27<\/td>\n<td>7<\/td>\n<td>7<\/td>\n<\/tr>\n<tr>\n<td>humor<\/td>\n<td>10<\/td>\n<td>20<\/td>\n<td>5<\/td>\n<td>5<\/td>\n<td>6<\/td>\n<td>8<\/td>\n<td>38<\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td>learned<\/td>\n<td>18<\/td>\n<td>7<\/td>\n<td>16<\/td>\n<td>6<\/td>\n<td>10<\/td>\n<td>16<\/td>\n<td>15<\/td>\n<td>8<\/td>\n<\/tr>\n<tr>\n<td>lore<\/td>\n<td>16<\/td>\n<td>13<\/td>\n<td>15<\/td>\n<td>4<\/td>\n<td>9<\/td>\n<td>16<\/td>\n<td>17<\/td>\n<td>7<\/td>\n<\/tr>\n<tr>\n<td>mystery<\/td>\n<td>8<\/td>\n<td>27<\/td>\n<td>2<\/td>\n<td>11<\/td>\n<td>5<\/td>\n<td>3<\/td>\n<td>35<\/td>\n<td>5<\/td>\n<\/tr>\n<tr>\n<td>news<\/td>\n<td>9<\/td>\n<td>8<\/td>\n<td>6<\/td>\n<td>3<\/td>\n<td>4<\/td>\n<td>37<\/td>\n<td>23<\/td>\n<td>5<\/td>\n<\/tr>\n<tr>\n<td>religion<\/td>\n<td>17<\/td>\n<td>12<\/td>\n<td>16<\/td>\n<td>2<\/td>\n<td>11<\/td>\n<td>15<\/td>\n<td>14<\/td>\n<td>9<\/td>\n<
\/tr>\n<tr>\n<td>reviews<\/td>\n<td>15<\/td>\n<td>13<\/td>\n<td>15<\/td>\n<td>8<\/td>\n<td>6<\/td>\n<td>19<\/td>\n<td>15<\/td>\n<td>6<\/td>\n<\/tr>\n<tr>\n<td>romance<\/td>\n<td>10<\/td>\n<td>27<\/td>\n<td>1<\/td>\n<td>7<\/td>\n<td>6<\/td>\n<td>6<\/td>\n<td>35<\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td>science_fiction<\/td>\n<td>8<\/td>\n<td>26<\/td>\n<td>2<\/td>\n<td>6<\/td>\n<td>4<\/td>\n<td>8<\/td>\n<td>42<\/td>\n<td>1<\/td>\n<\/tr>\n<\/table>\n<p>One thing this makes clear is that most genres have numerous references to &#8216;would&#8217; and few have &#8216;should&#8217;.<\/p>\n<p>It might be nice to see these on a scale of 0-10 instead &#8211; with one decimal place every cell has the same width, so the columns are easier to scan.<\/p>\n<pre lang=\"python\">\nimport sys\n\ndef tabulate(cfd, conditions, samples):\n  max_len = max(len(w) for w in conditions)\n  # header row: one column per modal\n  sys.stdout.write(\" \" * (max_len + 1))\n  for c in samples:\n    sys.stdout.write(\"%-s\\t\" % c)\n  sys.stdout.write(\"\\n\")\n  for c in conditions:\n    sys.stdout.write(\" \" * (max_len - len(c)))\n    sys.stdout.write(\"%-s\" % c)\n    sys.stdout.write(\" \")\n    dist = cfd[c]\n    # total modal count for this genre, used to normalize the row\n    norm = sum(dist[w] for w in samples)\n    for s in samples:\n      value = 10 * float(dist[s]) \/ norm\n      sys.stdout.write(\"%.1f\\t\" % value)\n    sys.stdout.write(\"\\n\")\n\ntabulate(cfd, genres, 
modals)\n<\/pre>\n<table>\n<th>\n<td>can<\/td>\n<td>could<\/td>\n<td>may<\/td>\n<td>might<\/td>\n<td>must<\/td>\n<td>will<\/td>\n<td>would<\/td>\n<td>should<\/td>\n<\/th>\n<tr>\n<td>legal<\/td>\n<td>1.1<\/td>\n<td>0.7<\/td>\n<td>2.2<\/td>\n<td>0.1<\/td>\n<td>1.3<\/td>\n<td>1.7<\/td>\n<td>1.7<\/td>\n<td>1.2<\/td>\n<\/tr>\n<tr>\n<td>adventure<\/td>\n<td>0.8<\/td>\n<td>2.8<\/td>\n<td>0.1<\/td>\n<td>1.1<\/td>\n<td>0.5<\/td>\n<td>0.9<\/td>\n<td>3.5<\/td>\n<td>0.3<\/td>\n<\/tr>\n<tr>\n<td>belles_lettres<\/td>\n<td>1.5<\/td>\n<td>1.3<\/td>\n<td>1.2<\/td>\n<td>0.7<\/td>\n<td>1.0<\/td>\n<td>1.4<\/td>\n<td>2.3<\/td>\n<td>0.6<\/td>\n<\/tr>\n<tr>\n<td>editorial<\/td>\n<td>1.4<\/td>\n<td>0.7<\/td>\n<td>0.9<\/td>\n<td>0.5<\/td>\n<td>0.6<\/td>\n<td>2.8<\/td>\n<td>2.1<\/td>\n<td>1.0<\/td>\n<\/tr>\n<tr>\n<td>fiction<\/td>\n<td>0.5<\/td>\n<td>2.4<\/td>\n<td>0.1<\/td>\n<td>0.6<\/td>\n<td>0.8<\/td>\n<td>0.8<\/td>\n<td>4.2<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>government<\/td>\n<td>1.3<\/td>\n<td>0.4<\/td>\n<td>1.7<\/td>\n<td>0.1<\/td>\n<td>1.1<\/td>\n<td>2.7<\/td>\n<td>1.3<\/td>\n<td>1.2<\/td>\n<\/tr>\n<tr>\n<td>hobbies<\/td>\n<td>2.7<\/td>\n<td>0.6<\/td>\n<td>1.3<\/td>\n<td>0.2<\/td>\n<td>0.8<\/td>\n<td>2.7<\/td>\n<td>0.8<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>humor<\/td>\n<td>1.1<\/td>\n<td>2.0<\/td>\n<td>0.5<\/td>\n<td>0.5<\/td>\n<td>0.6<\/td>\n<td>0.9<\/td>\n<td>3.8<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>learned<\/td>\n<td>1.8<\/td>\n<td>0.8<\/td>\n<td>1.6<\/td>\n<td>0.6<\/td>\n<td>1.0<\/td>\n<td>1.7<\/td>\n<td>1.6<\/td>\n<td>0.9<\/td>\n<\/tr>\n<tr>\n<td>lore<\/td>\n<td>1.6<\/td>\n<td>1.3<\/td>\n<td>1.6<\/td>\n<td>0.5<\/td>\n<td>0.9<\/td>\n<td>1.7<\/td>\n<td>1.8<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>mystery<\/td>\n<td>0.8<\/td>\n<td>2.7<\/td>\n<td>0.3<\/td>\n<td>1.1<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<td>3.6<\/td>\n<td>0.6<\/td>\n<\/tr>\n<tr>\n<td>news<\/td>\n<td>0.9<\/td>\n<td>0.8<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<td>0.5<\/td>\n<td>3.8<\/td>\n<td>2.4<\/td>\n<td>0.6<\/t
d>\n<\/tr>\n<tr>\n<td>religion<\/td>\n<td>1.7<\/td>\n<td>1.3<\/td>\n<td>1.7<\/td>\n<td>0.3<\/td>\n<td>1.2<\/td>\n<td>1.5<\/td>\n<td>1.4<\/td>\n<td>1.0<\/td>\n<\/tr>\n<tr>\n<td>reviews<\/td>\n<td>1.5<\/td>\n<td>1.3<\/td>\n<td>1.5<\/td>\n<td>0.9<\/td>\n<td>0.6<\/td>\n<td>1.9<\/td>\n<td>1.6<\/td>\n<td>0.6<\/td>\n<\/tr>\n<tr>\n<td>romance<\/td>\n<td>1.1<\/td>\n<td>2.8<\/td>\n<td>0.2<\/td>\n<td>0.7<\/td>\n<td>0.6<\/td>\n<td>0.6<\/td>\n<td>3.5<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>science_fiction<\/td>\n<td>0.9<\/td>\n<td>2.6<\/td>\n<td>0.2<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<td>0.9<\/td>\n<td>4.2<\/td>\n<td>0.2<\/td>\n<\/tr>\n<\/table>\n<p>It would be nice to see how similar these genres are &#8211; we can compute that by treating the modal counts for each genre as a vector. The angle between two vectors approximates &#8220;similarity&#8221;: the smaller the angle, the more alike the genres. The nice thing about this measure is that it ignores all other words (including words which may only exist in one text &#8211; some of this will be due to how well the data is cleaned, which does not reflect on the genre of literature).<\/p>\n<pre lang=\"python\">\nimport math\n\ndef distance(cfd, conditions, samples, base):\n  base_cond = cfd[base]\n  base_vector = [base_cond[w] for w in samples]\n  base_length = math.sqrt(sum(a * a for a in base_vector))\n  for c in conditions:\n    cond = cfd[c]\n    cond_vector = [cond[w] for w in samples]\n    dotp = sum(a * b for (a,b) in zip(base_vector, cond_vector))\n    cond_length = math.sqrt(sum(a * a for a in cond_vector))\n    # clamp to 1.0 so rounding error cannot push acos out of its domain\n    angle = math.acos(min(1.0, dotp \/ (cond_length * base_length)))\n    # map an angle of 0 to 100 percent and a right angle to 0 percent\n    percent = (math.pi \/ 2 - angle) \/ (math.pi \/ 2) * 100\n    print(\"%-s similarity to %-s: %-.1f\" % (c, base, percent))\n\ndistance(cfd, genres, modals, 'legal')\n<\/pre>\n<p>The results are interesting &#8211; the genres closest to legal in this case are religion, government, and learned.<\/p>\n<p>As an interesting side-note, belles_lettres means &#8220;fine writing&#8221;, i.e. poems, drama, fiction. 
<\/p>\n<pre>\nlegal similarity to legal: 100.0\nadventure similarity to legal: 41.6\nbelles_lettres similarity to legal: 72.4\neditorial similarity to legal: 68.8\nfiction similarity to legal: 42.9\ngovernment similarity to legal: 80.6\nhobbies similarity to legal: 63.5\nhumor similarity to legal: 50.1\nlearned similarity to legal: 80.6\nlore similarity to legal: 78.6\nmystery similarity to legal: 41.3\nnews similarity to legal: 58.1\nreligion similarity to legal: 81.2\nreviews similarity to legal: 73.5\nromance similarity to legal: 42.9\nscience_fiction similarity to legal: 41.8\n<\/pre>\n<p>Some genres appear similar to legal documents &#8211; it is possible, however, that some verbs are not independent. For instance, &#8220;may&#8221; and &#8220;might&#8221; might show nearly the same pattern across genres. One way to test this is to flip what we track for distance (make a vector for each modal, rather than for each genre).<\/p>\n<p>The following code tracks the distance between each modal and the mean, using the different genres as dimensions. Since each of them contributes to the mean somewhat, there is guaranteed to be some similarity, but note that some are closer than others. Note also that these have to be normalized, as in the tabulate examples, or the answer will be dominated by the &#8216;legal&#8217; genre. 
<\/p>\n<pre lang=\"python\">\nimport math\n\ndef distance(cfd, conditions, samples):\n  base_vector = [0.0 for w in conditions]\n  norm = {}\n  for c_i in range(0, len(conditions)):\n    cond_name = conditions[c_i]\n    cond = cfd[cond_name]\n    # total modal count for this genre, used to normalize its counts\n    norm[cond_name] = float(sum(cond[s] for s in samples))\n    for s in samples:\n      base_vector[c_i] = base_vector[c_i] + float(cond[s]) \/ norm[cond_name]\n  base_length = math.sqrt(sum(a * a for a in base_vector))\n  for s in samples: # compute each vector - can, could, etc\n    sample_vector = []\n    for c in conditions: # one dimension per genre\n      sample_vector.append(cfd[c][s] \/ norm[c])\n    dotp = sum(a * b for (a,b) in zip(base_vector, sample_vector))\n    sample_length = math.sqrt(sum(a * a for a in sample_vector))\n    angle = math.acos(dotp \/ (sample_length * base_length))\n    percent = (math.pi \/ 2 - angle) \/ (math.pi \/ 2) * 100\n    print(\"%-s similarity to mean: %-.1f\" % (s, percent))\n\ndistance(cfd, genres, modals)\n<\/pre>\n<p>What I&#8217;d infer from this is that the least helpful verb for distinguishing genres is &#8220;must,&#8221; and the most helpful is &#8220;may.&#8221;<\/p>\n<pre>\ncan similarity to mean: 76.0\ncould similarity to mean: 67.6\nmay similarity to mean: 61.5\nmight similarity to mean: 70.0\nmust similarity to mean: 79.7\nwill similarity to mean: 67.7\nwould similarity to mean: 73.6\nshould similarity to mean: 74.2\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
&#8220;Natural Language Processing with Python&#8221; (read my review) has an &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;NLP Analysis in Python using Modal Verbs&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5,6],"tags":[368,385,386,447,567],"aioseo_notices":[],"aioseo_head":"\n\t\t<!-- All in One SEO 4.9.9 - aioseo.com -->\n\t<meta name=\"description\" content=\"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
&quot;Natural Language Processing with Python&quot; (read my review) has an\" \/>\n\t<meta name=\"robots\" content=\"max-image-preview:large\" \/>\n\t<meta name=\"author\" content=\"gary\"\/>\n\t<link rel=\"canonical\" href=\"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/\" \/>\n\t<meta name=\"generator\" content=\"All in One SEO (AIOSEO) 4.9.9\" \/>\n\t\t<meta property=\"og:locale\" content=\"en_US\" \/>\n\t\t<meta property=\"og:site_name\" content=\"Gary Sieling - Software Engineer\" \/>\n\t\t<meta property=\"og:type\" content=\"article\" \/>\n\t\t<meta property=\"og:title\" content=\"NLP Analysis in Python using Modal Verbs - Gary Sieling\" \/>\n\t\t<meta property=\"og:description\" content=\"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. &quot;Natural Language Processing with Python&quot; (read my review) has an\" \/>\n\t\t<meta property=\"og:url\" content=\"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/\" \/>\n\t\t<meta property=\"article:published_time\" content=\"2013-07-08T12:04:50+00:00\" \/>\n\t\t<meta property=\"article:modified_time\" content=\"2013-07-08T12:04:50+00:00\" \/>\n\t\t<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n\t\t<meta name=\"twitter:title\" content=\"NLP Analysis in Python using Modal Verbs - Gary Sieling\" \/>\n\t\t<meta name=\"twitter:description\" content=\"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
&quot;Natural Language Processing with Python&quot; (read my review) has an\" \/>\n\t\t<script type=\"application\/ld+json\" class=\"aioseo-schema\">\n\t\t\t{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"BlogPosting\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#blogposting\",\"name\":\"NLP Analysis in Python using Modal Verbs - Gary Sieling\",\"headline\":\"NLP Analysis in Python using Modal Verbs\",\"author\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"publisher\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\"},\"datePublished\":\"2013-07-08T12:04:50+00:00\",\"dateModified\":\"2013-07-08T12:04:50+00:00\",\"inLanguage\":\"en-US\",\"commentCount\":2,\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#webpage\"},\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#webpage\"},\"articleSection\":\"Data Mining, Data Science, modals, nlp, nltk, python, vector math\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#breadcrumblist\",\"itemListElement\":[{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog#listItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.garysieling.com\\\/blog\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"name\":\"Data Mining\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"position\":2,\"name\":\"Data 
Mining\",\"item\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#listItem\",\"name\":\"NLP Analysis in Python using Modal Verbs\"},\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog#listItem\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#listItem\",\"position\":3,\"name\":\"NLP Analysis in Python using Modal Verbs\",\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/category\\\/data-mining\\\/#listItem\",\"name\":\"Data Mining\"}}]},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\",\"name\":\"Gary Sieling\",\"description\":\"Software Engineer\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/\",\"name\":\"gary\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#authorImage\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0be925276d848ffe98a6a9dc8cf33e67?s=96&d=identicon&r=g\",\"width\":96,\"height\":96,\"caption\":\"gary\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#webpage\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/\",\"name\":\"NLP Analysis in Python using Modal Verbs - Gary Sieling\",\"description\":\"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\\\/must). 
One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. \\\"Natural Language Processing with Python\\\" (read my review) has an\",\"inLanguage\":\"en-US\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#website\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/nlp-analysis-using-modal-verbs\\\/#breadcrumblist\"},\"author\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"creator\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/author\\\/gary\\\/#author\"},\"datePublished\":\"2013-07-08T12:04:50+00:00\",\"dateModified\":\"2013-07-08T12:04:50+00:00\"},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/\",\"name\":\"Gary Sieling\",\"description\":\"Software Engineer\",\"inLanguage\":\"en-US\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.garysieling.com\\\/blog\\\/#organization\"}}]}\n\t\t<\/script>\n\t\t<!-- All in One SEO -->\n\n","aioseo_head_json":{"title":"NLP Analysis in Python using Modal Verbs - Gary Sieling","description":"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
\"Natural Language Processing with Python\" (read my review) has an","canonical_url":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/","robots":"max-image-preview:large","keywords":"","webmasterTools":{"miscellaneous":""},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"BlogPosting","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#blogposting","name":"NLP Analysis in Python using Modal Verbs - Gary Sieling","headline":"NLP Analysis in Python using Modal Verbs","author":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"publisher":{"@id":"https:\/\/www.garysieling.com\/blog\/#organization"},"datePublished":"2013-07-08T12:04:50+00:00","dateModified":"2013-07-08T12:04:50+00:00","inLanguage":"en-US","commentCount":2,"mainEntityOfPage":{"@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#webpage"},"isPartOf":{"@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#webpage"},"articleSection":"Data Mining, Data Science, modals, nlp, nltk, python, vector math"},{"@type":"BreadcrumbList","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#breadcrumblist","itemListElement":[{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog#listItem","position":1,"name":"Home","item":"https:\/\/www.garysieling.com\/blog","nextItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","name":"Data Mining"}},{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","position":2,"name":"Data Mining","item":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/","nextItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#listItem","name":"NLP Analysis in Python using Modal 
Verbs"},"previousItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog#listItem","name":"Home"}},{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#listItem","position":3,"name":"NLP Analysis in Python using Modal Verbs","previousItem":{"@type":"ListItem","@id":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/#listItem","name":"Data Mining"}}]},{"@type":"Organization","@id":"https:\/\/www.garysieling.com\/blog\/#organization","name":"Gary Sieling","description":"Software Engineer","url":"https:\/\/www.garysieling.com\/blog\/"},{"@type":"Person","@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author","url":"https:\/\/www.garysieling.com\/blog\/author\/gary\/","name":"gary","image":{"@type":"ImageObject","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#authorImage","url":"https:\/\/secure.gravatar.com\/avatar\/0be925276d848ffe98a6a9dc8cf33e67?s=96&d=identicon&r=g","width":96,"height":96,"caption":"gary"}},{"@type":"WebPage","@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#webpage","url":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/","name":"NLP Analysis in Python using Modal Verbs - Gary Sieling","description":"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
\"Natural Language Processing with Python\" (read my review) has an","inLanguage":"en-US","isPartOf":{"@id":"https:\/\/www.garysieling.com\/blog\/#website"},"breadcrumb":{"@id":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/#breadcrumblist"},"author":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"creator":{"@id":"https:\/\/www.garysieling.com\/blog\/author\/gary\/#author"},"datePublished":"2013-07-08T12:04:50+00:00","dateModified":"2013-07-08T12:04:50+00:00"},{"@type":"WebSite","@id":"https:\/\/www.garysieling.com\/blog\/#website","url":"https:\/\/www.garysieling.com\/blog\/","name":"Gary Sieling","description":"Software Engineer","inLanguage":"en-US","publisher":{"@id":"https:\/\/www.garysieling.com\/blog\/#organization"}}]},"og:locale":"en_US","og:site_name":"Gary Sieling - Software Engineer","og:type":"article","og:title":"NLP Analysis in Python using Modal Verbs - Gary Sieling","og:description":"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. &quot;Natural Language Processing with Python&quot; (read my review) has an","og:url":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/","article:published_time":"2013-07-08T12:04:50+00:00","article:modified_time":"2013-07-08T12:04:50+00:00","twitter:card":"summary_large_image","twitter:title":"NLP Analysis in Python using Modal Verbs - Gary Sieling","twitter:description":"Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall\/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. 
&quot;Natural Language Processing with Python&quot; (read my review) has an"},"aioseo_meta_data":{"post_id":"1244","title":null,"description":null,"keywords":null,"keyphrases":null,"primary_term":null,"canonical_url":null,"og_title":null,"og_description":null,"og_object_type":"default","og_image_type":"default","og_image_url":null,"og_image_width":null,"og_image_height":null,"og_image_custom_url":null,"og_image_custom_fields":null,"og_video":null,"og_custom_url":null,"og_article_section":null,"og_article_tags":null,"twitter_use_og":false,"twitter_card":"default","twitter_image_type":"default","twitter_image_url":null,"twitter_image_custom_url":null,"twitter_image_custom_fields":null,"twitter_title":null,"twitter_description":null,"schema":{"blockGraphs":[],"customGraphs":[],"default":{"data":{"Article":[],"Course":[],"Dataset":[],"FAQPage":[],"Movie":[],"Person":[],"Product":[],"ProductReview":[],"Car":[],"Recipe":[],"Service":[],"SoftwareApplication":[],"WebPage":[]},"graphName":"","isEnabled":true},"graphs":[]},"schema_type":"default","schema_type_options":null,"pillar_content":false,"robots_default":true,"robots_noindex":false,"robots_noarchive":false,"robots_nosnippet":false,"robots_nofollow":false,"robots_noimageindex":false,"robots_noodp":false,"robots_notranslate":false,"robots_max_snippet":null,"robots_max_videopreview":null,"robots_max_imagepreview":"large","priority":null,"frequency":null,"local_seo":null,"limit_modified_date":false,"created":"2023-02-04 16:18:28","updated":"2026-07-06 00:58:49","ai":null,"breadcrumb_settings":null,"seo_analyzer_scan_date":null},"aioseo_breadcrumb":"<div class=\"aioseo-breadcrumbs\"><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.garysieling.com\/blog\" title=\"Home\">Home<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/\" title=\"Data Mining\">Data 
Mining<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\tNLP Analysis in Python using Modal Verbs\n\t\t<\/span><\/div>","aioseo_breadcrumb_json":[{"label":"Home","link":"https:\/\/www.garysieling.com\/blog"},{"label":"Data Mining","link":"https:\/\/www.garysieling.com\/blog\/category\/data-mining\/"},{"label":"NLP Analysis in Python using Modal Verbs","link":"https:\/\/www.garysieling.com\/blog\/nlp-analysis-using-modal-verbs\/"}],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1244"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1244"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1244\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}