python - Gary Sieling

Modal verbs are auxiliary verbs which indicate semantic information about an action, i.e. likelihood (will, should) , permission (could, may), obligation (shall/must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything.

“Natural Language Processing with Python” (read my review) has an example of how to start this process, comparing verb frequencies across various genres of text using the Brown corpus, a well-known collection of texts assembled in the 60’s for language research.

I extended the example to include an additional corpus of court cases, and extra helper verbs- This includes the contents of ~15,000 legal documents.

We first define a function to retrieve genres of literature, and a second to retrieve words from the genre. For the legal documents, I am reading from an index I previously built of n-grams (i.e. word/phrase counts).

import nltk
import os

def get_genres():
yield 'legal'
for genre in brown.categories():
yield genre

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']

def get_words(genre):
  if (genre == 'legal'):
    grams = open('1gram', 'rU')
    for line in grams:
      vals = line.split(' ')
      word = vals[0]
      count = int(vals[1])
      if (word in modals):
        for index in range(0, count):
          yield word
        else:
          yield word
    grams.close()
  else:
    for word in brown.words(categories=genre):
      yield word

The Natural Language Toolkit provides a class for tracking frequencies of “experiment” results – here we track the use of different verb tenses.

cfd = nltk.ConditionalFreqDist(
  (genre, word)
  for genre in get_genres()
  for word in get_words(genre)
)

genres = [g for g in get_genres()]
cfd.tabulate(conditions=genres, samples=modals)

cfd.tabulate(conditions=genres, samples=modals)

The tabulate method is provided by NTLK, and makes a nicely formatted chart (in a command line it makes everything line up neatly)

	can	could	may	might	must	will	would	should
legal	13059	7849	26968	1762	15974	20757	19931	13916
adventure	46	151	5	58	27	50	191	15
belles_lettres	246	213	207	113	170	236	392	102
editorial	121	56	74	39	53	233	180	88
fiction	37	166	8	44	55	52	287	35
government	117	38	153	13	102	244	120	112
hobbies	268	58	131	22	83	264	78	73
humor	16	30	8	8	9	13	56	7
learned	365	159	324	128	202	340	319	171
lore	170	141	165	49	96	175	186	76
mystery	42	141	13	57	30	20	186	29
news	93	86	66	38	50	389	244	59
religion	82	59	78	12	54	71	68	45
reviews	45	40	45	26	19	58	47	18
romance	74	193	11	51	45	43	244	32
science_fiction	16	49	4	12	8	16	79	3

Looking at these numbers, it is clear that we need to add a concept of normalization. My added corpus has a lot more tokens than the Brown corpus, which makes it hard to compare across.

The frequency distribution class exists to count things, and I didn’t see a good way to normalize the rows. I re-wrote the tabulate function to do this – it simply finds the max for each row, divides by that, and multiplies by 100.

def tabulate(cfd, conditions, samples):
  max_len = max(len(w) for w in conditions)
  sys.stdout.write(" " * (max_len + 1))
  for c in samples:
    sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
  for c in conditions:
    sys.stdout.write(" " * (max_len - len(c)))
    sys.stdout.write("%-s" % c)
    sys.stdout.write(" ")
    dist = cfd[c]
    norm = sum([dist[w] for w in modals])
    for s in samples:
      value = 100 * dist[s] / norm
      sys.stdout.write("%-d\t" % value)
      sys.stdout.write("\n")

tabulate(cfd, genres, modals)

This makes it easier to scan up and down the chart-

	can	could	may	might	must	will	would	should
legal	10	6	22	1	13	17	16	11
adventure	8	27	0	10	4	9	35	2
belles_lettres	14	12	12	6	10	14	23	6
editorial	14	6	8	4	6	27	21	10
fiction	5	24	1	6	8	7	41	5
government	13	4	17	1	11	27	13	12
hobbies	27	5	13	2	8	27	7	7
humor	10	20	5	5	6	8	38	4
learned	18	7	16	6	10	16	15	8
lore	16	13	15	4	9	16	17	7
mystery	8	27	2	11	5	3	35	5
news	9	8	6	3	4	37	23	5
religion	17	12	16	2	11	15	14	9
reviews	15	13	15	8	6	19	15	6
romance	10	27	1	7	6	6	35	4
science_fiction	8	26	2	6	4	8	42	1

One thing this makes clear is most genres have numerous references to ‘would’ and few have ‘should’.

It might be nice to see these on a scale of 1-10 – seeing the columns of numbers communicates something in the lengths.

def tabulate(cfd, conditions, samples):
  max_len = max(len(w) for w in conditions)
  sys.stdout.write(" " * (max_len + 1))
  for c in samples:
    sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
    for c in conditions:
      sys.stdout.write(" " * (max_len - len(c)))
      sys.stdout.write("%-s" % c)
      sys.stdout.write(" ")
      dist = cfd[c]
      norm = sum([dist[w] for w in modals])
  for s in samples:
    value = 10 * float(dist[s]) / norm
    sys.stdout.write("%.1f\t" % value)
    sys.stdout.write("\n")

tabulate(cfd, genres, modals)

	can	could	may	might	must	will	would	should
legal	1.1	0.7	2.2	0.1	1.3	1.7	1.7	1.2
adventure	0.8	2.8	0.1	1.1	0.5	0.9	3.5	0.3
belles_lettres	1.5	1.3	1.2	0.7	1.0	1.4	2.3	0.6
editorial	1.4	0.7	0.9	0.5	0.6	2.8	2.1	1.0
fiction	0.5	2.4	0.1	0.6	0.8	0.8	4.2	0.5
government	1.3	0.4	1.7	0.1	1.1	2.7	1.3	1.2
hobbies	2.7	0.6	1.3	0.2	0.8	2.7	0.8	0.7
humor	1.1	2.0	0.5	0.5	0.6	0.9	3.8	0.5
learned	1.8	0.8	1.6	0.6	1.0	1.7	1.6	0.9
lore	1.6	1.3	1.6	0.5	0.9	1.7	1.8	0.7
mystery	0.8	2.7	0.3	1.1	0.6	0.4	3.6	0.6
news	0.9	0.8	0.6	0.4	0.5	3.8	2.4	0.6
religion	1.7	1.3	1.7	0.3	1.2	1.5	1.4	1.0
reviews	1.5	1.3	1.5	0.9	0.6	1.9	1.6	0.6
romance	1.1	2.8	0.2	0.7	0.6	0.6	3.5	0.5
science_fiction	0.9	2.6	0.2	0.6	0.4	0.9	4.2	0.2

It would be nice to see how similar these genres are – we can compute that by imagining the counts of modals as describing vectors. The angle between vectors approximates “similarity”. The nice thing about this measure is that it removes other words (words which may only exist in one text – some of this will be due to how well the data is cleaned, which does not reflect on the genre of literature).

import math

def distance(cfd, conditions, samples, base):
  base_cond = cfd[base]
  base_vector = [base_cond[w] for w in samples]
  base_length = math.sqrt(sum(a * a for a in base_vector))
  for c in conditions:
    cond = cfd[c]
    cond_vector = [cond[w] for w in samples]
    dotp = sum(a * b for (a,b) in zip(base_vector, cond_vector))
    cond_length = math.sqrt(sum(a * a for a in cond_vector))
    angle = math.acos(dotp / (cond_length * base_length))
    percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
    print "%-s similarity to %-s: %-.1f" % (c, base, percent)

The result are interesting – the genres showing closes to legal in this case are government and religion.

As an interesting side-note, belles_lettres means “fine writing”, i.e. poems, drama, fiction.

legal similarity to legal: 100.0
adventure similarity to legal: 41.6
belles_lettres similarity to legal: 72.4
editorial similarity to legal: 68.8
fiction similarity to legal: 42.9
government similarity to legal: 80.6
hobbies similarity to legal: 63.5
humor similarity to legal: 50.1
learned similarity to legal: 80.6
lore similarity to legal: 78.6
mystery similarity to legal: 41.3
news similarity to legal: 58.1
religion similarity to legal: 81.2
reviews similarity to legal: 73.5
romance similarity to legal: 42.9
science_fiction similarity to legal: 41.8

Some genres appear similar to legal documents – it is possible, however, that some verbs are not independent. For instance, you might see “may” and “might” with equal similarity. One way to test this might be to flip what we track for distance (make a vector for each modal, rather than genre)

The following code tracks the distance between each modal and the mean, using the different genres as dimensions. Since each of them contributes to the mean somewhat, there is guaranteed to be some similarity, but note that some are closer than others. Note also that these have to be normalized, like the last example, or the answer will be defined by the ‘legal’ genre.

def distance(cfd, conditions, samples):
  base_vector = [0.0 for w in conditions]
  norm = {}
  for c_i in range(0, len(conditions)):
    cond_name = conditions[c_i]
    cond = cfd[cond_name]
    norm[cond_name] = float(sum(cond[s] for s in samples))
    for s in samples:
      base_vector[c_i] = base_vector[c_i] + float(cond[s]) / norm[cond_name]
      base_length = math.sqrt(sum(a * a for a in base_vector))
  for s in samples: # compute each vector - which, might, etc
    sample_vector = []
    for c in conditions: # find condition for each vector
      sample_vector.append(cfd[c][s] / norm[c])
      dotp = sum(a * b for (a,b) in zip(base_vector, sample_vector))
      sample_length = math.sqrt(sum(a * a for a in sample_vector))
      angle = math.acos(dotp / (sample_length * base_length))
      percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
      print "%-s similarity to mean: %-.1f" % (s, percent)

distance(cfd, genres, modals)

What I’d infer from this is that the least helpful verb for distinguishing genres is “must,” and the most helpful is “may.”

can similarity to mean: 76.0
could similarity to mean: 67.6
may similarity to mean: 61.5
might similarity to mean: 70.0
must similarity to mean: 79.7
will similarity to mean: 67.7
would similarity to mean: 73.6
should similarity to mean: 74.2

Tag: python

NLP Analysis in Python using Modal Verbs