Uncovering Lexical Relationships with Python and NLP

WordNet is a database containing hierarchies of certain types of lexical relationships – “a tree is part of a forest”, “a car is a type of motor vehicle”, “an engine is part of a car” (hypernyms, meronyms, and holonyms). “Natural Language Processing with Python” (read my review) suggests that you might discover these relationships in a corpus by searching for strings like “is a” and filtering the results down – thus discovering things that could be manually added to WordNet. Presumably this is how the database was first constructed.
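These relations are queryable directly through NLTK’s WordNet interface, which is handy for checking candidate pairs later. A quick illustration:

from nltk.corpus import wordnet as wn

# "a car is a type of motor vehicle" – hypernymy
wn.synset('car.n.01').hypernyms()
# [Synset('motor_vehicle.n.01')]

# "a tree is part of a forest" – holonymy (membership, in this case)
wn.synset('tree.n.01').member_holonyms()
# [Synset('forest.n.01')]

# "an engine is part of a car" – meronymy
wn.synset('car.n.01').part_meronyms()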

A simple regex search seems like it would suffice for this, but in practice just searching for strings generates a lot of noise. Rather than search the original text files, which would generate a lot of duplicates, I’m using an n-gram index I generated. This records the frequencies of phrases in 15,000 court cases, with garbage tokens already filtered from the text. You still see a lot of strings like “defendant is a flight risk”, which is interesting if you want to report on the case, but not for listing lexical relationships.
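I won’t cover building the index here, but a minimal sketch with collections.Counter and nltk.ngrams (the token stream and min_count cutoff are placeholders) looks something like:

from collections import Counter
import nltk

def build_ngram_index(tokens, n=4, min_count=2):
  # Count every n-gram in the token stream, then drop the rare ones.
  counts = Counter(' '.join(gram) for gram in nltk.ngrams(tokens, n))
  return {gram: c for gram, c in counts.items() if c >= min_count}

Each entry gets written out as “phrase count”, the format parsed further below.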

NLTK ships with several text corpora, so I joined my text data to two other datasets. The first is a stopwords list – this removes a lot of garbage entries, which probably make sense in context, but not here.

import nltk
from nltk import memoize

@memoize
def get_stopwords():
  # A set makes membership tests O(1); words() with no argument
  # returns the stopwords for every language NLTK ships.
  return set(nltk.corpus.stopwords.words())

def has_stopword(test_word):
  return test_word in get_stopwords()

I also use WordNet to check the known uses of a word, to see whether it can ever be a noun (a word like “mint” could be a noun or a verb, for instance, whereas “within” gets removed). A surprising number of words can be nouns (“have”, as in “haves” and “have nots”). I also remove hapaxes (words that occur only once) – this removes some people’s names and bogus misspellings.

@memoize
def get_all_words():
  return set(nltk.corpus.words.words())

def is_word(test_word):
  return test_word in get_all_words()

def can_be_noun(test_word):
  synsets = nltk.corpus.wordnet.synsets(test_word)
  if len(synsets) == 0:
    # Unknown to WordNet – give it the benefit of the doubt.
    return True
  for s in synsets:
    if s.pos() == 'n':  # pos() is a method in current NLTK releases
      return True
  return False
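The hapax filter isn’t shown above; a minimal version using NLTK’s FreqDist would be something like:

def remove_hapaxes(tokens):
  # FreqDist.hapaxes() lists the words that occur exactly once.
  fdist = nltk.FreqDist(tokens)
  hapaxes = set(fdist.hapaxes())
  return [t for t in tokens if t not in hapaxes]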

Note the use of sets above – this makes the lookups faster. The memoize decorator comes from NLTK.

Now that we’ve defined the filter functions, we can look through the file. Each line of the file has the form “phrase count”, e.g. “law 321” or “a black cat 7”, depending on which n-gram file you’re looking at. The order in which the tests run matters a bit, since WordNet lookups take longer than the rest, so the cheap checks go first.

def get_words():
  relationships = ["is a", "forms a", "contains a"]
  # The 'rU' mode is gone in Python 3; the default 'r' is fine.
  with open('4gram') as grams:
    for line in grams:
      vals = line.split(' ')
      ngram = ' '.join(vals[0:-1])
      count = int(vals[-1])
      if count <= 3:  # skip rare phrases
        continue
      for rln in relationships:
        segments = ngram.split(' ' + rln + ' ')
        if len(segments) != 2:
          continue
        begin_segment, end_segment = segments
        # Cheap checks first; the WordNet lookup is the slow one.
        if is_word(begin_segment) and is_word(end_segment):
          if not has_stopword(begin_segment) and not has_stopword(end_segment):
            if can_be_noun(begin_segment) and can_be_noun(end_segment):
              yield ngram

list(get_words())

This still generates a whole lot of garbage, but now a manageable amount.

['patent contains a total', 'relief is a remedy', 'reason is a pretext', 'stingray is a device', 'complaint contains a statement', 'belief is a citizen', 'class is a citizen', 'plaintiff is a former', 'international is a corporation', 'change is a question', 'action is a question', 'wireless is a service', 'jurisdiction is a doctrine', 'order is a comprehensive', 'way is a federal', 'contract is a question', 'corporation is a subsidiary', 'subsection is a fourth', 'server is a pen', 'debtor is a general', 'debtor is a name', 'present is a flow', 'plaintiff is a prisoner', 'counsel is a sole', 'company is a party', 'debtor is a director', 'plaintiff is a limited', 'defendant is a career', 'judge is a final', 'injunction is a true', 'meeting is a declaration', 'exhibit is a police', 'machinery is a parent', 'claim is a claim', 'petitioner is a partnership', 'plaintiff is a consumer', 'defendant is a debt', 'petitioner is a corporation', 'h is a re', 'v is a b', 'permit is a criminal', 'copy is a separate', 'probability is a probability', 'defendant is a citizen', 'information is a criminal', 'plaintiff is a natural', 'compensation is a benefit', 'corporation is a corporation', 'guidance is a binding', 'tacking is a question', 'exhibit is a copy', 'price is a matter', 'b is a true', 'specimen is a material', 'plaintiff is a corporation', 'stingray is a generic', 'paragraph contains a statement', 'form contains a certification', 'court is a motion', 'record is a picture', 'petitioner is a state', 'offense is a crime', 'child is a party', 'defendant is a danger', 'child is a codebtor', 'defendant is a flight', 'company is a corporation', 'following is a list', 'objection is a judge', 'debtor is a debtor', 'debtor is a organization', 'plaintiff is a resident', 'paragraph contains a description', 'plaintiff is a state', 'debtor is a small', 'following is a summary', 'plaintiff is a member', 'works is a student', 'tab is a true', 'child is a creditor', 'plaintiff is a citizen', 'exhibit is a true', 'complaint is a statement', 'debtor is a partnership', 'debtor is a corporation']

If you filter this down manually, you get some real relationships – though not all of these are what they appear to be; some represent cropped noun phrases (“defendant is a flight” is really “defendant is a flight risk”). A better approach to this problem would be to first tag the texts with parts of speech, then use the tags to determine where noun phrases begin and end. One could then filter to certain arrangements of parts of speech (noun phrase / verb / noun phrase) so that the noun phrases are kept whole – see the sketch after the list below.

['relief is a remedy', 'reason is a pretext', 'complaint contains a statement', 'wireless is a service', 'jurisdiction is a doctrine', 'corporation is a subsidiary', 'company is a party', 'meeting is a declaration', 'plaintiff is a consumer', 'compensation is a benefit', 'specimen is a material', 'paragraph contains a statement', 'form contains a certification', 'offense is a crime', 'company is a corporation', 'paragraph contains a description']
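Here is a rough sketch of that tag-and-chunk approach, using NLTK’s off-the-shelf tagger and a small chunk grammar (both the grammar and the pattern matching are simplifications, and the tagger models need the usual nltk.download calls first):

import nltk

# Optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')

def np_relationships(sentence):
  tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
  tree = chunker.parse(tagged)
  # Flatten the chunk tree into NPs and plain tokens.
  items = []
  for node in tree:
    if isinstance(node, nltk.Tree):
      items.append(('NP', ' '.join(w for w, t in node.leaves())))
    else:
      items.append(('TOK', node[0].lower()))
  # Yield NP / "is" / NP triples; the "a" lands inside the second NP.
  for i in range(len(items) - 2):
    if items[i][0] == 'NP' and items[i+1] == ('TOK', 'is') \
        and items[i+2][0] == 'NP':
      yield (items[i][1], items[i+2][1])

list(np_relationships('The stingray is a generic surveillance device'))
# Should yield [('The stingray', 'a generic surveillance device')]

Because the chunker keeps “a generic surveillance device” together as one noun phrase, the cropped-phrase problem above mostly disappears.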

This technique can also be used to find opposites (antonymy) or entailment (a verb that implies another verb, either because the action contains the other action, or because they are synonyms).
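Both of these are also already queryable in WordNet, which gives a sense of what the extracted pairs should look like:

from nltk.corpus import wordnet as wn

# Antonyms hang off lemmas rather than synsets.
wn.synset('good.a.01').lemmas()[0].antonyms()
# [Lemma('bad.a.01.bad')]

# Entailment: snoring entails sleeping.
wn.synset('snore.v.01').entailments()
# [Synset('sleep.v.01')]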
