nltk - Gary Sieling

U.S. law periodically names specific institutions. Historically it has been possible for Congress to write a law naming an individual, although I think that has become less common. I expected the most common entities named in federal law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.

To get at this information, we need to read the U.S. Code XML and use a natural language processing library to extract the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part-of-speech tagging, and named entity recognition. (If you are interested in the subject, see my review of “Natural Language Processing with Python”, a book which covers this library in detail.)

To achieve the results we want, we first parse one of the U.S. Code XML documents:

# ElementTree ships with the standard library as xml.etree
from xml.etree import ElementTree as ET
tree = ET.parse("G:\\us_code\\xml_uscAll@113-21\\usc01.xml")

Then we have to write a function to retrieve just the text nodes. I’ve started this at the <p> elements, which seems to give good results (i.e. paragraphs of laws, but not the headings).

def getText(node, depth):
  # Collect element text, starting from the USLM <p> elements at depth 0.
  # Only .text is read, so .tail text after a nested element is skipped.
  if node is None:
    return "" if depth == 0 else []

  result = []

  if depth == 0:
    elements = node.iter('{http://xml.house.gov/schemas/uslm/1.0}p')
  else:
    elements = node.iter()

  for child in elements:
    if child.text is not None:
      result.append(child.text)
    # Iterating an element yields its direct children
    for n in child:
      result = result + getText(n, depth + 1)
    result.append("\n")

  if depth == 0:
    return " ".join(result)
  else:
    return result

print(getText(tree.getroot(), 0))

The Committee on the Judiciary of the House of 
Representatives is authorized to print bills to 
codify, revise, and reenact the general and permanent 
laws relating to the District of Columbia and 
cumulative supplements thereto, similar in style, 
respectively, to the Code of Laws of the United States, 
and supplements thereto, and to so continue until final 
enactment thereof in both Houses of the Congress 
of the United States. 
 Pub. L. 90–226, title X

We can see in this some of the “entities” we expect to extract – “House of Representatives”, “District of Columbia”, “Code of Laws of the United States.”
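One caveat with getText: it collects only each element's .text, so any text that follows a nested element (that element's .tail) is skipped. A minimal sketch of an alternative, assuming the standard library's xml.etree.ElementTree and a made-up sample fragment (the SAMPLE string and the paragraph_texts name are illustrative, not from the original code):

```python
import xml.etree.ElementTree as ET

USLM = "{http://xml.house.gov/schemas/uslm/1.0}"

# Made-up fragment for illustration; real U.S. Code files are far larger.
SAMPLE = (
    '<section xmlns="http://xml.house.gov/schemas/uslm/1.0">'
    '<heading>Sample heading</heading>'
    '<p>each code: <i>Provided</i>, That copies are requested in writing.</p>'
    '</section>'
)

def paragraph_texts(root):
    """Return one whitespace-normalized string per <p> element."""
    texts = []
    for p in root.iter(USLM + "p"):
        # itertext() walks the element in document order and yields the
        # .text of each element plus the .tail of each descendant
        raw = "".join(p.itertext())
        texts.append(" ".join(raw.split()))
    return texts

root = ET.fromstring(SAMPLE)
print(paragraph_texts(root))
```

The heading is left out because only <p> elements are visited. If a document nested one <p> inside another, the inner text would appear twice, so this would need adjusting for such input.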

It takes a little work to get at this – we first need to parse the text into sentences (an alternative approach might be to just keep the paragraphs as separate sentences, or parse each individually).

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = getText(tree.getroot(), 0)
sentences = sent_tokenize(text)
len(sentences)
1319

 u'211 \n \n July 30, 1947 \n 1 U.S.C.',
 u'211 \n \n Copies of District of Columbia Code 
and Supplements not available to Senators or 
Representatives unless specifically requested by 
them, in writing, see  Pub.',
 u'L. 94\u201359, title VIII, \xa7\u202f801 
\n July 25, 1975 \n 89 Stat.',
 u'296 \n section 1317 of Title 44 \n \n 
 In addition the Superintendent of Documents shall, 
at the beginning of the first session of each Congress, 
supply to each Senator and Representative in such Congress, 
who may in writing apply for the same, one copy each of the
 Code of Laws of the United States, the Code of Laws relating 
to the District of Columbia, and the latest supplement to each
 code:  Provided \n And provided further \n \n For preparation
and editing an annual appropriation of $6,500 is authorized 
to carry out the purposes of sections 202 and 203 of this title. \n'
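Several of these "sentences" run across paragraph boundaries. One way to parse each piece individually is to collect text per section rather than as one long string, which also records where each passage came from. The sketch below assumes that section elements carry an identifier attribute; the sample fragment, its identifier values, and the text_by_section name are made up for illustration:

```python
import xml.etree.ElementTree as ET

USLM = "{http://xml.house.gov/schemas/uslm/1.0}"

# Made-up fragment; the identifier values are illustrative only.
SAMPLE = (
    '<title xmlns="http://xml.house.gov/schemas/uslm/1.0">'
    '<section identifier="/us/usc/t1/s201"><p>First section text.</p></section>'
    '<section identifier="/us/usc/t1/s202">'
    '<p>Second section,</p><p>two paragraphs.</p>'
    '</section>'
    '</title>'
)

def text_by_section(root):
    """Map each section's identifier to the joined text of its <p> elements."""
    sections = {}
    for section in root.iter(USLM + "section"):
        key = section.get("identifier")
        if key is None:
            # Assumption: sections without an identifier are skipped
            continue
        paragraphs = ["".join(p.itertext()).strip()
                      for p in section.iter(USLM + "p")]
        sections[key] = " ".join(paragraphs)
    return sections

for identifier, text in sorted(text_by_section(ET.fromstring(SAMPLE)).items()):
    print(identifier, text)
```

Each value could then be passed to sent_tokenize on its own, so a sentence can never straddle two sections.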

From there, we need to parse each sentence into constituent words. The value of this library is that it handles issues like punctuation, which would otherwise cause infinite misery.

words = [nltk.word_tokenize(sentence) for sentence in sentences]

 words[0]
Out[47]: 
[u'This',
 u'title',
 u'was',
 u'enacted',
 u'by',
 u'act',
 u'July',
 u'30',
 u',',
 u'1947',
 u',',
 u'ch',
 u'.']

Once we have the words, we need NLTK to guess at parts of speech – it considers more detailed categories than you may have learned in school; this added precision seems to help it get more accurate results in later steps.

tagged = [nltk.pos_tag(w) for w in words]
tagged[0]
Out[49]: 
[(u'This', 'DT'),
 (u'title', 'NN'),
 (u'was', 'VBD'),
 (u'enacted', 'VBN'),
 (u'by', 'IN'),
 (u'act', 'NN'),
 (u'July', 'NNP'),
 (u'30', 'CD'),
 (u',', ','),
 (u'1947', 'CD'),
 (u',', ','),
 (u'ch', 'JJ'),
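The tag names come from the Penn Treebank tag set. As a reading aid, here is a small lookup covering only the tags that show up in this post's output; the PENN_TAGS and describe names are just for illustration:

```python
# Descriptions of the Penn Treebank tags that appear in this post.
PENN_TAGS = {
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "MD": "modal",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
}

def describe(tagged_sentence):
    """Pair each word with a readable description of its tag."""
    # Tags missing from the lookup (e.g. punctuation) are passed through as-is
    return [(word, PENN_TAGS.get(tag, tag)) for word, tag in tagged_sentence]

sample = [("This", "DT"), ("title", "NN"), ("was", "VBD"), ("enacted", "VBN")]
for word, description in describe(sample):
    print("%-8s %s" % (word, description))
```

The split between VBD (past tense) and VBN (past participle) is the kind of distinction that goes beyond the parts of speech taught in school.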

And finally, we can look for “entities” in each sentence. NLTK returns what is, to me, an idiosyncratic result: a list-like tree whose items are either plain (word, tag) tuples or nested trees representing entities.

entities = [nltk.chunk.ne_chunk(t) for t in tagged]

entities[6]
Out[58]: Tree('S', [(u'990', 'CD'), 
(u'\u201cAll', 'JJ'), (u'Acts', 'NNS'), 
(u'of', 'IN'), Tree('ORGANIZATION', 
[(u'Congress', 'NNP')]), (u'referring', 'NN'), 
(u'to', 'TO'), (u'writs', 'NNS'), (u'of', 'IN'), 
(u'error', 'NN'), (u'shall', 'MD'), (u'be', 'VB'), 
(u'construed', 'VBN'), (u'as', 'IN'), 
(u'amended', 'VBN'), (u'to', 'TO'), 
(u'the', 'DT'), (u'extent', 'NN'), 
(u'necessary', 'JJ'), (u'to', 'TO'), 
(u'substitute', 'VB'), (u'appeal', 'NN'), 
(u'for', 'IN'), (u'writ', 'NN'), (u'of', 'IN'), 
(u'error.\u201d', 'NNP'), (u'2002\u2014', 'CD'), 
(u'Pub', 'NNP'), (u'.', '.')])

# NLTK 3 exposes the entity type as e.label(); NLTK 2 used the e.node attribute
[(e.label(), e.leaves()[0][0]) for e in entities[6]
 if isinstance(e, nltk.tree.Tree)]

Out[104]: [('ORGANIZATION', u'Congress')]

From here, I’ve defined a couple of simple utility functions to extract just the parts we need from the tree. At this point, inspecting the results makes it clear that there are a few downsides caused by lack of context: it seems to lose some stopwords (“House OF Representatives”) and we can’t correlate this back to which law the text was in.

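def entityStr(e):
  # Join the words of one entity subtree, dropping the POS tags
  return " ".join([word for (word, pos) in e.leaves()])

def getEntities(nodes):
  # Keep only the subtrees (entities); e.label() was e.node in NLTK 2
  return [(e.label(), entityStr(e))
    for e in nodes if isinstance(e, nltk.tree.Tree)]

# Entities per sentence, skipping sentences that have none
e = [entity for entity in
    [getEntities(node) for node in entities] if len(entity) > 0]

# Flatten the per-sentence lists into one list of (type, name) pairs
final = []
for lst in e:
  final = final + lst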

There are a few interesting examples here: in some cases NLTK was able to combine multi-word names successfully, but not in all cases. I think it loses track of the “of” in the center of some of them.

('ORGANIZATION', u'General Services')
('ORGANIZATION', u'Congress')
('ORGANIZATION', u'Representatives')
('GPE', u'United States')
('ORGANIZATION', u'Internal Revenue Code')
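One possible way to recover names like “House of Representatives” is to rejoin two entities when only a connecting word such as “of” sits between them. This is a rough sketch, not something NLTK provides: it assumes the chunked sentence has first been flattened into (text, label) pairs, with label set to None for ordinary words, and the merge_entities name is hypothetical.

```python
def merge_entities(tokens, connectors=("of",)):
    """Join 'entity connector entity' runs into a single entity.

    tokens is a list of (text, label) pairs; label is None for plain words.
    The merged entity keeps the label of its first part.
    """
    merged = []
    i = 0
    while i < len(tokens):
        text, label = tokens[i]
        if label is not None:
            # Absorb each "<connector> <entity>" pair directly after this entity
            while (i + 2 < len(tokens)
                   and tokens[i + 1][1] is None
                   and tokens[i + 1][0].lower() in connectors
                   and tokens[i + 2][1] is not None):
                text = " ".join([text, tokens[i + 1][0], tokens[i + 2][0]])
                i += 2
        merged.append((text, label))
        i += 1
    return merged

sample = [
    ("House", "ORGANIZATION"),
    ("of", None),
    ("Representatives", "ORGANIZATION"),
    ("is", None),
    ("authorized", None),
]
print(merge_entities(sample))
```

This is a blunt heuristic: it would also merge two unrelated entities that merely happen to be separated by “of”, so the results would still need a manual check.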

At last, we can count these and see who shows up the most:

from collections import Counter
Counter(final).most_common(20)
Out[152]: 
[(('ORGANIZATION', u'House'), 185),
 (('GPE', u'United States Code'), 127),
 (('ORGANIZATION', u'Congress'), 126),
 (('ORGANIZATION', u'Representatives'), 107),
 (('GPE', u'United States'), 96),
 (('ORGANIZATION', u'Committee'), 89),
 (('ORGANIZATION', u'OBRA'), 56),
 (('ORGANIZATION', u'Clerk'), 45),
 (('ORGANIZATION', u'Large'), 45),
 (('ORGANIZATION', u'Archivist'), 44),
 (('GPE', u'United States Statutes'), 43),
 (('ORGANIZATION', u'Senate'), 37),
 (('ORGANIZATION', u'House Administration'), 34),
 (('ORGANIZATION', u'Social'), 32),
 (('GPE', u'Pub'), 27),
 (('ORGANIZATION', u'PARCHMENT'), 20),
 (('ORGANIZATION', u'REQUIREMENT FOR'), 20),
 (('PERSON', u'Tables'), 17),
 (('ORGANIZATION', u'Public'), 17),
 (('ORGANIZATION', u'SUBSEQUENT'), 12)]

You can see there is some noise at the end there: “PARCHMENT”, “SUBSEQUENT”, etc. This is likely due to the legal profession’s obsession with using capital letters to represent bold text (where NLTK assumes a more standard use of English). This would likely be improved with some pre-processing on the texts. Notably, “Committee”, “Clerk”, and “Archivist” are popular; “Committee” would likely break out into specific committees if this were improved.
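As a rough idea of what that pre-processing could look like, the sketch below lowercases words written entirely in capitals before the text reaches the tokenizer, while leaving a small allowlist of acronyms alone. The soften_caps name and the KNOWN_ACRONYMS set are hypothetical, and the allowlist would have to be built by hand for real use.

```python
import re

# Hypothetical allowlist; a real one would need to be curated by hand.
KNOWN_ACRONYMS = {"OBRA"}

# Words made of two or more capital letters and nothing else
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")

def soften_caps(text, acronyms=KNOWN_ACRONYMS):
    """Lowercase all-caps words unless they are listed as acronyms."""
    def replace(match):
        word = match.group(0)
        return word if word in acronyms else word.lower()
    return CAPS_WORD.sub(replace, text)

print(soften_caps("REQUIREMENT FOR PARCHMENT under OBRA, Pub. L. 90"))
```

Mixed-case words such as “Pub” and single letters such as “L” are untouched. Roman numerals like “VIII” would be lowercased too unless they are added to the allowlist, so this trades one kind of noise for a smaller one.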


Finding Parties Named in U.S. Law using Python and NLTK