{"id":1753,"date":"2013-08-13T13:06:37","date_gmt":"2013-08-13T13:06:37","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1753"},"modified":"2020-03-30T02:43:56","modified_gmt":"2020-03-30T02:43:56","slug":"finding-parties-named-in-u-s-law-using-python-and-nltk","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/finding-parties-named-in-u-s-law-using-python-and-nltk\/","title":{"rendered":"Finding Parties Named in U.S. Law using Python and NLTK"},"content":{"rendered":"<p>U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.<\/p>\n<p>To get at this information, we need to read the <a href=\"http:\/\/garysieling.com\/blog\/u-s-code-available-in-xml-format\" rel=\"canonical\">Code XML<\/a>, and use a natural language processing library to get at the named groups.<\/p>\n<p>NLTK is such an NLP library. It provides interesting features like sentence parsing, part of speech tagging, and named entity recognition. (If interested in the subject see <a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\" rel=\"canonical\">my review<\/a> of &#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_il?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0596516495&amp;linkCode=as2&amp;tag=thesecrelifeo-20\" rel=\"nofollow\">Natural Language Processing with Python<\/a>&#8220;, a book which covers this library in detail)<\/p>\n<p>To achieve the results we want, we first parse one of the U.S. 
Code XML documents:<\/p>\n<pre lang=\"python\">from xml.etree import ElementTree as ET\ntree = ET.parse(\"G:\\\\us_code\\\\xml_uscAll@113-21\\\\usc01.xml\")\n<\/pre>\n<p>Then we have to write a function to retrieve just the text nodes. I&#8217;ve started this at the &lt;p&gt; elements, which seems to give good results (i.e. paragraphs of laws, but not the headings).<\/p>\n<pre lang=\"python\">def getText(node,depth):\n  if node is None:\n    return \"\"\n\n  result = []\n\n  if depth == 0:\n    iter = node.getiterator(tag='{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}p')\n  else:\n    iter = node.getiterator()\n\n  for child in iter:\n    if child.text is not None:\n      result.append(child.text)\n    if len(child.getchildren()) &gt; 0:\n      for n in child.getchildren():\n        result = result + getText(n, depth+1)\n    result.append(\"\\n\")\n\n  if depth == 0:\n    return \" \".join(result)\n  else:\n    return result\n\nprint getText(tree.getroot(),0)\n\nThe Committee on the Judiciary of the House of \nRepresentatives is authorized to print bills to \ncodify, revise, and reenact the general and permanent \nlaws relating to the District of Columbia and \ncumulative supplements thereto, similar in style, \nrespectively, to the Code of Laws of the United States, \nand supplements thereto, and to so continue until final \nenactment thereof in both Houses of the Congress \nof the United States. \n Pub. L. 
90\u2013226, title X \n<\/pre>\n<p>We can see in this some of the &#8220;entities&#8221; we expect to extract &#8211; &#8220;House of Representatives&#8221;, &#8220;District of Columbia&#8221;, &#8220;Code of Laws of the United States.&#8221;<\/p>\n<p>It takes a little work to get at this &#8211; we first need to parse the text into sentences (an alternative approach might be to just keep the paragraphs as separate sentences, or parse each individually).<\/p>\n<pre lang=\"python\">\nimport nltk\nfrom nltk.tokenize import word_tokenize, sent_tokenize\n\ntext = getText(tree.getroot(), 0)\nsentences = sent_tokenize(text)\nlen(sentences)\n1319\n\n u'211 \\n \\n July 30, 1947 \\n 1 U.S.C.',\n u'211 \\n \\n Copies of District of Columbia Code \nand Supplements not available to Senators or \nRepresentatives unless specifically requested by \nthem, in writing, see  Pub.',\n u'L. 94\\u201359, title VIII, \\xa7\\u202f801 \n\\n July 25, 1975 \\n 89 Stat.',\n u'296 \\n section 1317 of Title 44 \\n \\n \n In addition the Superintendent of Documents shall, \nat the beginning of the first session of each Congress, \nsupply to each Senator and Representative in such Congress, \nwho may in writing apply for the same, one copy each of the\n Code of Laws of the United States, the Code of Laws relating \nto the District of Columbia, and the latest supplement to each\n code:  Provided \\n And provided further \\n \\n For preparation\nand editing an annual appropriation of $6,500 is authorized \nto carry out the purposes of sections 202 and 203 of this title. \\n'\n<\/pre>\n<p>From there, we need to parse each sentence into constituent words. 
The value of this library is that it handles issues like punctuation, which would otherwise cause infinite misery.<\/p>\n<pre lang=\"python\">words = [nltk.word_tokenize(sentence) for sentence in sentences]\n\n words[0]\nOut[47]: \n[u'This',\n u'title',\n u'was',\n u'enacted',\n u'by',\n u'act',\n u'July',\n u'30',\n u',',\n u'1947',\n u',',\n u'ch',\n u'.']\n<\/pre>\n<p>Once we have the words, we need NLTK to guess at parts of speech &#8211; it considers more detailed categories than you may have learned in school; this added precision seems to help it get more accurate results in later steps.<\/p>\n<pre lang=\"python\">tagged = [nltk.pos_tag(w) for w in words]\n: tagged[0]\nOut[49]: \n[(u'This', 'DT'),\n (u'title', 'NN'),\n (u'was', 'VBD'),\n (u'enacted', 'VBN'),\n (u'by', 'IN'),\n (u'act', 'NN'),\n (u'July', 'NNP'),\n (u'30', 'CD'),\n (u',', ','),\n (u'1947', 'CD'),\n (u',', ','),\n (u'ch', 'JJ'),\n<\/pre>\n<p>And finally, we can look for &#8220;entities&#8221; in each sentence. NLTK returns what to me is an idiosyncratic result-  a list that contains either a tuple, or a tree representing the entity.<\/p>\n<pre lang=\"python\">entities = [nltk.chunk.ne_chunk(t) for t in tagged]\n\nentities[6]\nOut[58]: Tree('S', [(u'990', 'CD'), \n(u'\\u201cAll', 'JJ'), (u'Acts', 'NNS'), \n(u'of', 'IN'), Tree('ORGANIZATION', \n[(u'Congress', 'NNP')]), (u'referring', 'NN'), \n(u'to', 'TO'), (u'writs', 'NNS'), (u'of', 'IN'), \n(u'error', 'NN'), (u'shall', 'MD'), (u'be', 'VB'), \n(u'construed', 'VBN'), (u'as', 'IN'), \n(u'amended', 'VBN'), (u'to', 'TO'), \n(u'the', 'DT'), (u'extent', 'NN'), \n(u'necessary', 'JJ'), (u'to', 'TO'), \n(u'substitute', 'VB'), (u'appeal', 'NN'), \n(u'for', 'IN'), (u'writ', 'NN'), (u'of', 'IN'), \n(u'error.\\u201d', 'NNP'), (u'2002\\u2014', 'CD'), \n(u'Pub', 'NNP'), (u'.', '.')])\n\n[(e.node, e.leaves()[0][0]) for e in entities[6] \\\n if isinstance(e, nltk.tree.Tree)]\n\nOut[104]: [('ORGANIZATION', u'Congress')]\n<\/pre>\n<p>From here, I&#8217;ve defined 
a couple of simple utility functions to extract just the parts we need from the tree. At this point, inspecting the results makes it clear that there are a few downsides caused by a lack of context: it seems to lose some stopwords (&#8220;House OF Representatives&#8221;) and we can&#8217;t correlate this back to which law the text was in.<\/p>\n<pre lang=\"python\">def entityStr(e):\n  return \" \".join([word for (word, pos) in e.leaves()])\n\ndef getEntities(nodes):\n  return [(e.node, entityStr(e)) \\\n    for e in nodes if isinstance(e, nltk.tree.Tree)]\n\ne = [entity for entity in \\\n    [getEntities(node) for node in entities] if len(entity) &gt; 0 ]\n\nfinal = []\nfor lst in e:\n  final = final + lst\n<\/pre>\n<p>There are a few interesting examples here: in some cases NLTK was able to combine multi-word names successfully, but not in all cases. I think it loses track of the &#8220;of&#8221; in the center of some of them.<\/p>\n<pre>('ORGANIZATION', u'General Services')\n('ORGANIZATION', u'Congress')\n('ORGANIZATION', u'Representatives')\n('GPE', u'United States')\n('ORGANIZATION', u'Internal Revenue Code')\n<\/pre>\n<p>At last, we can count these and see who shows up the most:<\/p>\n<pre lang=\"python\">from collections import Counter\nCounter(final).most_common(20)\nOut[152]: \n[(('ORGANIZATION', u'House'), 185),\n (('GPE', u'United States Code'), 127),\n (('ORGANIZATION', u'Congress'), 126),\n (('ORGANIZATION', u'Representatives'), 107),\n (('GPE', u'United States'), 96),\n (('ORGANIZATION', u'Committee'), 89),\n (('ORGANIZATION', u'OBRA'), 56),\n (('ORGANIZATION', u'Clerk'), 45),\n (('ORGANIZATION', u'Large'), 45),\n (('ORGANIZATION', u'Archivist'), 44),\n (('GPE', u'United States Statutes'), 43),\n (('ORGANIZATION', u'Senate'), 37),\n (('ORGANIZATION', u'House Administration'), 34),\n (('ORGANIZATION', u'Social'), 32),\n (('GPE', u'Pub'), 27),\n (('ORGANIZATION', u'PARCHMENT'), 20),\n (('ORGANIZATION', u'REQUIREMENT FOR'), 20),\n (('PERSON', u'Tables'), 17),\n (('ORGANIZATION', 
u'Public'), 17),\n (('ORGANIZATION', u'SUBSEQUENT'), 12)]\n<\/pre>\n<p>You can see there is some noise at the end there &#8211; &#8220;PARCHMENT&#8221;, &#8220;SUBSEQUENT&#8221;, etc. This is likely due to the legal profession&#8217;s obsession with using capital letters to represent bold text (where NLTK assumes a more standard use of English). This would likely be improved with some pre-processing on the texts. Notably &#8220;Committee&#8221; and &#8220;Clerk&#8221; and &#8220;Archivist&#8221; are popular &#8211; likely the &#8220;Committee&#8221; would drop out into specific committees if this were improved.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/finding-parties-named-in-u-s-law-using-python-and-nltk\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Finding Parties Named in U.S. 
Law using Python and NLTK&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,6],"tags":[335,385,386,447,604],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1753"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1753"}],"version-history":[{"count":1,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1753\/revisions"}],"predecessor-version":[{"id":6474,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1753\/revisions\/6474"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1753"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1753"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1753"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}