{"id":1547,"date":"2013-08-05T12:52:58","date_gmt":"2013-08-05T12:52:58","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1547"},"modified":"2020-03-31T00:46:31","modified_gmt":"2020-03-31T00:46:31","slug":"counting-citations-in-u-s-law","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/counting-citations-in-u-s-law\/","title":{"rendered":"Counting Citations in U.S. Law"},"content":{"rendered":"<p>The U.S. Congress recently released a series of XML documents containing U.S. Laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently allows us to see what Congress has spent the most time thinking about.<\/p>\n<p>Citations occur for many reasons: a justification for addition or omission in subsequent laws, clarifications, amendments, or repeals. As we might expect, the most commonly cited sections involve the IRS (Income Taxes, specifically), Social Security, and Military Procurement.<\/p>\n<p>To arrive at this result, we must first see how the U.S. Code is laid out. The laws are divided into a hierarchy of units, which allows anything from an entire title to an individual sentence to be cited. 
Each unit has an ID and an identifier. The &#8220;identifier&#8221; is used as a citation reference within the XML documents and has a different form from the citations used by the legal community, which look like &#8220;25 USC Chapter 21 \u00a7\u20091901&#8221;.<\/p>\n<p>The XML hierarchy defines seventeen different levels which can be cited: &#8216;title&#8217;, &#8216;subtitle&#8217;, &#8216;chapter&#8217;, &#8216;subchapter&#8217;, &#8216;part&#8217;, &#8216;subpart&#8217;, &#8216;division&#8217;, &#8216;subdivision&#8217;, &#8216;article&#8217;, &#8216;subarticle&#8217;, &#8216;section&#8217;, &#8216;subsection&#8217;, &#8216;paragraph&#8217;, &#8216;subparagraph&#8217;, &#8216;clause&#8217;, &#8216;subclause&#8217;, and &#8216;item&#8217;.<\/p>\n<p>We can use a simple XPath expression to retrieve one of these, like section:<\/p>\n<pre>.\/\/{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}section\n\n&lt;section id=\"id23c4f9a6-f5a0-11e2-8dfe-b6d89e949a2c\" identifier=\"\/us\/usc\/t49\/s104\"&gt;\n&lt;num value=\"104\"&gt;\u00a7\u202f104.&lt;\/num&gt;\n&lt;heading&gt; Federal Highway Administration&lt;\/heading&gt;\n&lt;subsection class=\"indent0\" id=\"id23c4f9a7-f5a0-11e2-8dfe-b6d89e949a2c\" identifier=\"\/us\/usc\/t49\/s104\/a\"&gt;\n&lt;num value=\"a\"&gt;(a)&lt;\/num&gt;\n&lt;content&gt; The Federal Highway Administration is an administration\nin the Department of Transportation.&lt;\/content&gt;\n&lt;\/subsection&gt;<\/pre>\n<p>A portion of the human-readable citation is contained in &#8220;num&#8221;. 
To retrieve a citation that a lawyer would recognize, we need to look at &#8220;num&#8221; for the parent element as well.<\/p>\n<pre lang=\"python\">from xml.etree import ElementTree as ET\nimport os\n\ndir = \"G:\\\\us_code\\\\xml_uscAll@113-21\"\n\ndef getParent(parent_map, elt, idx):\n  # Walk idx levels up from elt and return that ancestor's num and heading text.\n  try:\n    parent = elt\n    for i in range(idx):\n      parent = parent_map.get(parent)\n\n    # Parentheses let the expression span several lines.\n    return (\n      parent.findall('{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}num')[0].text +\n      ' ' +\n      parent.findall('{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}heading')[0].text)\n  except (AttributeError, IndexError, TypeError):\n    # We walked past the root, or this node has no num or heading.\n    return \"--No Heading--\"\n<\/pre>\n<p>Once we find the parent, we need to traverse all the way up the tree:<\/p>\n<pre lang=\"python\">def getTree(parent_map, t):\n  # Collect headings from t upward; the sentinel marks the top of the tree.\n  tree = []\n  parent = \"\"\n  idx = 0\n  while (parent != \"--No Heading--\"):\n    parent = getParent(parent_map, t, idx)\n    tree.append(parent)\n    idx += 1\n  return tree\n\n# Example hierarchy: usc26.xml: Title 26\u2014 Subtitle A\u2014 CHAPTER 1\u2014\n<\/pre>\n<p>This forms the basis for a function which builds a citation index &#8211; a list of every XML node that can be used in a citation, along with its human-readable citation and name. 
This takes some time, so if you reproduce this effort, you may want to save the results to a file.<\/p>\n<pre lang=\"python\">dir = \"G:\\\\us_code\\\\xml_uscAll@113-21\"\nurls = {}\n\ndef findElements(xpath, urls):\n  # Index every element matching xpath by its identifier attribute.\n  for root, dirs, files in os.walk(dir):\n    for f in files:\n      if f.endswith('.xml'):\n        # Join with root so files in subdirectories resolve correctly.\n        tree = ET.parse(os.path.join(root, f))\n        # ElementTree has no parent pointers, so build a child-to-parent map.\n        parent_map = dict((c, p) for p in tree.iter() for c in p)\n        sections = tree.findall(xpath)\n        for t in sections:\n          urls[t.attrib.get('identifier')] = (\n            t.attrib.get('id'),\n            getTree(parent_map, t),\n            f)\n\nrefs = {}\nrefTypes = ['title', 'subtitle', 'chapter',\n  'subchapter', 'part', 'subpart', 'division',\n  'subdivision', 'article', 'subarticle', 'section',\n  'subsection', 'paragraph', 'subparagraph', 'clause',\n  'subclause', 'item']\n\nfor ref in refTypes:\n  findElements('.\/\/{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}' + ref, refs)\n\nlist(refs.items())[20]\n('\/us\/usc\/t2\/s2102\/b',\n ('id8a923648-f59b-11e2-8dfe-b6d89e949a2c',\n  ['(b)  Issuance and publication of regulations',\n   u'\\xa7\\u202f2102.  Duties of Commission',\n   u'Part B\\u2014 Senate Commission on Art',\n   u'SUBCHAPTER V\\u2014 HISTORICAL PRESERVATION AND FINE ARTS',\n   u'CHAPTER 30\\u2014 OPERATION AND MAINTENANCE OF CAPITOL COMPLEX',\n   u'Title 2\\u2014 THE CONGRESS',\n   '--No Heading--'],\n  'usc02.xml'))\n<\/pre>\n<p>Now that we know how to look up a citation, we need to find the actual citations. Like HTML, the U.S. Code documents use &#8220;a href=&#8221; to reference a node, as well as &#8220;ref href=&#8221;. 
The same XPath technique used above allows us to find refs:<\/p>\n<pre lang=\"python\">hrefs = {}\ntitles = {}\nrefpath = '.\/\/{http:\/\/xml.house.gov\/schemas\/uslm\/1.0}ref'\nfor root, dirs, files in os.walk(dir):\n  for f in files:\n    if f.endswith('.xml'):\n      tree = ET.parse(os.path.join(root, f))\n      # Map each href to the file name plus link text; empty refs have no text.\n      h = {t.attrib.get('href'): f + ' ' + (t.text or '')\n          for t in tree.findall(refpath)}\n      hrefs.update(h)\n\n\nlist(hrefs.items())[0]\nOut[55]: \n('\/us\/pl\/109\/280\/s601\/a\/3',\n u'usc29.xml Pub. L. 109\\u2013280, title VI, \\xa7\\u202f601(a)(3)')\n<\/pre>\n<p>We have everything we need to find which sections are commonly cited; we just need to combine the pieces. Most of the complexity here is dealing with missing entries (e.g. because a citation can point anywhere in the hierarchy).<\/p>\n<pre lang=\"python\">from collections import Counter\n\ndef countCitations(urls, hrefs):\n  titles = Counter()\n  subtitles = Counter()\n  chapters = Counter()\n  not_found = []\n  for key in hrefs.keys():\n    found = urls.get(key)\n \n    title = \"None\"\n    subtitle = \"None\"\n    chapter = \"None\"\n    file = \"None\"   \n\n    if found is not None:\n      (id, history, file) = found\n      # history runs from the cited node up to the sentinel, so count from the end.\n      if len(history) &gt;= 2:\n        title = history[-2]\n        if len(history) &gt;= 3:\n          subtitle = history[-3]\n          if len(history) &gt;= 4:\n            chapter = history[-4]     \n    else:\n      not_found.append(key)\n      \n    titles[file + \": \" + title] += 1    \n    subtitles[file + \": \" + title + \" - \" + subtitle] += 1    \n    chapters[file + \": \" + title + \" - \" + subtitle + \" - \" + chapter] += 1    \n  return (titles, subtitles, chapters, not_found)\n\n(t, s, c, none) = countCitations(refs, hrefs)\n<\/pre>\n<p>This returns results that are rolled up to titles, subtitles, and chapters. 
In particular, note how, as we drill down, each level clarifies what was most important in the prior one. Within &#8220;The Public Health and Welfare,&#8221; we see that Social Security is important, and within &#8220;Armed Forces,&#8221; we see that &#8220;General Military Law &#8211; Personnel&#8221; is important.<\/p>\n<pre>None: None: 359662\nusc42.xml: Title 42\u2014 THE PUBLIC HEALTH AND WELFARE: 6679\nusc10.xml: Title 10\u2014 ARMED FORCES: 2078\nusc16.xml: Title 16\u2014 CONSERVATION: 2068\nusc42.xml: None: 1965\nusc15.xml: Title 15\u2014 COMMERCE AND TRADE: 1796\nusc07.xml: Title 7\u2014 AGRICULTURE: 1689\nusc22.xml: Title 22\u2014 FOREIGN RELATIONS AND INTERCOURSE: 1684\nusc20.xml: Title 20\u2014 EDUCATION: 1660\nusc26.xml: Title 26\u2014 INTERNAL REVENUE CODE: 1610\n<\/pre>\n<pre>None: None - None: 359662\nusc42.xml: None - None: 1965\nusc42.xml: Title 42\u2014 THE PUBLIC HEALTH AND WELFARE - CHAPTER 7\u2014 SOCIAL SECURITY: 1573\nusc10.xml: Title 10\u2014 ARMED FORCES - Subtitle A\u2014 General Military Law: 1490\nusc42.xml: Title 42\u2014 THE PUBLIC HEALTH AND WELFARE - CHAPTER 6A\u2014 PUBLIC HEALTH SERVICE: 1220\nusc26.xml: Title 26\u2014 INTERNAL REVENUE CODE - Subtitle A\u2014 Income Taxes: 841\nusc05.xml: None - None: 736\nusc10.xml: None - None: 639\nusc16.xml: Title 16\u2014 CONSERVATION - CHAPTER 1\u2014 NATIONAL PARKS, MILITARY PARKS, MONUMENTS, AND SEASHORES: 616\nusc20.xml: Title 20\u2014 EDUCATION - CHAPTER 70\u2014 STRENGTHENING AND IMPROVEMENT OF ELEMENTARY AND SECONDARY SCHOOLS: 531\n<\/pre>\n<pre>None: None - None - None: 359662\nusc42.xml: None - None - None: 1965\nusc26.xml: Title 26\u2014 INTERNAL REVENUE CODE - Subtitle A\u2014 Income Taxes - CHAPTER 1\u2014 NORMAL TAXES AND SURTAXES: 817\nusc10.xml: Title 10\u2014 ARMED FORCES - Subtitle A\u2014 General Military Law - PART II\u2014 PERSONNEL: 738\nusc05.xml: None - None - None: 736\nusc42.xml: Title 42\u2014 THE PUBLIC HEALTH AND WELFARE - CHAPTER 7\u2014 SOCIAL 
SECURITY - SUBCHAPTER XVIII\u2014 HEALTH INSURANCE FOR AGED AND DISABLED: 663\nusc10.xml: None - None - None: 639\nusc38.xml: None - None - None: 497\nusc10.xml: Title 10\u2014 ARMED FORCES - Subtitle A\u2014 General Military Law - PART IV\u2014 SERVICE, SUPPLY, AND PROCUREMENT: 496\nusc15.xml: None - None - None: 428\n<\/pre>\n<p>Future work in this area will involve cleaning up the results to remove some of the &#8220;None&#8221; entries, building a visualization of the results, and training a tagger to recognize the human-readable versions of citations in court documents. In the long run, I hope these developments help make legal information more accessible to everyone, rather than being locked up in expensive databases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The U.S. Congress recently released a series of XML documents containing U.S. Laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently allows us to see what Congress has spent the most time thinking about. Citations occur for many reasons: &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/counting-citations-in-u-s-law\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Counting Citations in U.S. 
Law&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,6],"tags":[335,447,495],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1547"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1547"}],"version-history":[{"count":1,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1547\/revisions"}],"predecessor-version":[{"id":6475,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1547\/revisions\/6475"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1547"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1547"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1547"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}