{"id":1368,"date":"2013-07-11T12:44:25","date_gmt":"2013-07-11T12:44:25","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1368"},"modified":"2013-07-11T12:44:25","modified_gmt":"2013-07-11T12:44:25","slug":"python-directory-list-index","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/python-directory-list-index\/","title":{"rendered":"Building a Directory Structure Index in Python"},"content":{"rendered":"<p>I&#8217;m working through examples in &#8220;<a href=\"http:\/\/www.amazon.com\/gp\/product\/0596516495\/ref=as_li_ss_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0596516495&#038;linkCode=as2&#038;tag=thesecrelifeo-20\">Natural Language Processing with Python<\/a>&#8221; (<a href=\"http:\/\/garysieling.com\/blog\/book-review-natural-language-processing-with-python\">read my review<\/a>) and found that the corpus I have to work with is large enough to require special performance tuning exercises.<\/p>\n<p>If you have a large enough directory structure, it becomes difficult to walk with os.walk &#8211; for instance, any failure in longer scripts requires starting from scratch. This is a common issue in larger systems &#8211; typically they manage file listings through a relational database, and hide the underlying directory storage in some way.<\/p>\n<p><a href=\"http:\/\/garysieling.com\/blog\/visualizing-six-million-documents\">In this environment<\/a> it takes about an hour for Windows to count the files, and Python seems to take even longer.<\/p>\n<p>It&#8217;s worth generating a list of the files in advance &#8211; this lists which PDFs and HTML documents exist, and for which a <a href=\"http:\/\/garysieling.com\/blog\/scraping-pdf-text-with-python\">text extract has been generated<\/a> (<a href=\"http:\/\/garysieling.com\/blog\/parsing-pdfs-at-scale-with-node-js-pdf-js-and-lunr-js\">see a Node.JS approach here<\/a>). 
This supports a few uses, including generating missing text renditions.<\/p>\n<pre lang=\"python\">\n\nimport os\nimport datetime\n\n# timestamp the start and end, since the walk can take hours\nprint(datetime.datetime.now())\n\npdf_idx = open('pdfs.idx', 'w')\nrendition_idx = open('txts.idx', 'w')\nhtml_idx = open('htmls.idx', 'w')\nxml_idx = open('xmls.idx', 'w')\n\nfor root, dirs, files in os.walk('.'):\n  for f in files:\n    if f.endswith(\".pdf\"):\n      # only index text renditions that actually exist on disk\n      rend = root + os.sep + f + \".textrendition.txt\"\n      try:\n          with open(rend):\n              rendition_idx.write(rend + \"\\n\")\n      except IOError:\n          pass\n      pdf_idx.write(root + os.sep + f + \"\\n\")\n    if f.endswith(\".html\") or f.endswith(\".htm\"):\n      html_idx.write(root + os.sep + f + \"\\n\")\n    if f.endswith(\".xml\"):\n      xml_idx.write(root + os.sep + f + \"\\n\")\n\nrendition_idx.close()\npdf_idx.close()\nhtml_idx.close()\nxml_idx.close()\n\nprint(datetime.datetime.now())\n<\/pre>\n<p>What this allows is quite useful &#8211; you can read the file quickly and select a random subset for training and test data for NLP algorithms. 
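The paired pdfs.idx and txts.idx files also make the &#8220;missing renditions&#8221; use straightforward. A minimal sketch &#8211; find_missing_renditions is a hypothetical helper (not from the script above), fed with the stripped lines of each index file:

```python
# Hypothetical helper: given the stripped lines of pdfs.idx and txts.idx,
# return the PDFs that have no text rendition yet.
def find_missing_renditions(pdf_paths, rendition_paths):
    # A rendition is assumed to sit next to its PDF, with the same name
    # plus a ".textrendition.txt" suffix, matching the walk above.
    have = set(rendition_paths)
    return [p for p in pdf_paths if p + '.textrendition.txt' not in have]
```

Feeding the resulting list back into the text-extraction step avoids re-walking the whole directory tree.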
This could also be done by storing all the names in a database, but the flat file is probably the simplest and fastest approach for my current needs.<\/p>\n<pre lang=\"python\">\n>>> rendition_idx = open('txts.idx', 'r')\n>>> files = [f[:-1] for f in rendition_idx]  # strip the trailing newline\n>>> rendition_idx.close()\n>>> len(files)\n124559\n\n>>> files[:4]\n['.\/00\/00\/gov.uscourts.rid.6064\/gov.uscourts.rid.6064.20.0.pdf.textrendition.txt', \n'.\/00\/01\/gov.uscourts.cacd.547806\/gov.uscourts.cacd.547806.6.0.pdf.textrendition.txt', \n'.\/00\/01\/gov.uscourts.oknd.31699\/gov.uscourts.oknd.31699.21.0.pdf.textrendition.txt', \n'.\/00\/01\/gov.uscourts.paed.406890\/gov.uscourts.paed.406890.19.0.pdf.textrendition.txt']\n\n>>> import random\n>>> random.shuffle(files)\n>>> files[:4]\n['.\/16\/63\/gov.uscourts.ded.48575\/gov.uscourts.ded.48575.1.0.pdf.textrendition.txt', \n'.\/09\/51\/gov.uscourts.casd.273674\/gov.uscourts.casd.273674.1.0.pdf.textrendition.txt', \n'.\/09\/aa\/gov.uscourts.hid.14739\/gov.uscourts.hid.14739.73.0.pdf.textrendition.txt', \n'.\/09\/57\/gov.uscourts.casd.268361\/gov.uscourts.casd.268361.4.0.pdf.textrendition.txt']\n\n>>> # read the first 100 shuffled files; f[2:] drops the leading \".\/\"\n>>> contents = [open(f[2:]).read() for f in files[:100]]\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m working through examples in &#8220;Natural Language Processing with Python&#8221; (read my review) and found that the corpus I have to work with is large enough to require special performance tuning exercises. 
If you have a large enough directory structure, it becomes difficult to walk with os.walk &#8211; for instance, any failure in longer scripts &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/python-directory-list-index\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Building a Directory Structure Index in Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,7],"tags":[385,421,447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1368"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1368"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1368\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}