{"id":999,"date":"2013-05-06T23:18:51","date_gmt":"2013-05-06T23:18:51","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=999"},"modified":"2013-05-06T23:18:51","modified_gmt":"2013-05-06T23:18:51","slug":"extracting-pdf-text-with-scala","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/extracting-pdf-text-with-scala\/","title":{"rendered":"Extracting PDF text with Scala"},"content":{"rendered":"<p>This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn&#8217;t seem to have the ability to fill in interface methods on an object.<\/p>\n<pre lang=\"Scala\">\nimport java.io._\n\nimport org.apache.tika.parser.pdf._\nimport org.apache.tika.metadata._\nimport org.apache.tika.parser._\nimport org.xml.sax._\n\nobject pdfHandler extends ContentHandler {\n\tdef characters(ch : Array[Char], start: Int, length: Int) {\n\t\t\/\/ Use only the slice SAX reports as valid; the rest of ch may hold stale data.\n\t\tprintln(new String(ch, start, length))\n\t} \n    \n\tdef endDocument() {\n\t} \n    \n\tdef endElement(uri: String, localName: String, qName: String) {\n\t} \n    \n\tdef endPrefixMapping(prefix: String) {\n\t} \n      \n\tdef ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {\n\t} \n      \n\tdef processingInstruction(target: String, data: String) {\n\t} \n      \n\tdef setDocumentLocator(locator: Locator) {\n\t} \n      \n\tdef skippedEntity(name: String) {\n\t} \n      \n\tdef startDocument() {\n\t} \n      \n\tdef startElement(uri: String, localName: String, qName: String, atts: Attributes) {\n\t} \n      \n\tdef startPrefixMapping(prefix: String, uri: String) {\n\t}      \n}\n\nobject pdf extends App {\n\tval folder = \"\"\"\\\\nas\\Files\\Data\\pacer2\\\"\"\"\n\tval subfolder = \"\"\"\\00\\00\\gov.uscourts.rid.6064\\\"\"\"\n\tval file = \"\"\"gov.uscourts.rid.6064.20.0.pdf\"\"\"\n\t  
\n\tval pdf : PDFParser = new PDFParser()\n\t\n\tval stream : InputStream = new FileInputStream(folder + subfolder + file)\n\tval handler : ContentHandler = pdfHandler\n\tval metadata : Metadata = new Metadata()\n\tval context : ParseContext = new ParseContext()\n\t\n\ttry {\n\t\tpdf.parse(stream,\n\t\t\thandler,\n\t\t\tmetadata,\n\t\t\tcontext)\n\t} finally {\n\t\t\/\/ Close the stream even if parsing throws.\n\t\tstream.close()\n\t}\n}\n<\/pre>\n<p>Output:<\/p>\n<pre>\nUNITED STATES DISTRICT COURT \nFOR THE DISTRICT OF RHODE ISLAND \n...\nIt is hereby agreed by and between the parties that the above-captioned matter be \ndismissed, with prejudice, no interest, no costs.\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn&#8217;t seem to have the ability to fill in interface &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/extracting-pdf-text-with-scala\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Extracting PDF text with 
Scala&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4],"tags":[300,417,419,480,495,546],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/999"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=999"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/999\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=999"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=999"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=999"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}