Extracting PDF text with Scala

This example uses Apache Tika to extract the text content of a PDF for use in other systems. It also demonstrates some basic differences from Java: multi-line (triple-quoted) strings (hooray!), which need no backslash escaping; wildcard imports; primitive arrays; and what implementing an interface looks like. The big downside is that the Eclipse Scala plugin doesn’t seem to be able to fill in interface methods on an object.
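
As a quick aside on the string syntax mentioned above, here is a minimal sketch of how triple-quoted strings behave. The object name and the `\\server\share\docs` path are made up for illustration and are not part of the PDF example.

```scala
object TripleQuotedStrings {
  // Inside triple quotes a backslash is literal, so a Windows-style path needs no doubling.
  val raw: String = """\\server\share\docs"""

  // The same text as an ordinary string literal, with every backslash escaped.
  val escaped: String = "\\\\server\\share\\docs"

  // A triple-quoted string may span lines; stripMargin drops everything up to each "|".
  val multi: String =
    """first line
      |second line""".stripMargin

  def main(args: Array[String]): Unit = {
    println(raw == escaped) // both literals describe the same characters
    println(multi)
  }
}
```

In the listing that follows, the strings are single-line, so the main benefit there is the unescaped backslashes in the UNC path.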

import java.io._
 
import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.xml.sax._
 
object pdfHandler extends ContentHandler {
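	// SAX hands over a shared buffer; only the slice from start to start + length belongs to this call.
	def characters(ch: Array[Char], start: Int, length: Int): Unit = {
		println(new String(ch, start, length))
	} 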
 
	// The remaining callbacks are required by ContentHandler but are not needed for plain text.
	def endDocument(): Unit = {}
 
	def endElement(uri: String, localName: String, qName: String): Unit = {}
 
	def endPrefixMapping(prefix: String): Unit = {}
 
	def ignorableWhitespace(ch: Array[Char], start: Int, length: Int): Unit = {}
 
	def processingInstruction(target: String, data: String): Unit = {}
 
	def setDocumentLocator(locator: Locator): Unit = {}
 
	def skippedEntity(name: String): Unit = {}
 
	def startDocument(): Unit = {}
 
	def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {}
 
	def startPrefixMapping(prefix: String, uri: String): Unit = {}
}
 
object pdf extends App {
	// Triple-quoted strings keep backslashes literal, so the UNC path needs no escaping.
	val folder = """\\nas\Files\Data\pacer2\"""
	val subfolder = """\00\00\gov.uscourts.rid.6064\"""
	val file = """gov.uscourts.rid.6064.20.0.pdf"""
 
	val pdf: PDFParser = new PDFParser()
 
	val stream: InputStream = new FileInputStream(folder + subfolder + file)
	val handler: ContentHandler = pdfHandler
	val metadata: Metadata = new Metadata()
	val context: ParseContext = new ParseContext()
 
	// Close the stream even if parsing throws.
	try {
		pdf.parse(stream, handler, metadata, context)
	} finally {
		stream.close()
	}
}

Output:

UNITED STATES DISTRICT COURT 
FOR THE DISTRICT OF RHODE ISLAND 
...
It is hereby agreed by and between the parties that the above-captioned matter be 
dismissed, with prejudice, no interest, no costs.
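
The handler above spells out every `ContentHandler` method even though only `characters` does any work. One possible alternative, sketched here under the assumption that you only need the JDK's own SAX classes, is to extend `org.xml.sax.helpers.DefaultHandler`, which supplies empty implementations, and to collect the text into a buffer instead of printing it. The names `TextCollector`, `TextCollectorDemo`, and `extract` are hypothetical, and the demo parses a small XML string rather than a PDF so that it runs without Tika.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: DefaultHandler provides no-op versions of the other callbacks.
class TextCollector extends DefaultHandler {
  private val buffer = new StringBuilder

  // Only the slice starting at `start` with `length` chars belongs to this event;
  // the rest of the array may hold unrelated data.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    buffer.append(new String(ch, start, length))
  }

  def text: String = buffer.toString
}

object TextCollectorDemo {
  // Runs the collector over an XML string with the JDK's SAX parser.
  def extract(xml: String): String = {
    val collector = new TextCollector
    val in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))
    try SAXParserFactory.newInstance().newSAXParser().parse(in, collector)
    finally in.close()
    collector.text
  }

  def main(args: Array[String]): Unit =
    println(extract("<doc><p>Hello, </p><p>world</p></doc>")) // prints "Hello, world"
}
```

Because `DefaultHandler` implements `ContentHandler`, an instance of such a class could be passed to `pdf.parse` in place of `pdfHandler`, and the collected text read back afterwards.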


3 replies
  1. Ronan LG says:

    I don’t understand where the core code is.
    All the functions of pdfHandler are empty.
    Am I missing something?

    • Gary says:

      The “characters” function is filled in; you need the others to satisfy the interface definition. The purpose of this example is to generate a text rendition of the PDF, and for that purpose there is a lot of other data you can ignore.
