Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used:
Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, phantomjs

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new JavaScript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, as the beginning of a larger ingestion process.

This task splits into three pieces. I run a separate server for each – I’m not sure whether the Node.js community has a preferred architecture, but this feels like a natural fit.

  • Queue server
  • Static file share for PDFs
  • Saving results

There is a fourth server: a Python server that serves the static JavaScript pages that coordinate the work. Outside of development, this would be run as a console application with PhantomJS.

The Node.js servers all run as virtual machines on my developer workstation, configured using Vagrant and Virtualbox – these could easily be moved onto separate hosts. The virtual machines communicate with fixed IPs, and each is assigned a different port on the same physical host, so that any laptop on the network can join in the fun.

Once each of the simple servers is running, the coordination code can be written in a browser, which lets you work using the Chrome developer tools. This comes at a cost: you have to configure all the servers to send the Access-Control-Allow-Origin header to allow cross-domain requests. This is a nuisance that would not be present in other architectures. One alternative is to put everything behind a proxy like Squid, but I haven’t tried this yet.

The PDFs are an extract from PACER (court cases) stored on a NAS: about 1.2 MM PDFs, taking up ½ TB of storage. They are randomly distributed among numbered folders nested two levels deep, each level named with two hex digits, e.g. \0f\2a. Without this layout there would be one folder with over a million files in it, which is slow to traverse on NTFS and impractical on ext4, the default on my NAS (ext4's documented limit of roughly 64,000 is on subdirectories per directory rather than files, but listing a directory with over a million entries is still very slow). As an added benefit, this structure partitions the data evenly, as the partitions are generated at random.
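
A minimal sketch of how such a layout could be generated, assuming Node.js. The helper names `randomShard` and `shardedPath` are hypothetical and not taken from the original tooling. Two random bytes give 256 × 256 = 65,536 leaf folders, so 1.2 MM files average roughly 18 per folder.

```javascript
var crypto = require('crypto');
var path = require('path');

// Hypothetical helper: returns a random two-level folder such as "0f/2a"
// (the separator depends on the platform).
function randomShard() {
  var bytes = crypto.randomBytes(2);
  return path.join(bytes.toString('hex', 0, 1), bytes.toString('hex', 1, 2));
}

// Hypothetical helper: places a file name inside a random shard under root.
function shardedPath(root, fileName) {
  return path.join(root, randomShard(), fileName);
}

var example = shardedPath('pacer', 'case.pdf');
```

Because the shard is chosen at random rather than derived from the file name, the list of generated paths has to be kept somewhere, which is the role the file list plays here.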


I generated a list of just the PDFs in this data set with a JavaScript replacement for ls that I wrote, which took several hours. This list tells the queueing server which files need to be processed. All the server does is print out the next document to parse:

var http = require('http'),
    fs = require('fs'),
    lazy = require('lazy');

// Load the list of PDF paths, one per line, into memory.
var files = [];
new lazy(fs.createReadStream('files.txt'))
  .lines
  .forEach(function (line) {
    files.push(line.toString());
  });

var lineNum = 0;

// Each request hands out the next path in the list.
http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Access-Control-Allow-Origin' : '*'
  });

  // Send an empty line once the list is exhausted.
  res.write(lineNum < files.length ? files[lineNum] : '');
  lineNum++;
  res.end('\n');
}).listen(80);

This document can then be retrieved from a second server, which uses a library called node-static to serve files:

var static = require('node-static');

var headers = {
    'Access-Control-Allow-Origin' : '*'
};

// Serve everything under /vagrant/ with the CORS header attached.
var file = new static.Server('/vagrant/', { headers: headers });

require('http').createServer(function (request, response) {
  request.addListener('end', function () {
    file.serve(request, response, function (err, result) {
      if (err) {
        // e.g. a 404 for a missing PDF: report it rather than serving again
        response.writeHead(err.status, err.headers);
        response.end();
      }
    });
  }).resume();
}).listen(80);

Finally, a third server saves the results in whatever format is needed (errors, JSON, text output):

var http = require('http'),
    fs = require('fs'),
    qs = require('querystring');

http.createServer(function (req, res) {
  var body = '';
  req.on('data', function (chunk) {
    body += chunk;
  });

  req.on('end', function () {
    res.writeHead(200, {
      'Content-Type': 'text/plain',
      'Access-Control-Allow-Origin' : '*'
    });

    // Parse the form-encoded body before reading any fields from it.
    var data = qs.parse(body);
    var loc = data.loc || '';
    var key = data.key || '';
    var ext = data.ext || '';

    var filename = '/vagrant_data/' + loc + '/' + key + '.' + ext;

    fs.writeFile(filename, data.data || '', function (err) {
      // Log failures instead of throwing, which would crash the server.
      if (err) console.error(err);
    });

    res.write(filename);
    res.end('\n');
  });
}).listen(80);

Note that the port a server listens on inside its virtual machine is unrelated to the port you connect to: when Vagrant builds the virtual machine, it lets you set up port forwarding.

Once the above servers are configured, any machine that joins your network can participate, conceivably enabling old laptops to become an inefficient server farm. The queueing server could distribute packets of work in JSON instead of URLs, which would support heterogeneous work payloads (text extraction + map + reduce work).
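
As a rough illustration of the JSON idea, the sketch below shows one way a work packet could be built and dispatched. The packet shape (`type`, `path`) and the function names `makeWorkPacket` and `dispatch` are assumptions made for this example; the queue server shown earlier only returns a plain path.

```javascript
// Hypothetical work-packet format: a type tag plus the path to work on.
function makeWorkPacket(type, path) {
  return JSON.stringify({ type: type, path: path });
}

// Hypothetical client-side dispatch: pick a handler based on the packet type.
function dispatch(packetJson, handlers) {
  var packet = JSON.parse(packetJson);
  if (!Object.prototype.hasOwnProperty.call(handlers, packet.type)) {
    throw new Error('No handler for work type: ' + packet.type);
  }
  return handlers[packet.type](packet);
}

// Usage: two kinds of work share one queue.
var handlers = {
  extract: function (p) { return 'extract text from ' + p.path; },
  index: function (p) { return 'index ' + p.path; }
};
var result = dispatch(makeWorkPacket('extract', '0f/2a/example.pdf'), handlers);
```

Keeping the type tag in the packet means new kinds of work can be added by registering another handler, without changing the queue server.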

The processing code does all the interesting work – opening a PDF, extracting text, building full text data with lunr.js, and saving processed results to the server. I like to use a browser to test this so that it’s easy to debug, but the intent is to run within phantomjs. The code below communicates with pdf.js to extract text:

var queueUrl = 'http://192.168.11.37:8002/';
 
var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", queueUrl, false);
xmlhttp.send();
 
var pdfName = xmlhttp.responseText;
var pdfUrl = 'http://192.168.11.37:8001/' + pdfName; 
var pdf = PDFJS.getDocument(pdfUrl);
 
var data = '';
 
pdf.errbacks[0] = 
function() {
  document.location.reload(true);
};
 
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
 
          // Defect here - sometimes the output has words
          // concatenated where they shouldn't be. But, if
          // you just add spaces you'll get spaces within 
          // words.
          for (var i = 0; i < text.bidiTexts.length; i++) {
            data += text.bidiTexts[i].str;
          }
 
          data += ' ';
 
          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
            // ... output processing goes here ...
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});
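
The loop above appends every page to one shared string and starts output processing when the last page's callback runs. If page callbacks can complete out of order (not verified for this version of pdf.js), text could be appended in the wrong order or output could start before earlier pages have arrived. One possible refinement, sketched here with a hypothetical helper named `createPageCollector`, stores text by page number and fires only once every page has reported. It assumes each page reports exactly once.

```javascript
// Hypothetical helper: collects text for each page and calls onDone once
// every page has reported, regardless of the order in which they finish.
function createPageCollector(numPages, onDone) {
  var pages = new Array(numPages);
  var remaining = numPages;

  // pageNumber is 1-based, matching the loop counter used with getPage().
  return function addPage(pageNumber, text) {
    pages[pageNumber - 1] = text;
    remaining--;
    if (remaining === 0) {
      onDone(pages.join(' '));
    }
  };
}

// Usage: pages arriving out of order are still joined in page order.
var output = null;
var addPage = createPageCollector(3, function (fullText) { output = fullText; });
addPage(3, 'third');
addPage(1, 'first');
addPage(2, 'second');
// output is now 'first second third'
```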

And, finally, the code which saves results back to the server:

var loc = pdfName.substr(0, pdfName.lastIndexOf('/')); 
loc = 'data/'; // can configure where final data is stored
 
var key = pdfName.substr(0, pdfName.lastIndexOf('.'))
                 .substr(pdfName.lastIndexOf('/') + 1) + 
                 '.text-rendition';
 
$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: key, ext: 'txt', data: data }
).done(
  function() { 
   document.location.reload(true);
 }
);

This can then be put into a full-text index and stored, to be combined at a later date:

var index = lunr(function () {
    this.field('text');
    this.ref('id');
});
 
index.add({
  text: data,
  id: pdfName
});
 
var serializedIndex = 
  JSON.stringify(index.toJSON());
 
$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: pdfName + '-idx', ext: 'json', data: serializedIndex }
).done(
  function() { 
   document.location.reload(true);
 }
);

In my initial test this was really slow: only 120 PDFs per minute. For a 1.2 MM PDF data set, this will take about ten days. For 20 years of U.S. litigation this would take four months, without any additional processing. Surprisingly this is a CPU-bound operation; unsurprisingly that generates a ton of heat. Cursory inspection suggests the biggest performance gains would come from improving the NAS connection and caching of pdf.js libraries.
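
The running-time estimate is simple arithmetic, sketched below with a hypothetical helper named `estimateDays` and the document count and throughput quoted above. The result is a lower bound: it assumes the rate is sustained around the clock with no reloads, failures, or unusually slow documents.

```javascript
// Hypothetical helper: days needed to process `count` documents at a
// sustained rate of `perMinute` documents per minute.
function estimateDays(count, perMinute) {
  var minutes = count / perMinute;
  return minutes / (60 * 24);
}

// 1.2 MM PDFs at 120 PDFs per minute on a single worker.
var days = estimateDays(1200000, 120);

// The same data set split evenly across four equally fast workers.
var daysWithFourWorkers = estimateDays(1200000, 120 * 4);
```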

From a developer perspective, it is challenging to identify which code is synchronous and which isn’t, creating many subtle bugs. Each JavaScript library models this a bit differently, and JavaScript’s lack of typing information makes the problem worse.

Error handling is also difficult: not only do errors need to be logged, but you also need to be able to re-run the ingestion process on a subset of the entire document set. Currently this isn’t handled, which should be a caution to anyone reproducing this.
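
One way the re-run could be approached, as a sketch only: compare the full file list with a list of paths that finished successfully and queue whatever is left. The helper name `remainingFiles` is hypothetical, and how the completed list gets recorded (for example by the results server) is an assumption outside this snippet.

```javascript
// Hypothetical helper: given every queued path and the paths that finished
// successfully, return the ones that still need (re)processing.
function remainingFiles(allFiles, completedFiles) {
  var done = Object.create(null); // no inherited keys to collide with paths
  completedFiles.forEach(function (f) { done[f] = true; });
  return allFiles.filter(function (f) { return !done[f]; });
}

// Usage: only the unfinished documents go back into the queue.
var todo = remainingFiles(
  ['0f/2a/a.pdf', '0f/2a/b.pdf', '1c/00/c.pdf'],
  ['0f/2a/b.pdf']
);
// todo is ['0f/2a/a.pdf', '1c/00/c.pdf']
```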

Operationally it’s easy to add machines: almost all the work is done on the front-end. You can typically open several dozen processes on a machine by simply opening console windows or browser tabs. A system like this ought to be more introspective and watch the load on the host and respond accordingly.
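
A minimal sketch of what watching the load could look like, assuming a small Node.js supervisor process runs on each worker host and decides whether to launch another worker. The function names and the 0.9 threshold are hypothetical choices for illustration, not part of the original setup.

```javascript
var os = require('os');

// Hypothetical policy: there is spare capacity while the load average per
// CPU core stays below a threshold.
function hasSpareCapacity(loadAverage, cpuCount, maxLoadPerCpu) {
  return (loadAverage / cpuCount) < maxLoadPerCpu;
}

// Checks the current host using the 1-minute load average.
// Note: os.loadavg() reports zeros on Windows, so this always says yes there.
function hostHasSpareCapacity() {
  return hasSpareCapacity(os.loadavg()[0], os.cpus().length, 0.9);
}
```

A supervisor could call `hostHasSpareCapacity()` on a timer and open another PhantomJS process only when it returns true.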


6 comments ↓

#1 Geoffrey Booth on 07.16.13 at 3:19 am

Hi Gary,

This is very interesting and useful. In your efforts at understanding PDF.js, did you happen to figure out how to parse not just a PDF’s raw text but also the X and Y coordinates of each block of text?

To give you an example, imagine if instead of court records you were parsing a mountain of PDF invoices. Say the invoices were all generated from the same form, and the invoicing company was always in a left column and the invoiced customer was always in a right column. Parsing the X coordinates, to see which blocks of text were close to the left edge of the page and which started near the middle of the page, would enable the parser to pluck out the two addresses correctly; while a stream of text from the entire page would mix them together.

Another property of text boxes that would be useful is basic font formatting, such as font size or bold/italics. That would enable picking out headings, which could be used to create bookmarks or tables of contents of documents, among other uses.

I’m sure this functionality must be within PDF.js somewhere, I just can’t seem to tease it out. Have you had any more luck than I have?

#2 Gary on 07.16.13 at 11:29 am

Yeah, so the reason I didn’t do that is that it’s more difficult. From what I could tell you might be able to get access to a stream of PDF commands and have a callback for each (that’s how they do the rendering).

Having now parsed these PDFs I’m noticing that this technique works fine for most of them, but in some cases there are issues with missing or extra spaces. A possible alternative to what you are describing is to include some/all of the raw PDF commands in the text rendition I generated, then use them as tokens, which might provide data to improve the accuracy of entity recognition similar to what you described.

#3 Rod Boorom on 09.03.13 at 4:35 pm

You really know your stuff. Great post.

#4 Lujaw on 01.16.14 at 6:05 am

thanks for the awesome article..

#5 Mulder on 01.18.14 at 1:28 pm

Hi. Did you end up solving the following problem?
// Defect here – sometimes the output has words
// concatenated where they shouldn’t be. But, if
// you just add spaces you’ll get spaces within
// words.

I have a lot of legal documents that are not so well structured internally. Text objects begin and end at arbitrary places, often within words. This makes properly recovering whitespace crucial.

I noticed pdf.js now has a text search feature and it works great so it must be dealing with that somehow. So, I’m just interested if your project did evolve to take advantage of that.

#6 Gary on 02.01.14 at 3:07 pm

My project has evolved to use pdf2json, which uses pdfjs internally, so to the extent those libraries fix things underneath it gets all the improvements, but I haven’t gone back and re-tested it.
