Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used:
Vagrant + VirtualBox, Node.js, node-static, Lunr.js, node-lazy, PhantomJS

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (millions of them), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, so I decided to do a proof of concept with newer JavaScript tools. This setup runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse the PDFs. It also builds a full-text index, the beginning of a larger ingestion process.

This task splits into three pieces. I run a separate server for each – I’m not sure whether the Node.js community has a preferred architecture, but this feels like a natural fit.

  • Queue server
  • Static file share for PDFs
  • Saving results

There is also a fourth server: a Python server that serves the static JavaScript pages coordinating the work. Outside of development, that coordination code would instead run as a console application under PhantomJS.
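For reference, driving the coordination page under PhantomJS might look roughly like the sketch below. The worker URL is an assumption (whatever host and port the Python server actually uses), and the page is expected to reload itself after each document, so the script simply opens it and lets it run:

// phantom-driver.js – a hypothetical PhantomJS wrapper around the coordination page
var page = require('webpage').create();

// pass the in-page console output through so progress is visible in the terminal
page.onConsoleMessage = function (msg) {
  console.log(msg);
};

// the URL below is an assumption – point it at the server hosting the worker page
page.open('http://192.168.11.37:8000/worker.html', function (status) {
  if (status !== 'success') {
    console.log('failed to load the worker page');
    phantom.exit(1);
  }
  // on success, do nothing: the page reloads itself after each document
});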

The Node.js servers all run as virtual machines on my developer workstation, configured using Vagrant and VirtualBox – these could easily be moved onto separate hosts. The virtual machines communicate with fixed IPs, and each is assigned a different port on the same physical host, so that any laptop on the network can join in the fun.

Once each of the simple servers is running, the coordination code can be developed in a browser, which lets you work with the Chrome developer tools. This comes at a cost – you have to configure all the servers to send Access-Control-Allow-Origin headers to allow cross-domain requests. This is a nuisance that would not be present in other architectures. One alternative is to put everything behind a proxy like Squid, but I haven’t tried this yet.

The PDFs are an extract from PACER (U.S. court cases), stored on a NAS: about 1.2 MM PDFs, ½ TB of storage. They are randomly distributed among numbered folders, two levels deep, each level named with two hex digits, e.g. \0f\2a. Without this scheme there would be one folder with over a million files in it, which is slow to traverse on NTFS and runs into per-directory limits on the ext filesystem my NAS uses. As an added benefit, this structure partitions the data evenly, since the buckets are assigned at random.
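The PDFs arrived already bucketed, but for illustration, here is one way such a two-level hex path could be assigned to a file – a hash of the name spreads files about as evenly as a random draw (this helper is hypothetical, not part of the pipeline):

var crypto = require('crypto');

// e.g. bucketPath('some-case.pdf') might return "0f/2a"
function bucketPath(filename) {
  // hash the name and use the first four hex digits as two directory levels
  var hex = crypto.createHash('md5').update(filename).digest('hex');
  return hex.substr(0, 2) + '/' + hex.substr(2, 2);
}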


I generated a list of just the PDFs in this data set using a JavaScript replacement for ls that I wrote (a sketch of such a walker follows the queue server code below); producing the list took several hours. This list tells the queueing server which files need to be processed. All the queue server does is print out the next document to parse:

var http = require('http'),
    fs   = require('fs'),
    lazy = require('lazy');

// read the pre-generated file list into memory, one path per line
var files = [];

new lazy(fs.createReadStream('files.txt'))
  .lines
  .forEach(function (line) {
    files.push(line.toString());
  });

var lineNum = 0;

// every request hands out the next path in the list
http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Access-Control-Allow-Origin' : '*'
  });

  res.write(files[lineNum] || '');
  lineNum++;
  res.end('\n');
}).listen(80);
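The ls replacement itself isn’t shown here, but a minimal sketch of such a walker might look like the following – the root path is an assumption, and the synchronous fs calls are acceptable because this is a one-off batch job whose output is redirected to files.txt:

var fs = require('fs'),
    path = require('path');

// recursively walk the share and print every PDF path, one per line
function walk(dir) {
  fs.readdirSync(dir).forEach(function (name) {
    var full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      walk(full);
    } else if (/\.pdf$/i.test(name)) {
      console.log(full);
    }
  });
}

walk('/vagrant/');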

The file named by the queue can then be retrieved from a second server, which uses a library called node-static to serve the PDFs:

var static = require('node-static');

// every response carries the CORS header so browser-based workers can fetch PDFs
var headers = {
    'Access-Control-Allow-Origin' : '*'
};

var file =
  new(static.Server)
     ('/vagrant/',
     {'headers': headers});

require('http').createServer(
  function (request, response) {
    request.addListener('end',
      function () {
        file.serve(request, response,
          function (err, result) {
            // report missing or unreadable files instead of crashing the server
            if (err) {
              response.writeHead(err.status, err.headers);
              response.end();
            }
        });
    }).resume();
}).listen(80);

Finally, a third server saves the results in whatever form is needed (errors, JSON, text output):

var http = require('http'),
      fs = require('fs'),
      qs = require('querystring');

http.createServer(function (req, res) {
  var body = '';
  req.on('data', function (chunk) {
    body += chunk;
  });

  req.on('end', function() {
    res.writeHead(200, {
      'Content-Type': 'text/plain',
      'Access-Control-Allow-Origin' : '*'
    });

    // parse the form-encoded body first, then pull the filename pieces out of it
    var data = qs.parse(body);
    var loc = data.loc || '';
    var key = data.key || '';
    var ext = data.ext || '';

    var filename = '/vagrant_data/' + loc + '/' + key + '.' + ext;

    fs.writeFile(filename, data.data || '', function (err) {
      if (err) throw err;
    });

    res.write(filename);
    res.end('\n');
  });
}).listen(80);

Note that the port a server listens on inside its virtual machine is unrelated to the port you connect to from outside – when Vagrant builds the virtual machine, it lets you set up port forwarding.

Once the above servers are configured, any machine that joins your network can participate – conceivably turning old laptops into an inefficient server farm. The queueing server could also distribute packets of work in JSON instead of bare URLs, which would support heterogeneous work payloads (text extraction + map + reduce work), as sketched below.
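Nothing in the current code implements this, but a hypothetical work packet might look like the following, with the queue response growing from a bare URL into a small job description:

// a hypothetical JSON work packet the queue server could hand out
var packet = {
  type: 'extract-text',                                   // or 'map', 'reduce', ...
  pdf: 'http://192.168.11.37:8001/0f/2a/some-case.pdf',   // illustrative path
  resultServer: 'http://192.168.11.37:8005/',
  options: { buildIndex: true }
};

res.write(JSON.stringify(packet));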

The processing code does all the interesting work – opening a PDF, extracting its text, building full-text data with lunr.js, and saving the processed results to the server. I like to test this in a browser so that it’s easy to debug, but the intent is to run it under PhantomJS. The code below drives PDF.js to extract text:

var queueUrl = 'http://192.168.11.37:8002/';

xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", queueUrl, false);
xmlhttp.send();

var pdfName = xmlhttp.responseText;
var pdfUrl = 'http://192.168.11.37:8001/' + pdfName; 
var pdf = PDFJS.getDocument(pdfUrl);

var data = '';

// if the PDF fails to load or parse, reload the page so this
// worker moves on to the next item in the queue
pdf.errbacks[0] =
function() {
  document.location.reload(true);
};

pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);

    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left

          // Defect here - sometimes the output has words
          // concatenated where they shouldn't be. But, if
          // you just add spaces you'll get spaces within 
          // words.
          for (var i = 0; i < text.bidiTexts.length; i++) {
            data += text.bidiTexts[i].str;
          }

          data += ' ';

          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
             ... output processing goes here ...
          }
        }
      }
    }(j);

    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();

      content.then(processPageText(pageData, content));
    }

    page.then(processPage);
 }
});
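The snippet above targets the PDF.js build of the time (pdfInfo, bidiTexts, errbacks). With a recent PDF.js release the same extraction looks roughly like the sketch below – the pdfjsLib global and the exact promise shape vary by version, so treat this as an approximation rather than a drop-in replacement:

// sketch for a newer PDF.js build: gather the text of every page into one string
pdfjsLib.getDocument(pdfUrl).promise.then(function (pdf) {
  var pages = [];
  for (var j = 1; j <= pdf.numPages; j++) {
    pages.push(
      pdf.getPage(j).then(function (page) {
        return page.getTextContent();
      }).then(function (content) {
        // each item carries a text fragment in item.str
        return content.items.map(function (item) { return item.str; }).join(' ');
      })
    );
  }
  return Promise.all(pages);
}).then(function (pageTexts) {
  var data = pageTexts.join(' ');
  // ... output processing goes here ...
});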

And, finally, the code which saves results back to the server:

// by default mirror the PDF's own folder structure...
var loc = pdfName.substr(0, pdfName.lastIndexOf('/'));
loc = 'data/'; // ...but here everything is stored under data/ (configure as needed)

var key = pdfName.substr(0, pdfName.lastIndexOf('.'))
                 .substr(pdfName.lastIndexOf('/') + 1) + 
                 '.text-rendition';

$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: key, ext: 'txt', data: data }
).done(
  function() { 
   document.location.reload(true);
 }
);

This can then be put into a full-text index and stored, to be combined at a later date:

var index = lunr(function () {
    this.field('text');
    this.ref('id');
});

index.add({
  text: data,
  id: pdfName
});

var serializedIndex = 
  JSON.stringify(index.toJSON());

$.post(
  'http://192.168.11.37:8005',
  { loc : loc, key: pdfName + '-idx', ext: 'json', data: serializedIndex }
).done(
  function() { 
   document.location.reload(true);
 }
);
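A serialized index can later be pulled back from the results server and queried. A minimal sketch, assuming the JSON for one document has already been fetched into serializedIndex:

// reload one of the per-document indexes and run a query against it
var loaded = lunr.Index.load(JSON.parse(serializedIndex));

var hits = loaded.search('plaintiff');
hits.forEach(function (hit) {
  console.log(hit.ref, hit.score);  // hit.ref is the pdfName used as the id
});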

In my initial test this was really slow – only 120 PDFs per minute. At that rate, a 1.2 MM PDF data set takes about 10,000 minutes, roughly a week of continuous processing. For 20 years of U.S. litigation this would take four months, without any additional processing. Surprisingly, this is a CPU-bound operation; unsurprisingly, it generates a ton of heat. Cursory inspection suggests the biggest performance gains would come from improving the NAS connection and caching the PDF.js libraries.

From a developer perspective, it is challenging to identify which code is synchronous and which isn’t, which creates many subtle bugs. Each JavaScript library models asynchrony a bit differently, and JavaScript’s lack of type information makes the problem worse.

Error handling is also difficult: not only do errors need to be logged, but you need to be able to re-run the ingestion process on a subset of the entire document set. Currently this isn’t handled, which should be a caution to anyone reproducing this setup.

Operationally it’s easy to add machines: almost all the work is done on the front end. You can typically run several dozen workers on a machine simply by opening console windows or browser tabs. A system like this ought to be more introspective, watching the load on each host and responding accordingly.
