, , , ,

Full-Text Indexing PDFs in Javascript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information“). Most court records are PDFS available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs, representing 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them. An early step in this process is extracting contents from e-filed PDFs, which is later run through stages of an NLP prcocess – tokenizing, tagging parts of speech, recognizing entities, and then reporting (if you’re interested in what this entails, check out Natural Language Processing in Python for a primer – read my review here)

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.

To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for pdf.js is instructive, in that the Mozilla developers use browser features that aren’t in common use. Web Workers, for instance, let you set up background processing threads.

The pdf.js APIS make heavy use of Promises, which hold references in code to operations that haven’t completed yet. You operate on them using callbacks:

var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 // this code is called once the PDF is ready
});

This API seems immature yet- ideally you should be able to do promise.then(f(x)).then(g(x)).then(h(x)) etc, but that isn’t yet available.

For rendering PDFs the Promise pattern makes a lot of sense, as it leaves room for parallelizing the rendering process. For merely extracting the text from a PDF it feels like a lot of work – you need to be confident that your callbacks run in order and track which one is last.

The following demonstrates how to extract all the PDF text, which is then printed to the browser console log:

‘use strict’;
var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
          for (var i = 0; i < text.bidiTexts.length; i++) {
            str += text.bidiTexts[i].str;
          }
 
          if (pageData.pageInfo.pageIndex === 
              maxPages - 1) {
            // later this will insert into an index
            console.log(str);
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});

It’s not trivial to identify where headings and images are. This would require hooking into the rendering code, and possibly a deep understanding of PDF commands (PDFs appear to be represented as stream of rendering commands, similar to RTF).

Lunr
Creating a Lunr index and adding text is straightforward- all the APIs operate on JSON bean objects, which is a pleasantly simple API:

doc1 = {
    id: 1,
    title: 'Foo',
    body: 'Foo foo foo!'
  };
 
doc2 = {
    id: 2,
    title: 'Bar',
    body: 'Bar bar bar!'
  } 
 
doc3 = {
    id: 3,
    title: 'gary',
    body: 'Foo Bar bar bar!'
  }
 
index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
  })
 
// Add documents to the index
index.add(doc1)
index.add(doc2)
index.add(doc3)

Searching is simple – one neat tidbit I found is that you can inspect the index easily, since it’s just a JS object:

// Run a search
index.search(“foo”)
 
// Inspect the actual index to see which docs match a term
index2.tokenStore.root.f.o.o.docs

When I was first introduced to full-text indexing, I was confused by what is meant by a “document” – this generalizes beyond a PDF or Office document to any database row, possibly including large blobs of text.

Full-text search would be pretty dumb if you had to build the index every time, and Lunr makes it really easy to serialize and deserialize the index itself:

var serializedIndex = JSON.stringify(index1.toJSON())
var deserializedIndex = JSON.parse(serializedIndex)
var index2 = lunr.Index.load(deserializedIndex)

Index.toJSON also returns a “bean” style object (not a string). I’ve never seen an API like this, and I really like the idea – it gives you a clean Javascript object with only the data that requires serialization.

The following are attributes of the index:

  • corpusTokens – Sorted list of tokens
  • documentStore – list of each document – catenate
  • fields – The fields used to describe each document (similar to database columns)
  • pipeline – The pipeline object used to process tokens
  • tokenStore – Where and how often words are referenced in each document

One great thing about this type of index is that the work can be done in parallel and then combined as a map-reduce job. Only three entries from the above object need to be combined, as “fields” and “pipeline” are static. The following demonstrates the implementation of the reduction step (note jQuery is referenced):

(function reduce(a, b) { 
  var j1 = a.toJSON(); 
  var j2 = b.toJSON();
 
  // The "unique" function does uniqueness by sorting,
  // which we need here.
  var corpusTokens = 
      $.unique(
          $.merge(
              $.merge([], j1.corpusTokens), 
                           j2.corpusTokens));
 
  // It's important to create new arrays and
  // objects throughout, or else you modify 
  // the source indexes, which is disastrous.
  var documentStore = 
     {store: $.extend({}, 
                      j1.documentStore.store,
                      j2.documentStore.store),
      length: j1.documentStore.length + j2.documentStore.length};
 
  var jt1 = j1.tokenStore;
  var jt2 = j2.tokenStore;
 
  // The 'true' here triggers a deep copy
  var tokenStore = {
    root: $.extend(true, {}, jt1.root, jt2.root),
    length: jt1.length + jt2.length
  };
 
  return {version: j1.version,
          fields: $.merge([], j1.fields), 
          ref: j1.ref, 
          documentStore: documentStore, 
          tokenStore: tokenStore,
          corpusTokens: corpusTokens, 
          pipeline: $.merge([], j1.pipeline)}; 
})(index1, index2)

I tested this by creating three indexes: index1, index2, and index3. index1 is {doc1}, index2 is {doc2, doc3}, and index3 is {doc1, doc2, doc3}. To test the code, you need simply diff:

JSON.stringify(index3.toJSON())
 
JSON.stringify(combine(index1, index2))

Possibilities

Overall this technique has a lot of wasted network I/O, making this seem silly. On the other hand, there are listings on ebay and fiverr selling for “traffic”, which typically comes from pop-unders, botnets, hidden iframes, etc. You can easily find listings like “20,000 hits for $3”, and less in bulk. This is typically cheap because it has little commercial value other than perpetrating various forms of fraud.

You’d need a cheap VM with loads of bandwidth to use as a proxy, as well as publically available data – you couldn’t use this as a scraping technique due to browser protections against cross-domain requests. You’d also need to generate unique document IDs in a unique fashion, perhaps using the original URL.

If a traffic source runs on modern browsers, one could use this as a source of potentially cheap and unlimited processing power, even for point of combining the indexes, although provisions must be made for the natural instability of the system.

If you enjoyed this, you might also enjoy the following:

6 replies
  1. Frode Danielsen
    Frode Danielsen says:

    “PDFs appear to be represented as stream of rendering commands, similar to RTF”

    Well, PDF is basically just a wrapper around PostScript which is a terse set of rendering commands. PDF itself is fairly straightforward, but handling PostScript is much more complex. At it’s very basic a PDF is just marked up objects, which contain dictionaries or byte streams. Simplifying a bit here, but if you think PDF is complex I think it’s PostScript that you’re really thinking of.

    Reply
  2. Brendan Dahl
    Brendan Dahl says:

    @Gary
    We’ll have full promise/a+ spec compliant promises soon.

    @Frode
    PDF and PostScript share some similarities but I wouldn’t call PDF a wrapper around PostScript (if anything it would *kind of* be a subset). PostScript is a full blown programming language.

    At it’s core PDF is simple, but when you get into it, it is a very complex format. The spec itself is 756 pages and links to 20+ other specs.

    Reply

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *