Loading PDFs in PhantomJS using PDF.JS

PhantomJS is a neat WebKit wrapper, allowing you to write cross-platform command-line Javascript utilities. Javascript scripting has been common in the Windows world for as long as I can remember through Windows Scripting Host, but PhantomJS provides access to many new libraries worth exploring. One such library is PDF.JS, a product of Mozilla Labs which aims for pixel-perfect PDF rendering. The team appears to intend it as a PDF viewer only, but PDFs are so common, and the licensing so permissive (BSD), that it's well worth exploring.

I found that in Chrome, if you open PDF.js from a file:// URL, you are not allowed to load PDF files with XMLHttpRequest, as it's considered "cross-domain scripting." This makes it difficult to run PDF.js as a command-line utility.
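
For illustration, here is a minimal sketch of the kind of request that gets blocked; the file name document.pdf is a placeholder, assumed to sit alongside the viewer page:

// Under file://, Chrome treats this request as cross-domain and blocks it.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'document.pdf', true);
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  console.log('Loaded ' + xhr.response.byteLength + ' bytes');
};
xhr.send();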

PDF.JS is typically set up to run inside a webpage, drawing its output on a canvas element. This allows some interesting options. In a previous post, I discussed intercepting drawing commands to read the contents of tables in PDFs. A PhantomJS script is typically a standalone Javascript file which reads and writes to the filesystem, loads webpages, takes screenshots, etc., orchestrating what happens inside the loaded pages. For security reasons, communication between the script and the page is primarily limited to passing strings through callbacks.

For starters, we can write a fairly simple PhantomJS script that will load a PDF into memory:

var system = require('system');
var fs = require('fs');

var pdfPath = system.args[1];

var content = '',
    f = null;

try {
  // Read the PDF from disk into a string.
  f = fs.open(pdfPath, "r");
  content = f.read();
} catch (e) {
  console.log(e);
  phantom.exit(1);
}

if (f !== null) {
  f.close();
}

From here we want to base-64 encode the content, so that it can be passed around relatively safely; we don't want any methods throwing exceptions when they see non-printable characters. There is a built-in function called btoa [1], which turns binary data into a base-64 string. However, it does not like Unicode; fortunately the Mozilla documentation shows a workaround, seen below. The documentation also provides alternate implementations of this technique [2] which may be faster.

function utf8_to_b64( str ) {
  return window.btoa(unescape(encodeURIComponent(str)));
}

var data = utf8_to_b64(content);

Now that we have this, we have basically everything needed to send the string to PDF.js, and can inject it into the page. I'm not sure of the ideal location for it (e.g. inside a div or in the global scope of the page), but the technique below works for now. In PhantomJS, page.evaluate takes a function which runs in the scope of the page; in code it looks deceptively like a closure, but it isn't one. Thankfully, they recently added the ability to pass arguments from the PhantomJS scope into the page scope, which is how we get our PDF in.

Since we will need to retrieve the result later, we also use PhantomJS's page.onCallback handler (which is still marked "experimental"), paired with window.callPhantom inside the page; at least we no longer have to intercept console.log. This lets us test that we can successfully send data into and out of the page.
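
For reference, the relevant part of lib/index.html is assumed to look something like this: a hidden element to hold the base-64 string, matching the getElementById("pdf") call below.

<!-- Assumed markup in lib/index.html: a hidden holder for the data. -->
<div id="pdf" style="display: none"></div>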

var page = require('webpage').create();
var url = 'lib/index.html';

// Receives strings sent from the page via window.callPhantom.
page.onCallback = function (data) {
  console.log(data);
  phantom.exit();
};

page.open(url, function (status) {
  page.evaluate(function (data) {
    // Stash the base-64 PDF in the page, then echo a sample back.
    document.getElementById("pdf").innerText = data;
    window.callPhantom("Sample: " + data.substring(0, 100));
  }, data);
  console.log("Finished");
});

Sample output looks like this:

Sample: JVBERi0xLjQKJcOkw7zDtsOfCjIgMCBvYmoKPDwv

Inside the page, we can reverse the process using atob [3], which turns the base-64 string back into binary data, with one caveat. Since PDF.js normally takes a URL to a PDF, it does type-checking on the data you send it, and it recognizes ArrayBuffer objects as raw data. An ArrayBuffer might look like just another way to wrap a base-64 string, but it actually identifies the contents as binary data rather than letting bytes masquerade as a string.

// Inside the page: pull the base-64 string back out and decode it.
var dataElement = document.getElementById("pdf");
var data = dataElement.innerText;
var binary = atob(data);

I found a handy function to convert this binary string into an ArrayBuffer in a StackOverflow post [4], with further explanation in the Mozilla documentation [5].

function str2ab(str) {
  var buf = new ArrayBuffer(str.length * 2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i = 0, strLen = str.length; i < strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}
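
Since atob returns one byte per character, a Uint8Array variant avoids the two-byte padding above and can be handed straight to PDF.js. The sketch below is an assumption on my part: binaryToUint8Array is my own helper name, and PDFJS.getDocument is the library's global entry point at the time of writing, which accepts a typed array as well as a URL.

// Hypothetical helper: atob gives a string where each character
// code is a single byte (0-255), so a Uint8Array fits exactly.
function binaryToUint8Array(binary) {
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hand the raw bytes to PDF.js and report back to PhantomJS.
PDFJS.getDocument(binaryToUint8Array(binary)).then(function (pdfDoc) {
  window.callPhantom("Pages: " + pdfDoc.numPages);
});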

Now that we've completed this, we have a fairly solid technique for transporting small PDFs, at least until one of the teams involved finds a way to make this easier. It is reminiscent of QBasic programming, where you had to embed a block of encoded assembler in a program to add simple functionality like rendering a mouse pointer.

Beyond this, there are further issues requiring a compatibility shim for PDF.JS, which will be covered later, once I figure out how to fix them. If you're interested in the outcome, please consider following my GitHub library, which pulls tables from PDFs into CSV files.

  1. https://developer.mozilla.org/en-US/docs/Web/API/window.btoa
  2. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding#Solution_.232_.E2.80.93_rewriting_atob()_and_btoa()_using_TypedArrays_and_UTF-8
  3. https://developer.mozilla.org/en-US/docs/Web/API/window.atob
  4. http://stackoverflow.com/questions/6965107/converting-between-strings-and-arraybuffers
  5. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding#Appendix.3A_Decode_a_Base64_string_to_Uint8Array_or_ArrayBuffer