Posts

Loading PDFs in PhantomJS using PDF.js

PhantomJS is a neat WebKit wrapper that lets you write cross-platform command-line JavaScript utilities. JavaScript scripting has been common in the Windows world for as long as I can remember through Windows Script Host, but PhantomJS provides access to many new libraries worth exploring. One such library is PDF.js, a product of Mozilla Labs […]

Fixing the error “TypeError: ‘undefined’ is not a function (evaluating ‘globalScope[‘console’][‘log’].bind(globalScope[‘console’])’)”

Some libraries, like PDF.js, initialize their own logging function, which wraps console.log. If this runs in a context where Function.prototype.bind does not exist, you’ll get the following error: TypeError: ‘undefined’ is not a function (evaluating ‘globalScope[‘console’][‘log’].bind(globalScope[‘console’])’). Fixing this is actually quite simple: Mozilla provides a replacement function you can drop in (not surprising, considering how […]
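For illustration, here is a minimal sketch of what such a replacement can look like. It is not Mozilla's published code, just a simplified fallback written for this listing: the name bindFallback is made up, and the sketch ignores the edge case of calling a bound function with new. It sticks to ES5 syntax on the assumption that the runtime is an older PhantomJS build.

```javascript
// Simplified stand-in for Function.prototype.bind.
// `bindFallback` is a hypothetical name used only in this sketch.
function bindFallback(thisArg) {
  var target = this; // the function being bound
  if (typeof target !== 'function') {
    throw new TypeError('bind must be called on a function');
  }
  // Arguments supplied at bind time are prepended to every later call.
  var boundArgs = Array.prototype.slice.call(arguments, 1);
  return function () {
    var callArgs = boundArgs.concat(Array.prototype.slice.call(arguments));
    return target.apply(thisArg, callArgs);
  };
}

// Install the fallback only when the runtime has no native bind,
// so modern engines keep their own implementation.
if (!Function.prototype.bind) {
  Function.prototype.bind = bindFallback;
}
```

The fallback has to be loaded before the library that calls console.log.bind(...), otherwise the library's logging setup still runs into the missing method.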


Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used: Vagrant + VirtualBox, Node.js, node-static, Lunr.js, node-lazy, PhantomJS

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided […]

Reading the YouTube API from PhantomJS

The following code retrieves the duration of a video from the YouTube API. The output must be in JSON, because PhantomJS currently doesn’t handle the XML correctly (from email threads, it appears this will be fixed in a future release).

address = 'http://gdata.youtube.com/feeds/api/videos/' + id + '?v=2&alt=json';
page.open(address, function (status) {
    var duration = […]


Scraping AdSense Ads with PhantomJS

PhantomJS is a headless WebKit, which lets you run JavaScript in a browser from the command line. It adds API calls that facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve AdSense destination URLs and text with PhantomJS. Extracting advertisement blocks requires fairly simple CSS […]