Entries by Gary


Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used: Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, phantomjs. Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided […]


My First A/B Test… with Results

A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast […]


Full-Text Indexing PDFs in Javascript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information”). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the […]


Lessons Learned from 0 to 40,000 Readers

Starting Out: I started writing a little over a year ago, after finding “Technical Blogging” by Antonio Cangiano through Hacker News. Since then, a bit over 40,000 people have read articles I’ve written. That is not a huge number in the grand scheme of things, but enough to draw a few lessons. The more I write, the […]

ExtJs JSON Reader Example

I received the following email from a reader: Thank you very much for finding time to read my mail. I came across your blog http://garysieling.com/blog/extjs-pie-chart-example It would be greatly helpful, if you could provide me with the code of binding the data dynamically to the DS. I have already generated the data in JSON format via […]

Extracting PDF text with Scala

This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn’t seem to have the ability to fill in interface […]

Scala zipAll Example

The zip function combines two lists into tuples. If the lists are of differing lengths, the shorter length is used. If you don’t like this behavior, the zipAll function will keep all elements, filling in specified values for the blanks (compare this to the recycling rule in R, which lets you continuously cycle through the […]
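
The difference between the two functions described above can be seen in a minimal sketch. The object name, the sample lists, and the filler values "?" and 0 are illustrative assumptions, not taken from the original post:

```scala
object ZipAllSketch {
  // Two lists of differing lengths (sample data, chosen for illustration)
  val letters: List[String] = List("a", "b", "c")
  val numbers: List[Int] = List(1, 2)

  // zip stops at the length of the shorter list
  val zipped: List[(String, Int)] = letters.zip(numbers)

  // zipAll keeps every element; the first filler ("?") stands in when
  // `letters` runs out, the second (0) when `numbers` runs out
  val zippedAll: List[(String, Int)] = letters.zipAll(numbers, "?", 0)

  def main(args: Array[String]): Unit = {
    println(zipped)    // List((a,1), (b,2))
    println(zippedAll) // List((a,1), (b,2), (c,0))
  }
}
```

Here only the second filler is used, because `numbers` is the shorter list; swapping the list lengths would bring "?" into play instead.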