
Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used: Vagrant + VirtualBox, Node.js, node-static, Lunr.js, node-lazy, PhantomJS. Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided […]


My First A/B Test… with Results

A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast […]


Building a full-text index of git commits using lunr.js and GitHub APIs

GitHub has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer […]


Full-Text Indexing PDFs in JavaScript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information”). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the […]


Lessons Learned from 0 to 40,000 Readers

Starting Out: I started writing a little over a year ago, after finding “Technical Blogging” by Antonio Cangiano through Hacker News. Since then, a bit over 40,000 people have read articles I’ve written, not a huge number in the grand scheme of things, but enough to draw a few lessons. The more I write, the […]

ExtJs JSON Reader Example

I received the following email from a reader: Thank you very much for finding time to read my mail. I came across your blog http://garysieling.com/blog/extjs-pie-chart-example It would be greatly helpful, if you could provide me with the code of binding the data dynamically to the DS. I have already generated the data in JSON format via […]


Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun. In this example, the entities I’d like to […]

Extracting PDF text with Scala

This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn’t seem to have the ability to fill in interface […]


Implementing k-means in Scala

To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two – the exact points used are in the code below. This implements Lloyd’s algorithm, which tries to cluster points in iterations in a simple manner: 1. Assume a certain number of […]
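
As a rough illustration of Lloyd’s algorithm, here is a minimal sketch in Scala. It is not the code from the post: the object name `KMeansSketch`, the helper names, and the `maxIter` cap are all assumptions made for this example. Each iteration assigns every point to its nearest centroid, then moves each centroid to the mean of its assigned points, stopping when the centroids no longer change.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance; enough for comparing which centroid is nearer.
  def distSq(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // One Lloyd iteration per call: assign, update, repeat until stable.
  // maxIter is a hypothetical safety cap, not part of the algorithm itself.
  @annotation.tailrec
  def lloyd(points: Seq[Point], centroids: Seq[Point], maxIter: Int = 100): Seq[Point] = {
    // Assignment step: group points by the index of their nearest centroid.
    val clusters = points.groupBy(p => centroids.indices.minBy(i => distSq(p, centroids(i))))
    // Update step: a centroid with no assigned points stays where it is.
    val updated = centroids.indices.map(i => clusters.get(i).map(mean).getOrElse(centroids(i)))
    if (maxIter <= 1 || updated == centroids) updated
    else lloyd(points, updated, maxIter - 1)
  }
}
```

Calling `KMeansSketch.lloyd(points, initialCentroids)` with two starting centroids returns two final centroids. The result depends on the starting guesses, which is a known property of the algorithm.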

Scala zipAll Example

The zip function combines two lists into tuples. If the lists are of differing lengths, the shorter length is used. If you don’t like this behavior, the zipAll function will keep all elements, filling in specified values for the blanks (compare this to the recycling rule in R, which lets you continuously cycle through the […]
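
A minimal sketch of the difference between the two, using made-up lists and filler values (`"?"` and `0` are arbitrary choices for this example). In `zipAll`, the first filler pads the left-hand list if it is the shorter one, and the second pads the right-hand list.

```scala
object ZipAllExample {
  val letters = List("a", "b", "c")
  val numbers = List(1, 2)

  // zip stops at the end of the shorter list, so "c" is dropped.
  val zipped: List[(String, Int)] = letters.zip(numbers)

  // zipAll keeps every element; the shorter list (numbers) is padded with 0.
  val zippedAll: List[(String, Int)] = letters.zipAll(numbers, "?", 0)

  def main(args: Array[String]): Unit = {
    println(zipped)    // List((a,1), (b,2))
    println(zippedAll) // List((a,1), (b,2), (c,0))
  }
}
```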