A History of Philadelphia Churches through Maps, Part II

In the first part of this series, I discussed how the distribution of churches across the Philadelphia region ties to population density, suggesting that visual patterns in maps can be used to better understand slices of our history. This material isn’t particularly novel and tells stories that are fairly well known; my interest is driven in […]

Google +1’s and Search Rankings

There’s some debate on SeoMoz / Hacker News. I thought I’d share my experience, since it seems to conflict with the two main arguments: the first concerns the correlation between Google +1’s and search rankings, and the second is that Google just wants everyone to write quality content. Here is a chart from the […]


Finding Image Boundaries in Python

I’m working my way through Programming Computer Vision with Python, a compact introduction to computer vision. Computer vision is a fascinating subset of computer science that has recently pushed aggressively forward through a combination of Department of Defense research in self-driving cars, video game development, and rapid improvements in computer hardware. I’m writing a series of […]


Counting Citations in U.S. Law

The U.S. Congress recently released a series of XML documents containing U.S. laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently allows us to see what Congress has spent the most time thinking about. Citations occur for many reasons: […]

Examining Citations in Federal Law using Python

Congress frequently passes laws that amend or repeal sections of prior laws; this produces a series of edits to the law that programmers will recognize as resembling source control history. In concept this is simple, but in practice it is incredibly complex: for instance, like source control, the system must handle renumbering. What […]

U.S. Laws vs. The Human Genome

Since you can download the U.S. Code, I thought it would be interesting to compare its size to that of the Human Genome, operating on the premise that the latter represents the DNA for a living thing, and the former, the DNA for a nation. I’ve charted this below – to reproduce this you need […]

U.S. Code Available in XML Format

I saw today that the U.S. Code is available online now for download in a structured format. Ideas for apps, anyone? It’s worth noting that this was available in some form already, e.g. through Cornell. To give you a taste of what is there, I extracted a few interesting sections. The first part is some […]


Part of Speech Tagging: NLTK vs Stanford NLP

One of the difficulties inherent in machine learning techniques is that the most accurate algorithms refuse to tell a story: we can discuss the confusion matrix, testing and training data, accuracy and the like, but it’s often hard to explain in simple terms what’s really going on. Practically speaking this isn’t a big issue from […]


Extracting Tables from PDFs in Javascript with PDF.js

A common and difficult problem in acquiring data is extracting tables from a PDF. Previously, I described how to extract the text from a PDF with PDF.js, a PDF rendering library made by Mozilla Labs. The rendering process requires an HTML canvas object, and then draws each object (character, line, rectangle, etc.) on it. The easiest […]


Exploring Zipf’s Law with Python, NLTK, SciPy, and Matplotlib

Zipf’s Law, first noticed in the 1930s, states that the frequency of a word in a corpus of text is inversely proportional to its rank. Unlike a “law” in the sense of mathematics or physics, this is based purely on observation, without any strong explanation that I can find of the causes. We can explore this concept […]