U.S. Code Available in XML Format

I saw today that the U.S. Code is available online now for download in a structured format. Ideas for apps, anyone? It’s worth noting that this was available in some form already, e.g. through Cornell. To give you a taste of what is there, I extracted a few interesting sections. The first part is some […]

Converting JSON to a CSV file with Python

In a previous post, I showed how to extract data from the Google Maps API, which leaves a series of JSON files, like this: {"address_components": [{"long_name":"576","short_name":"576","types":["street_number"]}, {"long_name":"Concord Road","short_name":"Concord Road","types":["route"]},{"long_name":"Glen Mills","short_name":"Glen Mills","types":["locality","political"]},{"long_name":"PA","short_name":"PA","types":… Ideally we want selections from these as a CSV for manual review and for import into mapping software. First, we load a list of files: […]
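As a rough sketch of the idea (not the post's own code), the snippet below flattens one geocoder result into a CSV row using only the standard library. The `SAMPLE` string, the helper names `flatten` and `to_csv`, and the choice of columns are assumptions made for illustration; a real response carries more fields than the excerpt shows.

```python
import csv
import io
import json

# Hypothetical sample shaped like the excerpt above; real results have more fields.
SAMPLE = """{"address_components": [
  {"long_name": "576", "short_name": "576", "types": ["street_number"]},
  {"long_name": "Concord Road", "short_name": "Concord Road", "types": ["route"]},
  {"long_name": "Glen Mills", "short_name": "Glen Mills", "types": ["locality", "political"]}
]}"""


def flatten(result):
    """Map each component's first type to its long_name."""
    return {
        component["types"][0]: component["long_name"]
        for component in result["address_components"]
        if component["types"]
    }


def to_csv(results, columns):
    """Write one CSV row per result, keeping only the requested columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for result in results:
        writer.writerow(flatten(result))
    return buffer.getvalue()


csv_text = to_csv([json.loads(SAMPLE)], ["street_number", "route", "locality"])
print(csv_text)
```

Keying each column on the first entry in `types` is a simplification; components with several types may need a more deliberate mapping.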

Generating Randomized Sample Data in Python

If you have access to a production data set, it is helpful to generate testing data which follows a similar format, in varying quantities. By introspecting a database, we can identify stated constraints. Given sufficient data volume, we can also infer implicit business process constraints. If preferred, we can also find records that may generate […]


Part of Speech Tagging: NLTK vs Stanford NLP

One of the difficulties inherent in machine learning techniques is that the most accurate algorithms refuse to tell a story: we can discuss the confusion matrix, testing and training data, accuracy and the like, but it’s often hard to explain in simple terms what’s really going on. Practically speaking this isn’t a big issue from […]

Extracting Tables from PDFs in Javascript with PDF.js

A common and difficult problem in acquiring data is extracting tables from a PDF. Previously, I described how to extract the text from a PDF with PDF.js, a PDF rendering library made by Mozilla Labs. The rendering process requires an HTML canvas object, and then draws each object (character, line, rectangle, etc.) on it. The easiest […]

Exploring Zipf’s Law with Python, NLTK, SciPy, and Matplotlib

Zipf’s Law states that the frequency of a word in a corpus of text is inversely proportional to its rank, a pattern first noticed in the 1930s. Unlike a “law” in the sense of mathematics or physics, this is based purely on observation, without any strong explanation of the causes that I can find. We can explore this concept […]
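To make the rank-frequency relationship concrete, here is a minimal sketch that uses only the standard library rather than the NLTK, SciPy, and Matplotlib stack named in the title. The function name `rank_frequency`, the simple regex tokenizer, and the toy sentence are assumptions for illustration. On a large corpus, Zipf’s Law suggests that rank times frequency stays roughly constant.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy text, far too small to show the law itself; it only demonstrates the bookkeeping.
table = rank_frequency("the cat and the dog and the bird")
for rank, word, freq in table:
    print(rank, word, freq, rank * freq)
```

Plotting log(frequency) against log(rank) for a real corpus should give points close to a straight line if the law holds.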


Self-modifying Javascript objects

I thought it’d be interesting to consider Javascript objects which can modify their own behavior over time. A use case for this (what I’m doing) is pages of data, where data is first looked up, then cached in memory, then cleared. A simple case (demonstrated below) is memoization. The following query will do an ajax […]

Rhyming with NLP and Shakespeare

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing, focused on NLTK, which among other things, does a nice job of collecting NLP datasets and algorithms into one library. Let’s take one of Shakespeare’s sonnets and see if we can recommend alternate rhymes: import nltk from nltk.corpus […]

Six Join Implementations in Javascript

A join is an operation between two tables of data, combining the results by looking up keys from one table in a second table. While a simple operation in concept, there are many ways to do this, and understanding the variations is important to understanding database behavior (for a discussion of how the algorithms are […]
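As one illustration of the key-lookup idea, the sketch below shows a hash join: index one table by key, then probe that index with each row of the other. It is written in Python rather than the post's Javascript, it is not one of the post's six implementations, and the table contents and the name `hash_join` are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join of two lists of dicts on a shared key column."""
    # Build phase: index the right-hand table by its key values.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Probe phase: look up each left-hand row's key in the index.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample tables.
people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]

result = hash_join(people, orders, "id")
print(result)
```

Rows without a match on the other side are dropped, which is the inner-join behavior; an outer join would need to emit them with empty columns instead.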