My First A/B Test… with Results

A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast […]

Full-Text Indexing PDFs in JavaScript

I once worked for a company that sold access to legal and financial databases (or, as they call it, “intelligent information”). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the […]

Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun. In this example, the entities I’d like to […]

Implementing k-means in Scala

To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two – the exact points used are in the code below. This implements Lloyd’s algorithm, which tries to cluster points in iterations in a simple manner: 1. Assume a certain number of […]
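
A minimal sketch of Lloyd’s algorithm in Python rather than the post’s Scala, for readers who want to see the assign-then-update loop in one place. The function names (`sample_points`, `kmeans`), the spread of 2.0, the 50 points per center, and the seeds are all assumptions made for illustration; the generated points are not the exact points used in the post.

```python
import random


def sample_points(centers=((10, 20), (25, 5)), per_center=50, spread=2.0, seed=1):
    """Draw normally distributed points around each center (illustrative data only)."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(per_center)]


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(points, centroids):
    """Assignment step: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random data points
    clusters = assign(points, centroids)
    for _ in range(max_iterations):
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # no centroid moved, so the assignments are stable
        centroids = updated
        clusters = assign(points, centroids)
    return centroids, clusters


if __name__ == "__main__":
    found, groups = kmeans(sample_points(), k=2)
    for center, members in zip(found, groups):
        print(center, len(members))
```

With clusters this well separated, the recovered centroids should land near the two chosen points; on overlapping data the result depends on the initial centroids, which is a known limitation of this approach.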

Extracting Social Media Vote Counts for Reddit, Twitter, Google+ and Hacker News

Ever wonder if your blog posts have been submitted to sites like Reddit or Hacker News, who submitted them, and how well they did? All this data is available through JSON APIs, official or not. This code example collects the submission statistics for a WordPress blog entry – Twitter and Google+ posts, Hacker News posts/votes, […]

Building a Terabyte-scale Math Platform

Cliff Click, 0xdata. Click represents 0xdata, which is building a system that can handle R-style analysis at high speed and scale, aimed at companies that do advertising or credit card fraud detection, where transaction volume is large and money is lost waiting for models to rebuild. Typically this data comes from a variety of sources, file […]

Job Title Trends in Computing Fields

The Bureau of Labor Statistics creates a listing of job titles, average salary, number of jobs, and projections. Their taxonomy groups people into 750 job title categories, in some odd groupings. Few categories are set to show declines, particularly in any job type even vaguely related to the IT field. There are a few exceptions, […]


Simulating a Line-Following Robot in R

I’ve been reading up on controlling mobile robots, and built a simple robotic movement simulator in R, using R graphing libraries. The motivation is to practice setting up the math for controlling a robot without having to build a physical device. Starting with an oversimplified model allows learning a bit at a […]

Finding the beat in R

In a previous article, I described a method for detecting chords in an audio file (also available for Scala). Continuing on this theme, the following will find the onset of a drumbeat in a file, using R. I’m using a single drumstick click, which you can hear on freesound.org. This method detects sudden volume increases- […]
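
One way to detect a sudden volume increase can be sketched as follows, in Python rather than the post’s R. This is an assumption about the general idea, not the post’s actual method: it compares the RMS level of each fixed-size window with the previous one and reports a jump. The names `window_rms`, `find_onsets`, and `synthetic_click`, the window size, the ratio, and the noise floor are all hypothetical, and the input is a synthetic decaying sine, not the freesound.org recording.

```python
import math


def window_rms(samples, size):
    """RMS level of each consecutive, non-overlapping window."""
    return [math.sqrt(sum(s * s for s in samples[i:i + size]) / size)
            for i in range(0, len(samples) - size + 1, size)]


def find_onsets(samples, size=256, ratio=4.0, floor=0.01):
    """Return start indices of windows whose level jumps above the previous one.

    `floor` stands in for the level of near-silence so that a quiet
    previous window does not make every small fluctuation look like a jump.
    """
    levels = window_rms(samples, size)
    onsets = []
    for i in range(1, len(levels)):
        previous = max(levels[i - 1], floor)
        if levels[i] > ratio * previous:
            onsets.append(i * size)
    return onsets


def synthetic_click(rate=8000, lead_in=2048, length=2048, tail=2048):
    """Silence, then a decaying 1 kHz burst, then silence (made-up test signal)."""
    burst = [math.exp(-30.0 * n / rate) * math.sin(2 * math.pi * 1000 * n / rate)
             for n in range(length)]
    return [0.0] * lead_in + burst + [0.0] * tail


if __name__ == "__main__":
    print(find_onsets(synthetic_click()))
```

The reported position is only as precise as the window size, so a smaller window gives a finer onset estimate at the cost of a noisier level measurement.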

Book Review: R Cookbook

The R Cookbook is written by Paul Teetor, a developer with degrees in statistics and computer science, specializing in finance. The programming language R is a specialized language designed for deep statistical research, although it has some support for other mathematical fields, such as matrix algebra and signal processing. True to the O’Reilly cookbook format, […]