Data Exploration in JavaScript

Google Analytics has a nice screen which shows alerts for changes that appear interesting – basically any large increase or decrease in traffic from a particular source:

[Screenshot: Google Analytics alerts screen]

With appropriate API hooks, this screen could be built for any application that models data dimensionally, e.g. one that uses faceted navigation (like Amazon search), or one of the many reporting applications used to measure sales, compliance, market research, etc. Ideally such an application should find interesting patterns in the data on its own, with minimal intervention. Reporting systems with dozens or hundreds of dimensions can quickly become overwhelming, causing the user to miss important information. Given the scale of such systems, it may not be feasible to explore every combination either.

Google Analytics has a decent API, so I’m demonstrating the idea there, but I have tested it elsewhere: it could work just as well against an Oracle data warehouse or Solr, although the intent is to work against a JSON API that returns string labels and numerical counts of events.
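To make "string labels and numerical counts" concrete, here is a minimal sketch of normalizing a response into [label, count] pairs. The response shape (a `rows` array of string arrays) and the helper name `toRows` are assumptions for illustration, not part of the original code.

```javascript
// Hypothetical response shape, assumed for illustration:
//   { rows: [["New Visitor", "1200"], ["Returning Visitor", "800"]] }
// Returns [label, count] pairs with the count converted to a number.
function toRows(response) {
  return (response.rows || []).map(function (row) {
    return [String(row[0]), Number(row[1])];
  });
}

var example = toRows({
  rows: [["New Visitor", "1200"], ["Returning Visitor", "800"]]
});
console.log(example);
```

Converting counts to numbers up front matters because comparing numeric strings with `>` compares them character by character.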

I modified the “Hello World” sample Google provides to take query arguments and a callback:

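function findResults(query, cb) {
  // Copy the caller's query into a new object so it is not modified,
  // then add the metric and dimension to request.
  makeApiCall(
    $.extend({}, query, {
      'metrics': metrics[0],
      'dimensions': dimensions[0]
    }),
    cb
  );
}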

The approach depends on identifying a few things in the data:

  • Dimensions that can be filtered
  • Values that can be aggregated within each group (GA calls these metrics)
  • Dates: another dimension, but one that requires special handling. We want comparisons such as year over year, month over month, or one 28-day period against the previous 28-day period, so that the days of the week line up
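The 28-day comparison can be sketched as below. The function name `comparisonWindows` and the `start-date` / `end-date` keys are assumptions for illustration. The point is that two back-to-back windows whose length is a multiple of 7 start on the same weekday.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Format a Date as YYYY-MM-DD in UTC.
function formatDate(date) {
  return date.toISOString().slice(0, 10);
}

// Given the last day of the current period and a length in days, return the
// current window and the window of the same length immediately before it.
function comparisonWindows(endDateString, days) {
  var end = new Date(endDateString + 'T00:00:00Z');
  var start = new Date(end.getTime() - (days - 1) * DAY_MS);
  var previousEnd = new Date(start.getTime() - DAY_MS);
  var previousStart = new Date(previousEnd.getTime() - (days - 1) * DAY_MS);
  return {
    current: { 'start-date': formatDate(start), 'end-date': formatDate(end) },
    previous: {
      'start-date': formatDate(previousStart),
      'end-date': formatDate(previousEnd)
    }
  };
}

console.log(comparisonWindows('2013-03-28', 28));
```

Working in UTC avoids off-by-one dates around daylight saving changes.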

Google Analytics separates “metrics” and “dimensions”:

var dimensions = 
  ['ga:visitorType', 'ga:visitCount', 'ga:daysSinceLastVisit'];
var metrics = 
  ['ga:visits', 'ga:newVisits'];
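The sample above only uses the first metric and the first dimension. A rough sketch of enumerating every metric/dimension pair as a separate query follows; `buildQueries` is a hypothetical helper, and the object keys mirror the ones passed to the API call earlier.

```javascript
// Build one query object per (metric, dimension) combination.
function buildQueries(metricNames, dimensionNames) {
  var queries = [];
  metricNames.forEach(function (metric) {
    dimensionNames.forEach(function (dimension) {
      queries.push({ metrics: metric, dimensions: dimension });
    });
  });
  return queries;
}

var allQueries = buildQueries(
  ['ga:visits', 'ga:newVisits'],
  ['ga:visitorType', 'ga:visitCount', 'ga:daysSinceLastVisit']
);
console.log(allQueries.length + ' queries to run');
```

With two metrics and three dimensions this yields six queries; the count grows multiplicatively, which is why the exploration needs to be selective.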

Before proceeding, it’s worth discussing what constitutes an “interesting” event. Realistically, we expect to find several types of events:

  • Natural (milestones, hiring a new vendor)
  • Infrequent (hurricane, lawsuit)
  • Preventable (fraud, human error, software defects)
  • Worth encouraging (good hire, research discovery, sales increase)

Some of these may be large (a massive traffic increase) and others minor (an intermittent error). If you’re doing advertising, for instance, you may consider statistical significance important, whereas for compliance or error monitoring you may not. It’s important that different facets of interestingness can be combined easily, without being overly specific.
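One way to keep facets of interestingness easy to combine is to write each one as a small function that returns a message or nothing, and run a list of them. This is only a sketch: `increased`, `smallCount` and `runDetectors` are hypothetical names, and rows are assumed to be [label, value] pairs.

```javascript
// Detector: reports when the current value is above the previous one.
function increased(previous, current) {
  if (current[1] > previous[1]) {
    return current[0] + ' increased from ' + previous[1] + ' to ' + current[1];
  }
}

// Detector factory: reports when the current value is below a limit.
function smallCount(limit) {
  return function (previous, current) {
    if (current[1] < limit) {
      return current[0] + ' is a small event (' + current[1] + ')';
    }
  };
}

// Run every detector and keep only the messages that were produced.
function runDetectors(detectors, previous, current) {
  return detectors
    .map(function (detector) { return detector(previous, current); })
    .filter(function (message) { return Boolean(message); });
}

console.log(runDetectors([increased, smallCount(10)],
                         ['ga:visits', 5], ['ga:visits', 8]));
```

An advertising report and a compliance report can then share the same runner and differ only in the list of detectors they pass in.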

This system needs to queue up queries to run in order to explore the data. There are two kinds of interesting: interesting enough to report, and interesting enough to drill down into. Whether you want deep detail or broad insight dictates a depth-first or breadth-first approach.
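The queue and the two kinds of interesting can be sketched as below. Everything here is an assumption for illustration: `explore` is a hypothetical function, `runQuery` is assumed to be synchronous (a real API call would be asynchronous), and it is assumed to return an object with an optional `report` message and an optional `drillDown` list of follow-up queries.

```javascript
// Explore queries starting from rootQueries.
// strategy 'depth' follows each drill-down before moving to sibling queries;
// anything else works through the queue level by level (breadth first).
function explore(rootQueries, runQuery, strategy) {
  var frontier = rootQueries.slice();
  var reported = [];
  while (frontier.length > 0) {
    var query = frontier.shift();
    var outcome = runQuery(query);
    if (outcome.report) {
      reported.push(outcome.report);      // interesting enough to report
    }
    var next = outcome.drillDown || [];   // interesting enough to drill down
    frontier = strategy === 'depth'
      ? next.concat(frontier)
      : frontier.concat(next);
  }
  return reported;
}
```

The only difference between the two strategies is whether follow-up queries go to the front or the back of the queue, so switching between deep detail and broad insight is a one-argument change.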

A simple increase might be interesting, or a percentage-based increase, or an increase from a particular source. Once we write a function to note this, we can detect all of these easily:

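function testIncrease() {
    if (first > second) {
        test(first, second, "overall",
            function (a, b) {
                // a and b are [name, value] pairs; Number() guards against
                // an API that returns counts as strings.
                var metricName1 = a[0];
                var metricValue1 = Number(a[1]);
                var metricValue2 = Number(b[1]);
                if (metricValue2 > metricValue1) {
                    // Percentage is relative to the starting value; skip it
                    // when the starting value is zero.
                    var percent = metricValue1 === 0 ? "" :
                        " (" + Math.round(100 *
                        (metricValue2 - metricValue1) /
                        metricValue1) + "%)";
                    return "Increased " + metricName1 + ": " +
                           metricValue2 + " > " + metricValue1 + percent;
                }
            }
        );
    }
}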
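For the percentage-based variant, a small sketch that requires both an absolute and a relative threshold may help: a jump from 2 to 4 is +100% but rarely worth reporting. The names `percentChange` and `isNotableIncrease` and both thresholds are hypothetical.

```javascript
// Percentage change relative to the earlier value.
// Returns null when the earlier value is zero, since the ratio is undefined.
function percentChange(previousValue, currentValue) {
  if (previousValue === 0) {
    return null;
  }
  return 100 * (currentValue - previousValue) / previousValue;
}

// An increase is notable when it is large in absolute terms and, where a
// percentage can be computed, large in relative terms as well.
function isNotableIncrease(previousValue, currentValue, minAbsolute, minPercent) {
  var delta = currentValue - previousValue;
  var percent = percentChange(previousValue, currentValue);
  return delta >= minAbsolute && (percent === null || percent >= minPercent);
}

console.log(isNotableIncrease(100, 150, 20, 25));
```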

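Another useful signal is the distribution of digits in the numbers themselves. The check below looks at ending digits and expects 0’s to occur most frequently, 1’s somewhat less, and so on, with 9’s the least frequent. This resembles Benford’s law, under which digit probabilities fall off logarithmically; strictly speaking that law describes leading digits (a leading 1 appears about 30% of the time, a leading 9 under 5%), while ending digits in most data are closer to uniform, so treat the ending-digit version as a rough heuristic. Either way, a skewed digit distribution can flag unusual patterns in data. I found that a certain table in a QA environment was generated programmatically this way (all groups in the table had the same number of entries, so it’s not that special of a discovery).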

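function testFrequency(values) {
  // The original read the data from two globals (x and values); here it is
  // assumed to be one array, passed in as a parameter.
  // bins[d] counts how many values end in the digit d.
  var bins = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];

  $.each(values, function (i, v) {
    var val = v + '';
    bins[parseInt(val.substring(val.length - 1), 10)]++;
  });

  var sum = 0;
  $.each(values, function (i, v) {
    sum += v;
  });

  // Crude expectation that shrinks as the digit grows; a bin that falls
  // short of it is logged.
  $.each(bins, function (index, value) {
    var multiplier = Math.round(sum / (1 + index));
    if (multiplier > value) {
      console.log("Found interesting data: " + value);
    }
  });
}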
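For comparison, here is a sketch of the leading-digit form of the test, using the Benford's law probabilities log10(1 + 1/d) for d = 1 to 9. This is a swapped-in variant, not the ending-digit check above, and the function names are hypothetical.

```javascript
// Expected share of each leading digit 1..9 under Benford's law.
// Index 0 holds the share for digit 1.
function benfordExpected() {
  var expected = [];
  for (var d = 1; d <= 9; d++) {
    expected.push(Math.log(1 + 1 / d) / Math.LN10);
  }
  return expected;
}

// Observed share of each leading digit; zeros have no leading digit and are skipped.
function leadingDigitShares(values) {
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0];
  var total = 0;
  values.forEach(function (value) {
    var match = String(Math.abs(value)).match(/[1-9]/);
    if (match) {
      counts[Number(match[0]) - 1]++;
      total++;
    }
  });
  return counts.map(function (count) {
    return total === 0 ? 0 : count / total;
  });
}

// Largest gap between observed and expected shares across the nine digits.
function maxBenfordDeviation(values) {
  var expected = benfordExpected();
  var worst = 0;
  leadingDigitShares(values).forEach(function (share, i) {
    worst = Math.max(worst, Math.abs(share - expected[i]));
  });
  return worst;
}
```

A large deviation on a reasonably sized sample suggests the numbers were generated or capped rather than arising naturally; what counts as "large" is a threshold you would have to tune for your own data.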

A third category is rare events. This detects minor things: Google Analytics tends not to show you when some blog links to you. It is a good way to find the ‘long tail’ of marketing events.

function done() {
    if (first > second) {
        test(first, second, "overall",
            function (a, b) {
                // a is a [name, value] pair; flag it when the count is tiny.
                var metricName1 = a[0];
                var metricValue1 = Number(a[1]);
                if (metricValue1 < 10) {
                    return "Small event: " + metricName1 + ": " + metricValue1;
                }
            }
        );
    }
}
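Applied to a whole result set, such as visits grouped by referring site, the same idea can be sketched as a filter. The helper name `smallEvents`, the limit, and the sample rows are made up for illustration.

```javascript
// Keep rows whose count is positive but below the limit, smallest first.
// Rows are assumed to be [label, count] pairs.
function smallEvents(rows, limit) {
  return rows
    .filter(function (row) { return row[1] > 0 && row[1] < limit; })
    .sort(function (a, b) { return a[1] - b[1]; });
}

// Illustrative data only.
var referrers = [
  ['blog-a.example', 3],
  ['search', 5000],
  ['blog-b.example', 1],
  ['none', 0]
];
console.log(smallEvents(referrers, 10));
```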

The source is available on GitHub.