“Postgres for Developers” – Notes from PGConf NYC 2014

I saw a talk by one of the core Postgres developers, which showed a bunch of interesting tricks for handling business rules in Postgres-specific SQL. These are all things you could find by reading the documentation, but they are worth writing up with examples to highlight what you can do. A lot of these end up being useful for writing systems with immutable data (especially auditing, and sometimes reporting systems).

Example 1: Array Aggregation

“array_agg” can be used to combine rows, which sort of resembles a pivot table operation (the array holds the same set of values that would be passed as arguments to other aggregate functions):

SELECT y, array_agg(x) FROM (
  SELECT 1 x, 2 y
  UNION ALL
  SELECT 2 x, 2 y
  UNION ALL 
  SELECT 3 x, 3 y
) a
GROUP BY y
 
2;"{1,2}"
3;"{3}"

If you use the above table as a common table expression, you can also rename the columns in the WITH block. You can even join on the arrays:

WITH t(a, b) AS
(
  SELECT y, array_agg(x) FROM (
    SELECT 1 x, 2 y
    UNION ALL
    SELECT 2 x, 2 y
    UNION ALL 
    SELECT 3 x, 3 y
  ) a
  GROUP BY y
)
SELECT * 
FROM t t1 JOIN t t2 ON t1.b[2] = t2.a
 
2;"{1,2}";2;"{1,2}"

Example 2: Named Window Functions

I’m not sure yet whether this is just syntactic sugar or has real value, but you can set up named “windows.”

By way of explanation, a lot of times when you start using aggregate functions (min, max, array_agg, etc), you end up using window functions, which resemble the following:

SELECT a, MAX(b) OVER (PARTITION BY a)
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
 
1;2
1;2
2;1

These allow you to calculate aggregate functions (like min/max) without collapsing all the rows into one.

For instance, if you sort these values, you can find the “next” or “previous” row in the partition, which is pretty standard SQL stuff:

SELECT a, lag(b) OVER (PARTITION BY a ORDER BY b)
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
 
1;
1;1
2;

What’s cool is you can move the “over partition by” part out of the query to the end as a named window, which presumably would be really nice if you had a lot of them, or wanted to re-use the same window for multiple fields:

SELECT a, lag(b) OVER w
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
WINDOW w AS (PARTITION BY a ORDER BY b)
 
1;
1;1
2;
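
To the earlier point about re-using the same window for multiple fields, here is a quick sketch with the same sample rows (several functions sharing one named window):

SELECT a,
       lag(b)  OVER w,
       lead(b) OVER w,
       MAX(b)  OVER w
FROM (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
) t1
WINDOW w AS (PARTITION BY a ORDER BY b)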

Example 3: Ranges
Postgres has a really cool feature, as of 9.2, where you can query whether something is in a range (ranges are a special type, kind of like the arrays above). This example is a bit contrived, to show that you could combine array_agg and range creation:

WITH _data AS (
  SELECT 1 a, 1 b
  UNION ALL 
  SELECT 2 a, 1 b
  UNION ALL 
  SELECT 1 a, 2 b
  UNION ALL
  SELECT 2 a, 2 b
),
_history AS (
  SELECT a, array_agg(b) _start, array_agg(b) _end
  FROM _data
  GROUP BY a
)
SELECT a, 
       _start[1], 
       _end[1], 
       int4range(_start[1]::INTEGER, _end[2]::INTEGER, '(]'::text) 
FROM _history
 
1;1;1;"[2,3)"
2;1;1;"[2,3)"

There are a bunch of range types built in (based on numerics, timestamps). Note that you can specify whether the endpoints on ranges are inclusive or exclusive.
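
For example, a quick sketch of the constructor syntax – the third argument controls which endpoints are included:

SELECT numrange(1.5, 3.5, '[)'),             -- includes 1.5, excludes 3.5
       int4range(1, 5, '(]'),                -- discrete, so canonicalized to [2,6)
       tsrange('2014-01-01', '2014-04-01');  -- bounds default to '[)'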

You can find out if a data value is within a range with the @> operator and see if two ranges overlap with &&. This set of functionality is great for exploring audit records – if you make a range with “[valid_from, valid_to)” you can query to find out what rows were effective on a particular date/time, for instance.
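
As a sketch (the price_history table and its valid_from/valid_to columns here are made up for illustration):

-- which rows were effective at a particular moment?
SELECT *
FROM price_history
WHERE tsrange(valid_from, valid_to, '[)') @> TIMESTAMP '2014-04-02 10:00';

-- which rows overlap a reporting window?
SELECT *
FROM price_history
WHERE tsrange(valid_from, valid_to, '[)') && tsrange('2014-01-01', '2014-04-01', '[)');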

If you’re in this area of functionality, also check out btree_gist indexes, which may be helpful for tuning this.
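
For instance (again with made-up names), btree_gist lets you mix a plain scalar column into a GiST index alongside the range expression:

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX price_history_effective_idx
    ON price_history
 USING gist (customer_id, tsrange(valid_from, valid_to, '[)'));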

Example 4: DISTINCT ON
Postgres has a feature to pull back the first row for each distinct value in a grouping. I assume this is a performance feature, but at the least it’s a very concise syntax for something that would otherwise require the use of RANK(). I imagine that you’d always want to use an ORDER BY with it.

The example from the docs for this one is pretty clear:

SELECT DISTINCT ON (location) location, TIME, report
FROM weather_reports
ORDER BY location, TIME DESC;
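
For comparison, here is a sketch of the same query without DISTINCT ON, using a window function (row_number here rather than RANK):

SELECT location, TIME, report
FROM (
  SELECT location, TIME, report,
         row_number() OVER (PARTITION BY location ORDER BY TIME DESC) AS rn
  FROM weather_reports
) t
WHERE rn = 1;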

There are a few other features that got a lot of play at the conference (e.g. foreign data wrappers) – more to come.

Postgres High Availability (Talk Notes, PGConf 2014)

I saw a talk by some operations engineers who work for ARIN at PGConf 2014.

They presented their architecture for high availability, which uses CMAN, Corosync, and Pacemaker, which handle message queueing, a quorum system, and Postgres communication, respectively (Corosync functions similarly to ZooKeeper, as I understand it).

Their architecture was interesting for a couple of reasons: it uses a three-node system (master, synchronous slave, and asynchronous slave), and it uses the network switch to cut off bad nodes. Nodes are tasked with monitoring other nodes, and when one node can’t communicate with another, it sends a message to the others to request that they check as well. A different node is responsible for agreeing that some node is down, in which case it notifies the switch that it needs to “fence” the bad node (i.e. cut off all its network traffic).

In a failure scenario, the synchronous node can immediately become master, and the async must catch up and become synchronous – once the bad node is fixed, it re-enters the pool as asynchronous, which allows it to catch up slowly. It’s worth noting that if there is no synchronous node after a failure, the async node will be forced to catch up before the system can start running again, which will trigger a pause for end users.

Other ways to implement this (without a switch) would include iptables. Interestingly, the ARIN team came up with a list of ~45 failure scenarios, which they apparently tested in about three weeks (clearly this can be a lot of effort). Also interesting, a member of the audience mentioned an alternate design using dark fiber to connect SCSI connections on hard drives in what sounded like a RAID, for replication, but I don’t think most of my readers are going to be doing that.

PGConfNYC Keynote Notes (Gilt)

One of the keynote presentations for PGConf NYC was given by a founder of Gilt Groupe (a site for flash sales of fashion items). This is an interesting business arena, in that the application has inherently spiky load.

The speaker observed that people tend to form small groups based on how many people we can keep track of (100-150), and that corporate org structures tend to influence software architecture. Consequently they introduced a concept within their corporate IT called “micro services”, where teams have the ability to provision resources (e.g. a docker instance running postgres + some application) – the cluster of dependent services then resembles clusters of human relationships, where some powerful people are at the center, fanning out toward newer or more rural nodes.

The nice consequences of this approach are that it assumes a natural level of failure (so you have to plan for it), and forces everyone to build relatively independent services, a similar model to how AWS was formed. Assuming failure also forces you to build in things like alerting, which (apparently) makes auditors happy, while also forcing you to run your infrastructure correctly. It sounds like they also have a concept where they can grant someone a temporary privilege to view a production database with an expiring account, which is a neat idea.

Like the other keynote speaker, he had a few interesting ideas for areas to improve Postgres: append-only tables (in a similar vein to immutable data) and tables that store diffs (again with concepts around immutable data).

PGConfNYC Keynote Notes (Goldman Sachs)

One of the keynote addresses at PGConf NYC was a talk by a technology manager at Goldman Sachs, who had a bunch of interesting insights into the ecosystem. In their environment, Postgres is one of a number of supported platforms, so they have a good position for comparison (including to a nameless “commercial database”, the name of which can probably be easily inferred).

The most interesting insight to me is that for large companies, the marginal cost of installing a database is very low: most of the cost of Oracle is having it at all, because you typically aren’t provisioning hardware, etc just to install a new application. On the other side, even if you remove the licensing costs, the total cost of ownership of an open-source database is still quite significant, given that someone still needs to build tooling around the database (e.g. for replication, auditing, training, monitoring, and the like). It’s also useful for a large company to have someone to call one way or the other, as you can’t post anything that might give away secrets in public lists.

Also of interest, there is a lot of talk at the conference around audit trails (and Auditors and Regulators) which dovetails nicely with the notion of immutable data. As this is a topic of interest in parts of the wider tech community, I’d expect a lot of developments in this area in the future.

He suggested a few new features that might be valuable: more programmatic configuration (vs. just editing files), a service name concept (similar to Oracle’s tnsnames, I assume) where you can address a database as a service without necessarily knowing where it is (similar to DNS), and one of my favorites, APIs that run alongside SQL (other ways to access the contents of the DB).

Overall I find it very insightful to see how different people are running their infrastructure, with varying scale and business constraints.

Mining Association Rules with R and Postgres

In one of my earlier pieces I explored decision trees in python, which let you train a machine learning algorithm to predict or classify data.

I like this style of model because the model itself is valuable; I’m more interested in finding underlying patterns than attempting to predict the future. Decision trees are nice if you want to predict one particular feature; however they aren’t as good for exploratory analysis and they force fit a hierarchy onto data.

Association rules are an alternative model with a somewhat similar characteristic (they produce probabilistic logic models), but they do not focus on a specific attribute. This is more the style of what Amazon does when it says “people who bought this also bought this” – it doesn’t matter which item you start or end with. We have to use R instead of python – the only mention of association rules in scikit-learn was a discussion of bumping them from the product to allow a timely release, so maybe they will be available there at some point in the future.

To construct this, I’ve built database views that have only the values I want included in my model – this makes it relatively easy to pick and choose which features are included in the resulting model (if you find yourself running the code below a lot, it makes sense to materialize the view).

This uses the RPostgreSQL library, nothing special here.

library("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, 
  host="dbhost", 
  user="postgres", 
  password="postgres", 
  port="5432", 
  dbname="dbname")
rs <- dbSendQuery(con,"select * from schema.my_view")
results <- fetch(rs,n=-1)

After a slight lag you’ll have all the data loaded into memory (hopefully you’re doing “small” data :) ) – if not there are other R libraries that claim to help (I’m in the ~100k row area).

One of the nice things about association rules is that they are designed to work with categorical data, so if you bring down numeric or date attributes, you need to bin them. This is a nice contrast to decision trees in python, where you have to force-fit all categorical attributes into boolean features (e.g. a feature might be “country_is_USA”, “country_is_Canada”, etc with a numeric value of 0 or 1).

You can see here that we need to do very little to transform the data – while the arules paper has binning examples, I don’t have any attributes that require that treatment.

for (column in names(results)) {
  results[column] <- as.factor(results[[column]])
}
 
lapply(results, class)
results_matrix <- as(results, "transactions")

Here is the key step – we merely call the arules library with a bunch of arguments. The two most important ones are “minlen” and “maxlen”, which control how many conditions you’d like in your rules. If you pick “big” numbers (the default is 10) this will create astronomical numbers of rules, as each additional size increases the rule count by orders of magnitude. Not only will this cause you to run out of RAM, it counter-intuitively fails after it claims to have generated all the rules, while it is writing the results to some un-named file.

library("arules")
rules <- apriori(results_matrix, 
  parameter = list(support = 0.1, confidence = 0.6, maxlen = 5, minlen=2))

If you want to actually see these rules, I found I was forced to write them out to a file, the reasons being that a) it’s nearly impossible to generate fewer than tens or hundreds of thousands of rules, and b) nearly every operation on them will run out of memory before it completes.

One example of this problem is the built-in “save to file” methods, which, oddly, convert the entire structure to a string before writing it to a file. I find it baffling that the R community can get these complex algorithms to work, but then fail to handle “large” file I/O in a graceful way. After messing around with various API methods from both arules and the R core, I ended up writing this to a file myself:

batchSize = 10000
maxSize = length(rules)
i <- 1
 
while(i <= maxSize)
{
  print(i)
  j <- i + batchSize - 1  # inclusive end of this batch, so consecutive batches don't overlap
  if (j > maxSize) {
    j <- maxSize
  }
 
  write(rules[i:j], 
        file = "D:\\projects\\tree\\apriori-large.csv", 
        quote=FALSE, 
        sep = ",", 
        append=(i > 1), 
        col.names = FALSE)
 
  i <- i + batchSize
}

What I found is that in addition to giving me rules which predict certain behaviors, this also uncovered hidden business rules in my data, which I hadn’t appreciated (there are also some obvious ones: attributes which are filled in by a common subsystem are filled in together – more interesting here would be the exception cases)

As one final consideration, it is relatively easy to make a dataframe that masks the original values with booleans, i.e. “this row has a value or doesn’t have a value”, which in some cases is more interesting than patterns in the actual data (especially when the original data is unique IDs). These also train super-fast. Note here that I’m using several different representations of null, as these are all in my data.

targets <- c(0, " ", "", NA, "-1")
# TRUE wherever the value is one of the null-ish representations above
mask <- apply(results, 2, "%in%", targets)
rules <- apriori(mask, 
                 parameter = list(support = 0.2, 
                                  confidence = 0.7, 
                                  maxlen = 5, 
                                  minlen=2))

The nice thing about association rules is that in addition to predicting outcomes, we can use them to explore concepts in the data, although without much depth, and they don’t force the data into a hierarchy. One problem remains though, which is filtering the rules in an appropriate way – since these resemble what Prolog does, it might be possible to use it to build concept trees from the data. Unfortunately I’ve yet to get SWI-Prolog to load my ruleset without running out of RAM – there is also a probabilistic Prolog, which looks promising.

It’s worth noting that while you can filter these rules based on the confidence of the rule, it’s not actually that helpful: your application may force two attributes to be filled out at the same time, which would give them a high correlation, but not a useful one. Association rules will also give you strange variations on this: if attributes A1 and A2 are filled in at the same time, you get a lot of rules like {A1} => {A2} and {A2} => {A1} (expected), but then also {A1, B} => {A2}, {A2, B} => {A1}, and so on.

The arules library has a fair number of options, which I will explore in future posts. I also intend to approach this from a different angle, to find outlier data (sort of the opposite of what we’re doing here).

Testing the output of tuned Postgres queries

Tuning SQL queries is a useful skill, and while many people struggle to manage complex SQL, the work is actually a series of simple tricks.

For instance, refactoring a query often brings about algorithmic improvements, and if you tune enough queries, finding mathematically equivalent forms becomes muscle memory. This includes operations like inlining common table expressions, and knowing how to apply factoring (which you may need to try to remember from grade school).
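
For example, a hypothetical rewrite (the orders table and its columns are made up), factoring a shared predicate out of an OR:

-- before: the status predicate is repeated in both branches
SELECT * FROM orders
WHERE (status = 'open' AND region = 'NY')
   OR (status = 'open' AND region = 'NJ');

-- after: factor out the shared term
SELECT * FROM orders
WHERE status = 'open'
  AND region IN ('NY', 'NJ');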

When applying these operations, a few defects arise regularly. Consider the following:

SELECT 'Gary' AS customerName, '321 Street' AS customerAddress
UNION ALL
SELECT '111 Road' AS customerAddress, 'Bob' AS customerName

This is a valid query in Postgres – while it does type checking on each set in the union, it does not force you to name the columns the same.
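
Mismatched types, on the other hand, do get caught – something like this fails with an error along the lines of “UNION types integer and text cannot be matched”, while the swapped column names above slide through silently:

SELECT 1 AS a
UNION ALL
SELECT 'x'::text AS a;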

This may look contrived, but it’s easier to make these happen in larger queries (e.g. long select lists)- the key point to understand is how easy it is to create subtle bugs.

Fortunately databases permit us to treat tables as sets, which allows you to do something like the following (note this assumes you have a unique key):

WITH a AS (
  ... BEFORE ...
), b AS (
  ... after ...
) 
SELECT COUNT(*) FROM a
UNION ALL
SELECT COUNT(*) FROM b
UNION ALL
SELECT COUNT(*) FROM (
   SELECT * FROM a 
   UNION
   SELECT * FROM b
) c

The “union” causes the inputs to be treated as sets, so they will be de-duplicated, whereas “union all” treats the inputs as lists and keeps every row, duplicates included.
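
A minimal illustration of the difference:

SELECT 1 UNION     SELECT 1;   -- one row: duplicates removed
SELECT 1 UNION ALL SELECT 1;   -- two rows: duplicates kept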

For this query, you expect the count(*) to be the same for the “before”, “after”, and combined sets.

To see the difference in rows, you can also do set differences:

WITH a AS (
  ... BEFORE ...
), b AS (
  ... after ...
) 
(SELECT * FROM a EXCEPT SELECT * FROM b)
UNION 
(SELECT * FROM b EXCEPT SELECT * FROM a)

And this will show you every row that is unique to either before or after – if it returns no results, you’ve left the query’s output the same. (The parentheses matter: UNION and EXCEPT have equal precedence and associate left to right, so without them you don’t get the symmetric difference.) If you have a lot of data, this can take some time to run, although it is typically far less than manual testing; if you wish, you can add an equivalent where clause to the before and after queries.

Ideally you should test this on a copy of production data. There are still classes of problems you could miss – but at least you know no one will be able to find them (yet).

There is one more problem:

Let’s say you swapped two columns, as in my original example. The set difference query will give you all the rows from each query. If you start with 100k rows, now you have 200k – how are you supposed to figure that out?

It turns out to be pretty simple:

WITH a AS (
  ... BEFORE ...
), b AS (
  ... after ...
) 
(SELECT * FROM a WHERE id = 12345 EXCEPT SELECT * FROM b WHERE id = 12345)
UNION 
(SELECT * FROM b WHERE id = 12345 EXCEPT SELECT * FROM a WHERE id = 12345)

Find one row that might be of interest, then apply your unique identifier. Now you have two rows.

Then scroll across the result until you spot the difference:
[pgAdmin screenshot: the two differing rows side by side]

This works well, whether you have dozens or hundreds of columns.

This is a very simple manual workflow, requiring no special tools – while I’ve used Postgres, it should work essentially the same in any relational database. If you don’t already work in this style, adopting it will save you tons of testing time and re-opened tickets, and remove the fear of changing unfamiliar areas of your product. As a developer, it can take you from the point where you struggle with “large” queries to where the size doesn’t matter.

Extracting Dates and Times from Text with Stanford NLP and Scala

Stanford NLP is a library for text manipulation, which can parse and tokenize natural language texts. Typically applications which operate on text first split the text into words, then annotate the words with their part of speech, using a combination of heuristics and statistical rules. Other operations on the text build upon these results with the same techniques (heuristics and statistical algorithms on earlier data), which results in a pipeline model.

Here, for instance, we see two techniques for constructing a pipeline, one based on configuration, and one manual. Since this example is going to extract dates and times from text, we add the TimeAnnotator class to the end of the pipeline:

import java.util.Properties

import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.time.TimeAnnotator

object Main {
  def main (args: Array[String]) {
    val props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
 
    val pipeline = new StanfordCoreNLP(props);
 
    val timeAnnotator = new TimeAnnotator()
    pipeline.addAnnotator(timeAnnotator)
 
    ...
  }
}

Once this is working, you simply tell the pipeline to annotate the text, and then wait for a bit.

val text =
  "Last summer, they met every Tuesday afternoon, from 1:00 pm to 3:00 pm."
val doc = new Annotation(text)
pipeline.annotate(doc)

The reason this takes time is that the actual work loads a handful of files from disk and works with them – while they are small, they contain large numbers of pre-defined rules. Consider the following sample, which is a small portion of the time-matching rules in the library (there are around a thousand lines of this sort of thing):

 BASIC_NUMBER_MAP = {
    "one": 1,
    "two": 2,
    "three": 3,
    ...
 }
 
BASIC_ORDINAL_MAP = {
  "first": 1,
  "second": 2,
  "third": 3,
  ...
}
 
PERIODIC_SET = {
  "centennial": TemporalCompose(MULTIPLY, YEARLY, 100),
  "yearly": YEARLY,
  "annually": YEARLY,
  "annual": YEARLY,
  ...
}

This sample demonstrates two things – you can pull out more than just exact times (e.g. “last summer”, “next century”, ranges, times without dates), and the library handles a large number of equivalence classes for you.

One of the most likely issues you’ll run into trying to get this working is getting the classpath and parsing pipeline set up right – while simple to look at, if you try to customize it, you’ll need to develop an understanding of how the library is actually structured.

Once you run it, you can get dates out:

    val timexAnnotations = doc.get(classOf[TimeAnnotations.TimexAnnotations])
    for (timexAnn <- timexAnnotations) {
      val timeExpr = timexAnn.get(classOf[TimeExpression.Annotation])
      val temporal = timeExpr.getTemporal()
      val range = temporal.getRange()
 
      println(temporal)
      println(range)
    }

For the above example, this gives you the following (note the alternating “temporal” and “range” lines). Note the “offset” entries as well – you can set a reference date for these, if you wish, so that they resolve to concrete dates:

XXXX-SU OFFSET P-1Y
(XXXX-SU OFFSET P-1Y,XXXX-SU OFFSET P-1Y,)
XXXX-WXX-2TAF
null
T13:00
(T13:00:00.000,T13:00:59.999,PT1M)
T15:00
(T15:00:00.000,T15:00:59.999,PT1M)

If you add the following line, these ranges will fix themselves:

doc.set(classOf[CoreAnnotations.DocDateAnnotation], "2013-07-14")

Which gives you:

2012-SU
(2012-06-01,2012-09,P3M)

One thing I haven’t figured out yet is whether you can specify locale settings for this – presumably in parsing dates, you at least want to know which of month/day is intended to come first, even if you treat the text as English as a whole (that said, this may require first identifying a language and building a different pipeline – e.g. if the library can’t handle French, Spanish, etc, being able to handle their dates is irrelevant).

For more examples, and information on customizing this, see the Stanford NLP documentation.

Exception in thread “main” java.lang.RuntimeException: Error parsing file: edu/stanford/nlp/models/sutime/defs.sutime.txt

The following error:

Exception in thread "main" java.lang.RuntimeException: Error parsing file: edu/stanford/nlp/models/sutime/defs.sutime.txt

is caused by using the Stanford NLP jars, but not putting the entire project on the classpath. To fix it, download the “full” distribution and add that folder to the classpath.

Generating Machine Learning Models with Scikit-Learn

I’ve written previously about the mechanics of building decision trees: extract data from some system, build a model on it, then save the model in a file for later:

[screenshot: tree output file]

Once you get that far, you’ll likely find that you want to try different models or change the parameters on the ones you’re using.

Ideally, you want to write code like this, which runs several different experiments in sequence, saving the results to a file:

args = {'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)
 
args = {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)
 
args = {'criterion': 'entropy', 'max_depth': 15, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)

Once you start running experiments, you’ll get loads of files, so it’s important to name them based on what you’re running. Having saved off the results, you’ll be able to save models in source control and keep a permanent record of what experiments you ran.

Like unit tests, you can write out the results of accuracy tests as well, which lets you see how well you’re doing.

[screenshot: saved tree files]

To make the filing system clear, the key is to take a python dictionary and turn it into a file name. With that in hand, we convert the model to JSON, and write it and the test results to a file.

def save(trees, features, args):
  print "Saving..."
  idx = 0
  params = ""
  for k in sorted(args.keys()):
    params = params + k + "-" + str(args[k]) + " "
 
  for t, report in trees:
    f = open('D:\\projects\\tree\\tree ' + params + \
             " fold " + str(idx) + ".json", 'w')
    f.write(report)
    f.write(treeToJson(t, features))
    f.close()
    idx = idx + 1

When we define the createModels function, it forces us to define and extract what this model-building process requires – for instance, what database tables to pull data from, a function to extract the value we want to predict, and how many folds we want to run.

The database operation in particular can be pretty heavy (pulling hundreds of thousands of records into memory), so memoization is key here.

def createModels(treeParams = None, folds = None, _cache = {}):
  print "Creating models..."
 
  def f(a):
    if '>' in a:
      return 1
    else:
      return 0
 
  if len(_cache) == 0:
    _cache['data'], _cache['outputs'] = \
      extract('income_trn', f, 'category')
    _cache['intData'], _cache['features'] = \
      transform(_cache['data'],  _cache['outputs'])
 
  trees = runTests(_cache['intData'], _cache['outputs'], \
                   treeParams, folds)
  save(trees, _cache['features'], treeParams)

In any project where you reshape real data, you typically spend an inordinate amount of time on data transformations, and much less on fun steps like model training or performance tuning. This I’ve found true whether you do machine learning, data warehousing, or data migration projects.

Fortunately most commercial products and libraries that work in those areas provide utilities for common operations, and this is an area where scikit-learn really shines.

For instance, when we run tests we might train the model on 75% of the data, then validate it on 25% of the data. There are different strategies like this, but in this case I’d like to do that four times (each time removing a different 25% of the data). If I had to write the code to do that, it would be a pain – one more source of defects.

Scikit-learn provides around a half dozen different implementations for splitting data, so if you run into a problem, chances are all you need to do is change which class you instantiate from the cross_validation package:

def runTests(intData, classify, treeParams, folds):
  print "Testing data..."
 
  rs = cross_validation.ShuffleSplit(len(intData), n_iter=folds,
    test_size=.25, random_state=0)
 
  foldNumber = 1
  trees = []
  for train_index, test_index in rs:
    print "Fold %s..." % (foldNumber)
    train_data = [intData[idx] for idx in train_index]
    test_data = [intData[idx] for idx in test_index]
 
    train_v = [classify[idx] for idx in train_index]
    test_v = [classify[idx] for idx in test_index]
 
    clf = tree.DecisionTreeClassifier(**treeParams)
    clf = clf.fit(train_data, train_v)
 
    predictions = [clf.predict(r)[0] for r in test_data]
    report = classification_report(test_v, predictions)
 
    trees.append( (clf, report) )
    foldNumber = foldNumber + 1
 
  return trees

Similarly, one of the painful parts of the tree implementation is figuring out how to use data that has both text and numeric attributes. This is a specific case of the general problem of transforming data, and scikit-learn has dozens of classes to help in this area.

def transform(measurements, classify):
  print "Transforming data..."
 
  vec = DictVectorizer()
  intData = vec.fit_transform(measurements).toarray()
  featureNames = vec.get_feature_names()
 
  return intData, featureNames

As a digression, there is an interesting technical problem that lies hidden under the surface of the API. Trees specifically require that you make all your rows of data arrays of numbers – while this is very frustrating when you come in with strings, it is by design and not by accident, and the code above using DictVectorizer will transform the data as you need.

When I first began researching python for data analysis, I was doing a broad search for tools, products, and libraries that might replace parts of the SQL + Relational database ecosystem. This includes tools that fall under the NoSQL umbrella, like Cassandra, Hadoop, and Solr, but also some ETL and data wrangling tools, like R, SAS, Informatica, and python.

If someone specifically wanted to avoid SQL as a language, for instance, there are quite a few options: swap in a different database underneath, use an ORM library to not have to see SQL, or use some sort of in-memory structure.

The python ORM (SQLAlchemy) is quite nice for the purposes of this article, as we can trivially read an entire table into RAM, converting each row into a dictionary:

def extract(tableName, resultXForm, resultColumn):
  print "Loading from database..."
 
  from sqlalchemy import create_engine, MetaData, Table
  from sqlalchemy.sql import select
  engine = create_engine(
                "postgresql+pg8000://postgres:postgres@localhost/pacer",
                isolation_level="READ UNCOMMITTED"
            )
  conn = engine.connect()
  meta = MetaData()
 
  table = Table(tableName, meta, autoload=True, autoload_with=engine)
 
  def toDict(row, columns):
    return {columns[i]: x for i, x in enumerate(row) \
      if columns[i] != resultColumn}
 
  s = select([table])
  result = conn.execute(s)
 
  columns = [c.name for c in table.columns]
  resultIdx = columns.index(resultColumn)
 
  data = [[c for c in row] for row in result]
  split = [toDict(row, columns) for row in data]
  classes = [resultXForm(row[resultIdx]) for row in data]
  return (split, classes)

There’s lots of code out there that looks like this, and it’s kind of kludgy, because eventually you’ll have so much data you wish you could switch algorithms as it grows, while serializing infrequently used structures to disk…which is exactly what relational databases give you.

One of the key value points of the scientific python stack is NumPy, which bridges a part of that gap (if you read “NumPy Internals“, it sounds an awful lot like something out of a database textbook).

This internal storage, then, is the reason for the above mentioned issue with Trees: internally, they use a massive array of floats, using NumPy underneath, so that it can be really fast on large datasets.

And with that, with a little bit of effort, we have a script that is almost exactly two hundred lines of code, allowing us to harness hundreds of thousands of engineering and grad student hours to find patterns out in the world.


UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

If you do the following:

[str(x) for x in data]

You’ll sometimes get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

This indicates that you have non-ascii characters in the data, and should use a wider type. You can fix it by doing the following (alternatively, you could do something like base64 encoding, although I’m not sure why you’d want to do that):

[unicode(x) for x in data]