Generating Machine Learning Models with Scikit-Learn

I’ve written previously about the mechanics of building decision trees: extract data from some system, build a model on it, then save the model in a file for later:

[Image: tree-output]

Once you get that far, you’ll likely find that you want to try different models or change the parameters on the ones you’re using.

Ideally, you want to write code like this, which runs several different experiments in sequence, saving the results to a file:

args = {'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)

args = {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)

args = {'criterion': 'entropy', 'max_depth': 15, 'min_samples_split': 10}
createModels(treeParams=args, folds = 3)
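Since only max_depth changes between these runs, a simple loop expresses the same sweep more compactly (a sketch using the same createModels signature as above):

for depth in [5, 10, 15]:
  args = {'criterion': 'entropy', 'max_depth': depth,
          'min_samples_split': 10}
  createModels(treeParams=args, folds=3)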

Once you start running experiments, you’ll get loads of files, so it’s important to name them based on what you’re running. Having written the results out this way, you can check the models into source control and keep a permanent record of the experiments you ran.

As with unit tests, you can write out the results of accuracy tests as well, which lets you see how well each model is doing.

[Image: trees]

To keep the filing system clear, the key is to take the Python dictionary of parameters and turn it into a file name. With that in hand, we convert the model to JSON and write it, along with the test results, to a file.

def save(trees, features, args):
  print "Saving..."

  # Build a file name fragment from the sorted parameter dictionary,
  # e.g. "criterion-entropy max_depth-5 min_samples_split-10"
  params = ""
  for k in sorted(args.keys()):
    params = params + k + "-" + str(args[k]) + " "

  # Write each fold's classification report and serialized tree to its own file
  for idx, (t, report) in enumerate(trees):
    f = open('D:\\projects\\tree\\tree ' + params + \
             " fold " + str(idx) + ".json", 'w')
    f.write(report)
    f.write(treeToJson(t, features))
    f.close()

When we define the createModels function, it forces us to spell out what this model-building process requires – for instance, which database table to pull data from, a function to extract the value we want to predict, and how many folds we want to run.

The database operation in particular can be pretty heavy (pulling hundreds of thousands of records into memory), so memoization is key here – the mutable default argument _cache = {} below is created only once, so repeated calls reuse the data that has already been loaded and transformed.

def createModels(treeParams = None, folds = None, _cache = {}):
  print "Creating models..."

  # Map the raw category value to a 1/0 class label (1 when it contains '>')
  def f(a):
    if '>' in a:
      return 1
    else:
      return 0

  # The mutable default _cache dict is shared across calls, so the expensive
  # extraction and transformation only happen the first time
  if len(_cache) == 0:
    _cache['data'], _cache['outputs'] = \
      extract('income_trn', f, 'category')
    _cache['intData'], _cache['features'] = \
      transform(_cache['data'], _cache['outputs'])

  trees = runTests(_cache['intData'], _cache['outputs'], \
                   treeParams, folds)
  save(trees, _cache['features'], treeParams)

In any project where you reshape real data, you typically spend an inordinate amount of time on data transformations and much less on the fun steps like model training or performance tuning. I’ve found this true whether you’re doing machine learning, data warehousing, or data migration projects.

Fortunately most commercial products and libraries that work in those areas provide utilities for common operations, and this is an area where scikit-learn really shines.

For instance, when we run tests we might train the model on 75% of the data, then validate it on the remaining 25%. There are different strategies like this, but in this case I’d like to do that four times (each time holding out a different 25% of the data). If I had to write the code to do that, it would be a pain – one more source of defects.

Scikit-learn provides around a half dozen different implementations for splitting data, so if you run into a problem, chances are all you need to do is change which class you instantiate from the cross_validation package:

def runTests(intData, classify, treeParams, folds):
  print "Testing data..."

  rs = cross_validation.ShuffleSplit(len(intData), n_iter=folds,
    test_size=.25, random_state=0)

  foldNumber = 1
  trees = []
  for train_index, test_index in rs:
    print "Fold %s..." % (foldNumber)
    train_data = [intData[idx] for idx in train_index]
    test_data = [intData[idx] for idx in test_index]

    train_v = [classify[idx] for idx in train_index]
    test_v = [classify[idx] for idx in test_index]

    clf = tree.DecisionTreeClassifier(**treeParams)
    clf = clf.fit(train_data, train_v)

    predictions = clf.predict(test_data)
    report = classification_report(test_v, predictions)

    trees.append( (clf, report) )
    foldNumber = foldNumber + 1

  return trees
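If, say, you wanted folds that each cover a distinct slice of the data instead of repeated random 75/25 shuffles, it’s a one-line change to swap the splitter (a sketch against the same old-style cross_validation API used above):

# Swap ShuffleSplit for KFold: every row lands in exactly one test fold,
# and everything downstream of rs stays the same.
rs = cross_validation.KFold(len(intData), n_folds=folds,
                            shuffle=True, random_state=0)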

Similarly, one of the painful parts of the tree implementation is figuring out how to use data that has both text and numeric attributes. This is a specific case of the general problem of transforming data, and scikit-learn has dozens of classes to help in this area.

def transform(measurements, classify):
  print "Transforming data..."

  vec = DictVectorizer()
  intData = vec.fit_transform(measurements).toarray()
  featureNames = vec.get_feature_names()

  return intData, featureNames

As a digression, there is an interesting technical problem hidden under the surface of the API. Trees specifically require that every row of data be an array of numbers. While this is frustrating when you come in with strings, it is by design rather than by accident, and the DictVectorizer code above transforms the data as you need.
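To make that concrete, here is roughly what DictVectorizer does with a couple of mixed rows (the field names here are just for illustration): numeric values pass through unchanged, while each distinct string value gets its own 0/1 column.

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
rows = [{'age': 39, 'workclass': 'State-gov'},
        {'age': 50, 'workclass': 'Private'}]
print vec.fit_transform(rows).toarray()
# [[ 39.   0.   1.]
#  [ 50.   1.   0.]]
print vec.get_feature_names()
# ['age', 'workclass=Private', 'workclass=State-gov']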

When I first began researching Python for data analysis, I was doing a broad search for tools, products, and libraries that might replace parts of the SQL + relational database ecosystem. This includes tools that fall under the NoSQL umbrella, like Cassandra, Hadoop, and Solr, but also some ETL and data wrangling tools, like R, SAS, Informatica, and Python.

If someone specifically wanted to avoid SQL as a language, for instance, there are quite a few options: swap in a different database underneath, use an ORM library so you never have to see the SQL, or use some sort of in-memory structure.

The Python ORM (SQLAlchemy) is quite nice for the purposes of this article, as we can trivially read an entire table into RAM, converting each row into a dictionary:

def extract(tableName, resultXForm, resultColumn):
  print "Loading from database..."

  from sqlalchemy import create_engine, MetaData, Table
  from sqlalchemy.sql import select
  engine = create_engine(
                "postgresql+pg8000://postgres:postgres@localhost/pacer",
                isolation_level="READ UNCOMMITTED"
            )
  conn = engine.connect()
  meta = MetaData()

  table = Table(tableName, meta, autoload=True, autoload_with=engine)

  # Convert a row tuple into {column: value}, leaving out the column we predict
  def toDict(row, columns):
    return {columns[i]: x for i, x in enumerate(row) \
      if columns[i] != resultColumn}

  s = select([table])
  result = conn.execute(s)

  columns = [c.name for c in table.columns]
  resultIdx = columns.index(resultColumn)

  # Split each row into its feature dictionary and its (transformed) class label
  data = [list(row) for row in result]
  split = [toDict(row, columns) for row in data]
  classes = [resultXForm(row[resultIdx]) for row in data]
  return (split, classes)

There’s lots of code out there that looks like this, and it’s kind of kludgy, because eventually you’ll have so much data that you’ll wish you could switch algorithms as it grows, while serializing infrequently used structures to disk… which is exactly what relational databases give you.

One of the key selling points of the scientific Python stack is NumPy, which bridges part of that gap (if you read “NumPy Internals”, it sounds an awful lot like something out of a database textbook).

This internal storage, then, is the reason for the issue with trees mentioned above: internally they use one massive array of floats, with NumPy underneath, so that they can be really fast on large datasets.
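You can see this on any fitted classifier: the learned tree is exposed as parallel NumPy arrays, one entry per node (a quick illustration, where clf is a fitted DecisionTreeClassifier as in runTests above):

print type(clf.tree_.threshold)   # <type 'numpy.ndarray'>
print clf.tree_.feature[:5]       # index of the feature tested at each node
print clf.tree_.threshold[:5]     # numeric threshold used for each split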

And with that, and a little bit of effort, we have a script of almost exactly two hundred lines of code, allowing us to harness hundreds of thousands of engineering and grad student hours to find patterns out in the world.
