
Decision Tree Testing Lessons

I’m running some tests on sklearn decision trees1, and the lessons learned so far may be interesting.

I’ve put my measurement code at the end – I’m tracking the percent correct, the number of positive and negative predictions, and the false positives and negatives.

  • When running predictions, if you have a defect where you include the ‘answer’ column in the test columns, the measurement code below gives you a division by zero (every prediction is correct, so there are no false positives or negatives), which is a good check.
  • For my data, running with criterion=’entropy’2 gives a 2% increase in accuracy, but other people I talked to on Twitter have seen the opposite
  • criterion=’entropy’ is noticeably slower than the default (‘gini’)
  • The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points)
  • For my use case, I found that limiting the depth of trees and forcing each node to have a large number of samples (50-500) made much simpler trees with only a small decrease in accuracy.
  • In forcing nodes to have more samples, the accuracy decreased ~0-5%, roughly along the range of how many samples were included at each node (50-500)
  • I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, originally I had ID columns, which let sklearn pick up data created in a certain time window (since the IDs are sequential), but I don’t think this is useful for what I want to do.
  • You have to turn class-based attribute values into integers (sklearn appears to use a numpy float class internally for performance reasons)
  • sklearn appears to only use range-based rules. Combine this with the above and you get a lot of rules like “status > 1.5”
  • The tree could conceivably generate equality conditions within the structure, although it’d be hard to tell (e.g. “status > 1.5” and “status < 2.5” together would be equivalent to “status = 2” if status is an integer)
  • I’m more interested in discovering useful rules than in future predictions; it helps a lot to generate JSON3
  • Within the JSON, the “entropy” and “impurity” fields show you how clean the rule is (0 = good). The “value” field shows how many items fit the rule (small numbers are probably not useful, at least for me)
    testsRun = 0
    testsPassed = 0
    testsFalseNegative = 0
    testsFalsePositive = 0
    testsPositive = 0
    testsNegative = 0
    for t in test:
      prediction = clf.predict([t])[0]
      if prediction == 0:
        testsNegative = testsNegative + 1
      else:
        testsPositive = testsPositive + 1

      if prediction == test_v[testsRun]:
        testsPassed = testsPassed + 1
      else:
        # a wrong 0 is a false negative; a wrong 1 is a false positive
        if prediction == 0:
          testsFalseNegative = testsFalseNegative + 1
        else:
          testsFalsePositive = testsFalsePositive + 1

      testsRun = testsRun + 1

    # 100.0 forces float division so the percentages aren't truncated
    print "Percent Pass: {0}".format(100.0 * testsPassed / testsRun)
    print "Percent Positive: {0}".format(100.0 * testsPositive / testsRun)
    print "Percent Negative: {0}".format(100.0 * testsNegative / testsRun)
    print "Percent False positive: {0}".format(100.0 * testsFalsePositive / (testsFalsePositive + testsFalseNegative))
    print "Percent False negative: {0}".format(100.0 * testsFalseNegative / (testsFalsePositive + testsFalseNegative))
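The tuning described in the bullets above – switching the split criterion to entropy and constraining tree size with a depth limit and a minimum sample count per leaf – can be sketched roughly as follows. This uses a synthetic dataset from make_classification as a stand-in, since my actual data came from a database:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in my case this came from Postgres
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Default settings: the tree grows until leaves are pure, which
# can mean very deep trees (~20k nodes for ~100k rows in my data)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained: entropy criterion, limited depth, and each leaf
# must hold at least 50 samples -- a much simpler tree
simple = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                min_samples_leaf=50,
                                random_state=0).fit(X, y)

print(deep.tree_.node_count, simple.tree_.node_count)
```

Comparing node_count on the two fitted trees shows the size difference directly; in my case the accuracy cost of the constrained tree was only a few percent.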
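For the point about turning class-based attribute values into integers, sklearn’s LabelEncoder does this mapping for you; the statuses below are a made-up example:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string-valued column from a database
statuses = ['open', 'closed', 'pending', 'closed']

enc = LabelEncoder()
# Classes are sorted alphabetically: closed=0, open=1, pending=2
codes = enc.fit_transform(statuses)

print(list(codes))                       # integer codes sklearn can split on
print(list(enc.inverse_transform(codes)))  # map back for readable rules
```

Keeping the encoder around matters: a rule like “status > 1.5” only makes sense once you map the codes back to their original labels.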
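sklearn has no built-in JSON export for decision trees, but the fitted tree_ attribute exposes everything needed to build one. Here is a minimal sketch that walks the tree and emits the impurity and value fields discussed above (the iris dataset is just a stand-in):

```python
import json
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

def tree_to_json(tree, node=0):
    # Leaf nodes are marked by a child index of -1
    if tree.children_left[node] == -1:
        return {'impurity': float(tree.impurity[node]),
                'value': tree.value[node].tolist()}
    return {'feature': int(tree.feature[node]),
            'threshold': float(tree.threshold[node]),
            'impurity': float(tree.impurity[node]),
            'value': tree.value[node].tolist(),
            'left': tree_to_json(tree, tree.children_left[node]),
            'right': tree_to_json(tree, tree.children_right[node])}

print(json.dumps(tree_to_json(clf.tree_), indent=2))
```

Scanning the resulting JSON for nodes with impurity near 0 and a large value count is a quick way to surface the rules worth keeping.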
Citations:
1. http://www.garysieling.com/blog/building-decision-tree-python-postgres-data
2. http://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
3. http://www.garysieling.com/blog/convert-scikit-learn-decision-trees-json
