I’m running some tests on sklearn decision trees1, and the lessons learned so far may be interesting.
I’ve put my measurement code at the end – I’m tracking % correct, number of tests that are positive, negative, and false positives and negatives.
When running predictions, if you have a defect where you include the ‘answer’ column in the test columns, the measurement code gives you a division by zero, which is a good check.
For my data, when I run with criterion=’entropy’2 I get a 2% increase in accuracy, but other people I talked to on Twitter have had the opposite result.
criterion=’entropy’ is noticeably slower than the default (‘gini’)
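A minimal sketch of how the two criteria could be compared side by side. It uses synthetic data from make_classification as a stand-in for my real data (an assumption), so the accuracy and timing numbers it prints say nothing about any particular dataset; the variable names are just for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in your own feature matrix and labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[criterion] = {
        "accuracy": clf.score(X_test, y_test),  # fraction correct on held-out rows
        "fit_seconds": elapsed,
    }
    print(criterion, results[criterion])
```

On a dataset this small the timing difference may be lost in noise, so it is worth repeating the fit several times before drawing conclusions.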
The default decision tree settings create very large trees (~20k nodes for ~100k data points).
For my use case, I found that limiting the depth of trees and forcing each node to have a large number of samples (50-500) made much simpler trees with only a small decrease in accuracy.
Forcing nodes to have more samples decreased accuracy by roughly 0-5%, with the loss growing roughly in line with the minimum number of samples required at each node (50-500).
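The size limits described above can be sketched roughly like this. I am assuming max_depth and min_samples_leaf are the relevant settings (min_samples_split is another option), and the data is synthetic, so the specific values 6 and 50 are illustrative rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the data discussed in this post.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default settings: the tree grows until its leaves are pure.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: capped depth and at least 50 samples in every leaf.
small_tree = DecisionTreeClassifier(
    max_depth=6, min_samples_leaf=50, random_state=0
).fit(X_train, y_train)

for name, clf in (("default", default_tree), ("constrained", small_tree)):
    print(
        name,
        "nodes:", clf.tree_.node_count,
        "depth:", clf.get_depth(),
        "accuracy:", round(clf.score(X_test, y_test), 3),
    )
```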
I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, originally I had ID columns, which let sklearn pick up data created in a certain time window (since the IDs are sequential), but I don’t think this is useful for what I want to do.
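One way to strip those columns before training, sketched with pandas. The column names here (id, customer_id, status, amount, answer) are hypothetical, and treating anything ending in _id as an ID column is an assumption about naming, not a rule. The final check also guards against leaving the ‘answer’ column in the features.

```python
import pandas as pd

# Hypothetical table standing in for a database export.
df = pd.DataFrame(
    {
        "id": [101, 102, 103, 104],
        "customer_id": [7, 8, 9, 10],
        "status": [0, 1, 2, 1],
        "amount": [10.0, 25.5, 3.2, 40.0],
        "answer": [0, 1, 0, 1],
    }
)

target_column = "answer"

# Assumed naming convention: "id" itself or anything ending in "_id".
id_columns = [c for c in df.columns if c == "id" or c.endswith("_id")]

X = df.drop(columns=id_columns + [target_column])
y = df[target_column]

# Fail early if the label leaked into the features.
assert target_column not in X.columns
print(list(X.columns))
```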
You have to turn class-based (categorical) attribute values into integers (sklearn appears to use a numpy float type internally for performance reasons).
sklearn appears to only use range-based rules. Combine this with the above and you get a lot of rules like “status > 1.5”.
The tree could conceivably generate equality conditions within the structure, although it’d be hard to tell (e.g. “status > 1.5” and “status < 2.5” together would be equivalent to “status = 2” if status is an integer).
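To see the range rules in action, here is a small sketch with a made-up status attribute. The status names and the integer mapping are hypothetical. The label is true only for the middle value, so the tree has to fence it in with two threshold rules, which is the disguised equality condition described above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical mapping from class-like values to integers.
status_codes = {"new": 0, "open": 1, "closed": 2}

statuses = ["new", "open", "closed"] * 20
X = [[status_codes[s]] for s in statuses]
# Label is 1 only for "open", i.e. status == 1.
y = [1 if s == "open" else 0 for s in statuses]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text prints the learned rules as indented threshold tests.
rules = export_text(clf, feature_names=["status"])
print(rules)
```

The printed rules are thresholds at 0.50 and 1.50 on status, so the only way to read off “status = 1” is to combine the two conditions along one path.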
I’m more interested in discovering useful rules than in future predictions; it helps a lot to generate JSON3
Within the JSON, the “entropy” and “impurity” fields show you how clean the rule is (0 = good). The “value” field shows how many items fit the rule (small numbers are probably not useful, at least for me).
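A rough sketch of one way to dump a fitted tree to JSON by walking the arrays on clf.tree_. This is not the exporter referenced above: the function name tree_to_dict and the field names "rule", "samples", "left" and "right" are my own choices for illustration, and I use n_node_samples for the per-node count because the contents of the tree's value array differ between sklearn versions. The iris dataset is only a stand-in.

```python
import json

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def tree_to_dict(clf, feature_names, node_id=0):
    """Recursively convert one node of a fitted tree into a plain dict."""
    tree = clf.tree_
    node = {
        # 0.0 means every sample reaching this node has the same class.
        "impurity": float(tree.impurity[node_id]),
        # Number of training samples that reached this node.
        "samples": int(tree.n_node_samples[node_id]),
    }
    left = int(tree.children_left[node_id])
    right = int(tree.children_right[node_id])
    if left == right:  # leaves store -1 for both children
        return node
    name = feature_names[tree.feature[node_id]]
    threshold = float(tree.threshold[node_id])
    node["rule"] = f"{name} <= {threshold:.2f}"
    node["left"] = tree_to_dict(clf, feature_names, left)    # rule is true
    node["right"] = tree_to_dict(clf, feature_names, right)  # rule is false
    return node


iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

tree_json = json.dumps(tree_to_dict(clf, list(iris.feature_names)), indent=2)
print(tree_json)
```

Filtering the resulting structure for nodes with low impurity and a large sample count is then a plain dictionary walk.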