{"id":2105,"date":"2014-03-03T13:54:12","date_gmt":"2014-03-03T13:54:12","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2105"},"modified":"2014-03-03T13:54:12","modified_gmt":"2014-03-03T13:54:12","slug":"decision-tree-testing-lessons","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/decision-tree-testing-lessons\/","title":{"rendered":"Decision Tree Testing Lessons"},"content":{"rendered":"<p>I&#8217;m running some tests on sklearn decision trees<sup><a href=\"#footnote_0_2105\" id=\"identifier_0_2105\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/www.garysieling.com\/blog\/building-decision-tree-python-postgres-data\">1<\/a><\/sup>, and the lessons learned so far may be interesting.<\/p>\n<p>I&#8217;ve put my measurement code at the end &#8211; I&#8217;m tracking % correct, number of tests that are positive, negative, and false positives and negatives.<\/p>\n<ul>\n<li>When running predictions, if you have a defect where you include the &#8216;answer&#8217; column in the test columns, the above code gives you a division by zero, which is a good check.<\/li>\n<li>For my data, when I run with criterion=&#8217;entropy&#8217;<sup><a href=\"#footnote_1_2105\" id=\"identifier_1_2105\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/www.garysieling.com\/blog\/sklearn-gini-vs-entropy-criteria\">2<\/a><\/sup> I get 2% increase in accuracy, but other people I talked to on twitter have had the opposite<\/li>\n<li>criterion=&#8217;entropy&#8217; is noticeably slower than the default (&#8216;gini&#8217;)<\/li>\n<li>The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points)<\/li>\n<li>For my use case, I found that limiting the depth of trees and forcing each node to have a large number of samples (50-500) made much simpler trees with only a small decrease in accuracy.<\/li>\n<li>In forcing nodes to have more samples, the accuracy decreased ~0-5%, roughly along the range of how many samples were included at each node (50-500)<\/li>\n<li>I found that I needed to remove a lot of my database columns to get a meaningful result. For instance originally I had ID columns, which lets sklearn pick up data created in a certain time window (since the IDs are sequential) but I don&#8217;t think this is useful for what I want to do.<\/li>\n<li>You have to turn class based attribute values into integers (it appears to be using a numpy float class internally for performance reasons)\n<li>SKLearn appears to only use range based rules. Combine this with the above and you get a lot of rules like &#8220;status > 1.5&#8221;<\/li>\n<li>The tree could conceivably generate equality conditions within the structure, although it&#8217;d be hard to tell (e.g. &#8220;status > 1.5&#8221;, &#8220;status < 2.5\" would be equivalent to \"status = 2\" if status is an integer)<\/li>\n<li>I&#8217;m more interesting in discovering useful rules than in future predictions; it helps a lot to generate JSON<sup><a href=\"#footnote_2_2105\" id=\"identifier_2_2105\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/www.garysieling.com\/blog\/convert-scikit-learn-decision-trees-json\">3<\/a><\/sup><\/li>\n<li>Within the JSON, the &#8220;entropy&#8221; and &#8220;impurity&#8221; field shows you how clean the rule is (0 = good). 
The measurement code:

```python
testsRun = 0
testsPassed = 0
testsFalseNegative = 0
testsFalsePositive = 0
testsPositive = 0
testsNegative = 0

# test is the feature matrix for the holdout set; test_v holds its answers.
for t in test:
    prediction = clf.predict([t])[0]  # predict expects a 2D array
    if prediction == 0:
        testsNegative += 1
    else:
        testsPositive += 1

    if prediction == test_v[testsRun]:
        testsPassed += 1
    elif prediction == 0:
        testsFalseNegative += 1
    else:
        testsFalsePositive += 1

    testsRun += 1

# If every test passes (e.g. the 'answer' column leaked into the features),
# misses is zero and the last two lines divide by zero.
misses = testsFalsePositive + testsFalseNegative
print("Percent Pass: {0}".format(100.0 * testsPassed / testsRun))
print("Percent Positive: {0}".format(100.0 * testsPositive / testsRun))
print("Percent Negative: {0}".format(100.0 * testsNegative / testsRun))
print("Percent False positive: {0}".format(100.0 * testsFalsePositive / misses))
print("Percent False negative: {0}".format(100.0 * testsFalseNegative / misses))
```

Footnotes:

1. http://www.garysieling.com/blog/building-decision-tree-python-postgres-data
2. http://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
3. http://www.garysieling.com/blog/convert-scikit-learn-decision-trees-json