{"id":2094,"date":"2014-03-02T13:55:57","date_gmt":"2014-03-02T13:55:57","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2094"},"modified":"2020-03-31T00:46:31","modified_gmt":"2020-03-31T00:46:31","slug":"sklearn-gini-vs-entropy-criteria","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/sklearn-gini-vs-entropy-criteria\/","title":{"rendered":"Decision Trees: &#8220;Gini&#8221; vs. &#8220;Entropy&#8221; criteria"},"content":{"rendered":"<p>The scikit-learn documentation<sup><a href=\"#footnote_0_2094\" id=\"identifier_0_2094\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">1<\/a><\/sup> has an argument to control how the decision tree algorithm splits nodes:<\/p>\n<pre>\ncriterion : string, optional (default=\u201dgini\u201d)\nThe function to measure the quality of a split. \nSupported criteria are \u201cgini\u201d for the Gini impurity \nand \u201centropy\u201d for the information gain.\n<\/pre>\n<p>It seems like something that could be important since this determines the formula used to partition your dataset at each point in the dataset. <\/p>\n<p>Unfortunately the documentation tells you nothing about what you should use, other than trying each to see what happens, so here&#8217;s what I found (spoiler: it doesn&#8217;t appear to matter):<\/p>\n<ul>\n<li>Gini is intended for continuous attributes, and Entropy for attributes that occur in classes (e.g. colors<sup><a href=\"#footnote_1_2094\" id=\"identifier_1_2094\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/paginas.fe.up.pt\/~ec\/files_1011\/week%2008%20-%20Decision%20Trees.pdf\">2<\/a><\/sup><\/li>\n<li>&#8220;Gini&#8221; will tend to find the largest class, and &#8220;entropy&#8221; tends to find groups of classes that make up ~50% of the data((http:\/\/paginas.fe.up.pt\/~ec\/files_1011\/week%2008%20-%20Decision%20Trees.pdf))<\/li>\n<li>&#8220;Gini&#8221; to minimize misclassification<sup><a href=\"#footnote_2_2094\" id=\"identifier_2_2094\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/www.quora.com\/Machine-Learning\/Are-gini-index-entropy-or-classification-error-measures-causing-any-difference-on-Decision-Tree-classification\">3<\/a><\/sup><\/li>\n<li>&#8220;Entropy&#8221; for exploratory analysis<sup><a href=\"#footnote_2_2094\" id=\"identifier_3_2094\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/www.quora.com\/Machine-Learning\/Are-gini-index-entropy-or-classification-error-measures-causing-any-difference-on-Decision-Tree-classification\">3<\/a><\/sup><\/li>\n<li>Some studies show this doesn&#8217;t matter &#8211; these differ less than 2% of the time<sup><a href=\"#footnote_3_2094\" id=\"identifier_4_2094\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/rapid-i.com\/rapidforum\/index.php?topic=3060.0\">4<\/a><\/sup><\/li>\n<\/li>\n<li>Entropy may be a little slower to compute<sup><a href=\"#footnote_4_2094\" id=\"identifier_5_2094\" class=\"footnote-link footnote-identifier-link\" title=\"http:\/\/stats.stackexchange.com\/questions\/19639\/which-is-a-better-cost-function-for-a-random-forest-tree-gini-index-or-entropy\">5<\/a><\/sup><\/li>\n<\/ul>\n<ol class=\"footnotes\"><li id=\"footnote_0_2094\" class=\"footnote\">http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_0_2094\" 
class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_1_2094\" class=\"footnote\">http:\/\/paginas.fe.up.pt\/~ec\/files_1011\/week%2008%20-%20Decision%20Trees.pdf<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_1_2094\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_2_2094\" class=\"footnote\">http:\/\/www.quora.com\/Machine-Learning\/Are-gini-index-entropy-or-classification-error-measures-causing-any-difference-on-Decision-Tree-classification<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_2_2094\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>] [<a href=\"#identifier_3_2094\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_3_2094\" class=\"footnote\">https:\/\/rapid-i.com\/rapidforum\/index.php?topic=3060.0<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_4_2094\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><li id=\"footnote_4_2094\" class=\"footnote\">http:\/\/stats.stackexchange.com\/questions\/19639\/which-is-a-better-cost-function-for-a-random-forest-tree-gini-index-or-entropy<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_5_2094\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>The scikit-learn documentation1 has an argument to control how the decision tree algorithm splits nodes: criterion : string, optional (default=\u201dgini\u201d) The function to measure the quality of a split. Supported criteria are \u201cgini\u201d for the Gini impurity and \u201centropy\u201d for the information gain. It seems like something that could be important since this determines the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/sklearn-gini-vs-entropy-criteria\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Decision Trees: &#8220;Gini&#8221; vs. 