Decision Trees: “Gini” vs. “Entropy” criteria

The scikit-learn documentation [1] describes an argument that controls how the decision tree algorithm splits nodes:

criterion : string, optional (default="gini")
    The function to measure the quality of a split.
    Supported criteria are "gini" for the Gini impurity
    and "entropy" for the information gain.
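
For concreteness, here is a minimal sketch of passing that argument; the iris dataset is an assumed example, not something from the original post:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()

    # The only difference between these two models is the split criterion.
    gini_tree = DecisionTreeClassifier(criterion="gini").fit(iris.data, iris.target)
    entropy_tree = DecisionTreeClassifier(criterion="entropy").fit(iris.data, iris.target)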

This seems like it could be important, since it determines the formula used to evaluate candidate splits at each node of the tree.
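
Concretely, if p_i is the proportion of class i at a node, the Gini impurity is 1 - sum(p_i^2) and the entropy is -sum(p_i * log2(p_i)). A quick sketch of both measures (numpy assumed):

    import numpy as np

    def gini(p):
        # Gini impurity: 1 minus the sum of squared class proportions.
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        # Entropy: -sum(p_i * log2(p_i)), using the 0 * log(0) = 0 convention.
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # both maximal for a 50/50 split
    print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # both zero for a pure node

Both measures are zero for a pure node and largest for an even class mix; one practical difference is that Gini avoids the logarithm, so it is slightly cheaper to compute.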

Unfortunately, the documentation tells you nothing about which one you should use, other than trying each to see what happens, so here's what I found (spoiler: it doesn't appear to matter).
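
As a rough way to check this for yourself, here is a sketch that cross-validates a tree under each criterion; the iris dataset and 10-fold CV are assumptions for illustration, not the setup from the original experiment:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()

    for criterion in ("gini", "entropy"):
        clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
        scores = cross_val_score(clf, iris.data, iris.target, cv=10)
        print(criterion, scores.mean())

On datasets like this, the two criteria tend to produce very similar trees and scores, which matches the spoiler above.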

  1. http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
  2. http://paginas.fe.up.pt/~ec/files_1011/week%2008%20-%20Decision%20Trees.pdf
  3. http://www.quora.com/Machine-Learning/Are-gini-index-entropy-or-classification-error-measures-causing-any-difference-on-Decision-Tree-classification
  4. https://rapid-i.com/rapidforum/index.php?topic=3060.0
  5. http://stats.stackexchange.com/questions/19639/which-is-a-better-cost-function-for-a-random-forest-tree-gini-index-or-entropy