DecisionTreeClassifier behaviour when there are 2 or more best splitters (a tie among splitters)  #12259

@GMarzinotto

Description

My issue has been discussed in the past in issues #2386 and #8443.

It concerns DecisionTreeClassifier and its splitter parameter. In the docs this parameter has two possible values: splitter="best" and splitter="random".

The current behaviour when splitter="best" is to shuffle the features at each step and take the best feature to split on. In case of a tie, a random one among the tied features is taken.

In some cases, there is a prior on feature importance or on the ease of interpretation.

For example, some features can be noisy (due to possible data errors or any other cause) and shouldn't be used for splitting unless no 'good quality' feature is available that does the job. Similarly, some features can be easy to understand while others are obscure and shouldn't be used for splitting when it can be avoided.

The current implementation doesn't allow expressing this kind of prior, so when two features tie as the best splitter one gets chosen at random. I believe supporting this could be an important improvement, especially when the trees have no max_depth, because they tend to overfit on random features while there may be "better" features (in terms of the prior) that do the same job.

Having an option such as splitter="best_no_shuffle" would allow the user to provide the features in order of importance; then, when two or more features tie as the best splitters, the splitter would systematically choose the feature with the lowest index. This was once proposed here
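As a rough illustration of the proposed tie-breaking rule (a pure-Python sketch with made-up scores, not the actual Cython splitter code; the function name is hypothetical): scanning candidate scores with a strict comparison keeps the earliest feature whenever scores tie.

```python
def best_split_no_shuffle(scores):
    """Return the index of the best-scoring feature; ties go to the lowest index.

    `scores` is a list of impurity decreases, one per feature, in the
    user-provided feature order (illustrative only).
    """
    best_idx = 0
    for i, s in enumerate(scores):
        if s > scores[best_idx]:  # strict '>' keeps the earliest feature on ties
            best_idx = i
    return best_idx


scores = [0.40, 0.55, 0.55, 0.10]  # features 1 and 2 tie for best
print(best_split_no_shuffle(scores))  # -> 1, never 2
```

With shuffling, either of the tied features could be picked; without it, the choice is deterministic and follows the column order the user supplied.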

If you believe this could be a valuable improvement, I can try to implement it.
Or maybe you know a better workaround that solves my problem!
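For context, the standard workaround for the non-determinism itself (though not for encoding a feature prior) is to fix random_state: two trees built with the same seed and data are identical. A minimal check:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same seed -> same shuffling -> same tie-breaking -> identical trees
a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)
print((a.predict(X) != b.predict(X)).sum())  # -> 0
```

This makes results reproducible, but the tie-breaking is still arbitrary with respect to any prior the user has about the features.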

Steps/Code to Reproduce

I have recovered the code from #8443 because it is simple and illustrates the issue.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

# Two identical classifiers that differ only in their random seed
dtc1 = DecisionTreeClassifier(random_state=1)
dtc2 = DecisionTreeClassifier(random_state=2)

# Random ~75% train mask
rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75

dtc1.fit(x[itr], y[itr])
dtc2.fit(x[itr], y[itr])

# Count held-out samples where the two trees disagree
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())

Expected Results

Should print 0 (the two trees should make identical predictions).

Actual Results

Prints 1

Versions

System

machine: Linux-4.4.0-135-generic-x86_64-with-Ubuntu-16.04-xenial
executable: /usr/bin/python3
python: 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]

BLAS

cblas_libs: openblas, openblas
macros: HAVE_CBLAS=None
lib_dirs: /usr/lib

Python deps

pip: 9.0.1
setuptools: 36.7.2
sklearn: 0.21.dev0
numpy: 1.13.3
scipy: 0.19.0
Cython: 0.26
pandas: 0.20.3
