Description
My issue has been discussed in the past in issues #2386 and #8443. It concerns `DecisionTreeClassifier` and its `splitter` parameter. In the docs this parameter has two possible values: `splitter="best"` and `splitter="random"`.
The current behaviour with `splitter="best"` is to shuffle the features at each step and take the best feature to split on. In case of a tie, a random one is taken.
In some cases, there is a prior on the feature importance or on the ease of interpretation.
For example, some features can be noisy (due to possible data errors or any other cause) and shouldn't be used for splitting unless there is no 'good quality' feature available that does the job. Similarly, some features can be easy to understand and some others can be obscure and shouldn't be used for splitting when possible.
The current implementation doesn't allow encoding this kind of prior: when two features tie as the best splitter, one gets chosen at random. I believe supporting this could be an important improvement, especially when the trees have no `max_depth`, because they tend to over-fit on randomly chosen features while there may be some "better" features (in terms of the prior) that do the job.
Having an option such as `splitter="best_no_shuffle"` would allow the user to provide the features in order of importance; then, when two or more features tie as the best splitters, the tree would systematically choose the one with the lowest index. This was once proposed here.
If you believe this can be a valuable improvement, I can try to implement it. Or maybe you know of a better workaround that solves my problem!
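To make the proposal concrete, here is a minimal sketch of what lowest-index tie-breaking could look like. This is a hypothetical pure-Python illustration (the names `gini` and `best_feature_no_shuffle` are mine, not scikit-learn API; the real splitter is implemented in Cython): scanning features in a fixed order with a strict `>` comparison means ties automatically go to the lowest index.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_feature_no_shuffle(X, y):
    """Return the index of the best feature to split on, scanning
    features in their given order so that ties go to the lowest index."""
    n_samples, n_features = X.shape
    best_gain, best_feat = -np.inf, None
    for j in range(n_features):  # fixed order: no shuffling
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = gini(y) - (len(left) * gini(left)
                              + len(right) * gini(right)) / n_samples
            if gain > best_gain:  # strict '>' keeps the lowest index on ties
                best_gain, best_feat = gain, j
    return best_feat

# Feature 1 is an exact copy of feature 0, so every split ties;
# the deterministic rule always picks feature 0.
X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])
print(best_feature_no_shuffle(X, y))  # 0
```

With the user ordering the columns of `X` by their prior importance, this rule prefers the trusted features whenever impurity alone cannot decide.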
Steps/Code to Reproduce
I have recovered the code from #8443 because it is simple and illustrates the issue.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

dtc1 = DecisionTreeClassifier(random_state=1)
dtc2 = DecisionTreeClassifier(random_state=2)

rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75

dtc1.fit(x[itr], y[itr])
dtc2.fit(x[itr], y[itr])
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())
```
Expected Results
Should print 0
Actual Results
Prints 1
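As a sanity check that the divergence comes only from the per-estimator tie-breaking RNG (an observation, not a fix): with identical `random_state` values the two trees make exactly the same choices and agree on every prediction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Same random_state -> same feature shuffling and tie-breaking,
# hence identical trees and identical predictions.
a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)
print((a.predict(X) != b.predict(X)).sum())  # 0
```

This shows the fitted model depends on the seed only through tie-breaking, which is exactly what a deterministic rule would remove.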
Versions
```
System:
    machine: Linux-4.4.0-135-generic-x86_64-with-Ubuntu-16.04-xenial
    executable: /usr/bin/python3
    python: 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]

BLAS:
    cblas_libs: openblas, openblas
    macros: HAVE_CBLAS=None
    lib_dirs: /usr/lib

Python deps:
    scipy: 0.19.0
    sklearn: 0.21.dev0
    pandas: 0.20.3
    numpy: 1.13.3
    Cython: 0.26
    setuptools: 36.7.2
    pip: 9.0.1
```