
[WIP] priority of features in decision to splitting #12364


Open
wants to merge 7 commits into main

Conversation

GMarzinotto
Contributor

Reference Issues/PRs

Fixes #12259 . See also issues #2386 , #8443 and #12188

What does this implement/fix? Explain your changes.

Basically, scikit-learn implements decision trees with two different types of splitters: a RandomSplitter and a BestSplitter. I focus on BestSplitter, which currently has two flaws:

  1. It is random (it produces different outputs with the default configuration). This happens when there are two or more 'best' features with equal criterion values.
  2. It doesn't allow giving priors on the features, for example when there is a tie. It would be a nice-to-have if we could choose a feature that we prefer, either because it is easier to compute, less noisy, or easier to interpret. Ideally this priority would be encoded in the column index of the feature: a lower index means a higher priority.

We have a supplementary constraint, which is shuffling. Currently there is a max_features parameter that allows us not to test all the features at each node when there are many of them. This is a nice feature, and I believe we should always shuffle. The problem is that shuffling means non-deterministic behaviour... (a minimal illustration of max_features is below).
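
For context, here is a minimal illustration of the max_features setting (the concrete values are only an example, not part of this change):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With max_features=2 (iris has 4 features), only a random subset of the
# features is evaluated at each node, so the chosen splits depend on
# random_state.
subset_tree = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)

# With the default max_features=None, every feature is evaluated at every
# node; this is the configuration the change aims to make deterministic.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)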

So we have:

  • Sorted / Unsorted -> whether we give priority to lower-index features if and only if there is a tie.
  • Deterministic / Non-deterministic -> whether we always obtain the same result, at least in the default configuration.
  • Shuffled / Looped / No -> how we visit the features when we try them: we can shuffle them randomly, loop through them, or test them in order (from 0 to N). This matters only if max_features < n_features or if we do not sort the features.

So for me the ideal is to have a prior on the features (Sorted: a lower index means I prefer the feature).
We need to Shuffle, or at least Loop, in order to support max_features, which is often useful and necessary for sparse cases.

It should also be deterministic in the default configuration, so if we want this we have the following possibilities:

  • we can loop instead of shuffling: this would make random_state useless here, so it is not viable;
  • we can shuffle and sort the features to the best of our knowledge: we shuffle, and if two of the tested features are 'best' we keep the lower index. This makes the default configuration, where max_features=n_features, stable regardless of random_state, because we test all the features.

So I implemented a BestSplitter2 that shuffles, is deterministic when max_features=n_features, and gives priority to lower-index features when there is a tie.
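
The tie-break rule itself can be sketched in plain Python as follows (this is only an illustration of the idea; the real change lives in the Cython splitter, and the function name here is made up for the example):

import random

def pick_best_feature(criterion_values, rng=None):
    # Visit the features in a shuffled order (as the splitter does to
    # support max_features), but on an exact tie keep the lower feature
    # index, so the result is deterministic once every feature is visited.
    rng = rng or random.Random()
    order = list(range(len(criterion_values)))
    rng.shuffle(order)

    best_index, best_value = None, None
    for index in order:
        value = criterion_values[index]
        better = best_value is None or value < best_value
        tie_with_lower_index = value == best_value and index < best_index
        if better or tie_with_lower_index:
            best_index, best_value = index, value
    return best_index

# Features 1 and 3 tie on the best value (0.2); the lower index always wins,
# regardless of the shuffle order.
assert all(pick_best_feature([0.5, 0.2, 0.4, 0.2], random.Random(s)) == 1
           for s in range(20))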

Here you can see an example of the previous and the new one:
Previous BestSplitter is on the left and New BestSplitter2 is on the right

Screenshot-from-2018-10-12-11-31-55.png

We observe that BestSplitter2 gives priority to lower-index features.
And when we test stability:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

# Same splitter, two different random states.
dtc1 = DecisionTreeClassifier(random_state=1, splitter='best2')
dtc2 = DecisionTreeClassifier(random_state=2, splitter='best2')

rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75

dtc1.fit(x[itr], y[itr])
dtc2.fit(x[itr], y[itr])

# Count the test predictions that differ between the two trees.
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())

We obtain 0 as expected in #12259

Any other comments?

I don't know what you think, but I believe having many splitters is not a solution either.
Personally, I would replace the current BestSplitter with the one described above, as they are essentially the same but with some improvements.

Thank you very much

Member

@jnothman jnothman left a comment


This is a bigger change than I expected. Is it not possible to reuse the existing code?

@jnothman
Member

This needs tests also

@jnothman jnothman changed the title fix decisionTree issue #12259 [WIP] priority of features in decision to splitting Oct 14, 2018
@GMarzinotto
Contributor Author

Actually, there are that many lines because I created a new splitter type.
But the new splitter type is basically BestSplitter with just one line of code changed.

I updated the pull request; now you can see what I changed right here:
https://github.com/scikit-learn/scikit-learn/pull/12364/files

Basically, I said that the best feature to split on can be updated if two features have the same criterion value and the new feature has a lower index than the previous best feature.

I put my code in a new splitter in case you'd want to test both at the same time.
But I agree there is no need for two 'BestSplitters'.

@GMarzinotto
Contributor Author

GMarzinotto commented Oct 27, 2018

I apologize for the time I've taken with this pull request.
Currently my modifications are failing one test on Travis, in one of the virtual environments:

DISTRIB="ubuntu" PYTHON_VERSION="2.7" CYTHON_VERSION="0.23.5" COVERAGE=true

The test is

____________________ test_gradient_boosting_early_stopping _____________________

def test_gradient_boosting_early_stopping():
    X, y = make_classification(n_samples=1000, random_state=0)

    gbc = GradientBoostingClassifier(n_estimators=1000,
                                     n_iter_no_change=10,
                                     learning_rate=0.1, max_depth=3,
                                     random_state=42)

    gbr = GradientBoostingRegressor(n_estimators=1000, n_iter_no_change=10,
                                    learning_rate=0.1, max_depth=3,
                                    random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=42)
    # Check if early_stopping works as expected
    for est, tol, early_stop_n_estimators in ((gbc, 1e-1, 24), (gbr, 1e-1, 13),
                                              (gbc, 1e-3, 36),
                                              (gbr, 1e-3, 28)):
        est.set_params(tol=tol)
        est.fit(X_train, y_train)
        assert_equal(est.n_estimators_, early_stop_n_estimators)

E AssertionError: 39 != 36

I have had a hard time trying to debug this, because I'm not able to install scikit-learn using Cython==0.23.5 as Travis does.

I get the following error

cc1: some warnings being treated as errors
error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/home/gabriel/TEST_SKLEARN_FORK/local/lib/python2.7/site-packages/numpy/core/include -I/home/gabriel/TEST_SKLEARN_FORK/local/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o -MMD -MF build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o.d" failed with exit status 1

Would you have any idea of why this could be happening?

My Python version is 2.7.12
pip freeze gives:

atomicwrites==1.2.1
attrs==18.2.0
certifi==2018.10.15
chardet==3.0.4
codecov==2.0.15
coverage==4.5.1
Cython==0.23.5
funcsigs==1.0.2
idna==2.7
more-itertools==4.3.0
numpy==1.16.0.dev0+c0d4c74
pandas==0.24.0.dev0+824.g437f31cb6
pathlib2==2.3.2
pkg-resources==0.0.0
pluggy==0.8.0
py==1.7.0
pytest==3.9.2
pytest-cov==2.6.0
python-dateutil==2.7.4
pytz==2018.6
requests==2.20.0
scandir==1.9.0
scipy==1.2.0.dev0+3516199
six==1.11.0
urllib3==1.24

As I haven't been able to debug this, I would like to ask if you have any insight into why this test could be failing here when it passes in all the other virtual environments.

@amueller
Member

amueller commented Oct 29, 2018

This is quite weird. Please add a regression test for your change. The test that's failing seems to be hard-coding results; we should check where they came from. Having different results on different architectures is not impossible, but it is a bit surprising.

@Superhzf

Superhzf commented Oct 19, 2020

Hi @GMarzinotto, any progress on this PR?

@WDee

WDee commented Oct 29, 2020

Interested to hear if this issue has been resolved; it would be very useful for my research.

@GMarzinotto
Contributor Author

Hello!
So I have updated the problematic test.
Actually, in the meantime, the hard-coded test that was failing had already been modified,
so I just changed the random seed and hardcoded some new values.

I also updated a few parts of the automatically generated documentation.
Performance never degrades, but the structures of the trees do change (this is normal and works as expected).

Please let me know if there are any questions or remarks!

@GMarzinotto
Contributor Author

I would like to add a suggestion for the test

test_gradient_boosting_early_stopping

I believe that, instead of hardcoding the results, one should:

  • check that the number of estimators increases as the tolerance decreases,
  • check that the number of estimators is smaller than the n_estimators parameter,
  • check that performance remains above a threshold.

That way the test will continue to validate early stopping and be robust to changes in random states/platforms. A rough sketch of such a test follows below.
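
For illustration, here is a minimal sketch of what such a hardcoding-free test could look like (only an illustration of the suggestion above, not the actual scikit-learn test; the accuracy threshold of 0.7 is an assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

n_estimators_seen = []
for tol in (1e-1, 1e-3):  # larger tolerance first
    gbc = GradientBoostingClassifier(n_estimators=1000, n_iter_no_change=10,
                                     learning_rate=0.1, max_depth=3,
                                     random_state=42, tol=tol)
    gbc.fit(X_train, y_train)
    # Early stopping must kick in before the n_estimators budget is exhausted.
    assert gbc.n_estimators_ < 1000
    # Performance should stay above a (loose, assumed) threshold.
    assert gbc.score(X_test, y_test) > 0.7
    n_estimators_seen.append(gbc.n_estimators_)

# A smaller tolerance should not stop earlier than a larger one.
assert n_estimators_seen[0] <= n_estimators_seen[1]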

@WDee

WDee commented Nov 2, 2020

Thanks. I am super excited to test this out; it might be very useful for some applied genomic research that I am doing. This might be a stupid or beginner's question, as I have never done this before, but what is the easiest way to install this changed version? Should I follow e.g. https://scikit-learn.org/stable/developers/advanced_installation.html ? Many thanks in advance!

@GMarzinotto GMarzinotto requested a review from jnothman December 14, 2020 17:36
Base automatically changed from master to main January 22, 2021 10:50
@jmradecka

Hi, what's the current status of this PR? Is it going to be merged?

Development

Successfully merging this pull request may close these issues.

DecisionTreeClassifier behaviour when there are 2 or more best splitter (a tie among splitters)
8 participants