
[WIP] priority of features in decision to splitting #12364


Open
wants to merge 7 commits into main

Conversation

GMarzinotto
Contributor

Reference Issues/PRs

Fixes #12259 . See also issues #2386 , #8443 and #12188

What does this implement/fix? Explain your changes.

Basically, scikit-learn implements decision trees with two different types of splitters: a RandomSplitter and a BestSplitter. I focus on BestSplitter, which currently has two flaws:

  1. It is random (it produces different outputs with the default configuration). This happens when there are two or more 'best' features with equal criterion values.
  2. It doesn't allow giving priors on the features, for example when there is a tie. It would be a nice-to-have if we could choose a feature that we prefer, either because it is easier to compute, less noisy, or easier to interpret. Ideally this priority would be encoded in the column index of the feature: a lower index means a higher priority.

We have a supplementary constraint, which is shuffling. Currently there is a max_features parameter that allows us not to test all the features at each node when there are many of them. This is a nice feature, and I believe we should always shuffle. The problem is that shuffling means non-deterministic behaviour... (a minimal illustration of max_features is below).
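
For context, here is a minimal illustration of the max_features setting (the concrete values are only an example, not part of this change):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With max_features=2 (iris has 4 features), only a random subset of the
# features is evaluated at each node, so the chosen splits depend on
# random_state.
subset_tree = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)

# With the default max_features=None, every feature is evaluated at every
# node; this is the configuration the change aims to make deterministic.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)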

So we have:

  • Sorted / Unsorted -> whether we give priority to lower-index features if and only if there is a tie.
  • Deterministic / Non-deterministic -> whether we always obtain the same result, at least in the default configuration.
  • Shuffled / Looped / No -> how we visit the features when we try them: we can shuffle them randomly, loop through them, or test them in order (from 0 to N). This matters only if max_features < n_features or if we do not sort the features.

So for me the ideal is to have a prior on the features (Sorted: a lower index means I prefer the feature).
We need to Shuffle, or at least Loop, in order to support max_features, which is often useful and necessary for sparse cases.

It should also be deterministic in the default configuration, so if we want this we have the following possibilities:

  • we can loop instead of shuffling: this would make random_state useless here, so it is not viable;
  • we can shuffle and sort the features to the best of our knowledge: we shuffle, and if two of the tested features are 'best' we keep the lower index. This makes the default configuration, where max_features=n_features, stable regardless of random_state, because we test all the features.

So I implemented a BestSplitter2 that shuffles, is deterministic when max_features=n_features, and gives priority to lower-index features when there is a tie.
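
The tie-break rule itself can be sketched in plain Python as follows (this is only an illustration of the idea; the real change lives in the Cython splitter, and the function name here is made up for the example):

import random

def pick_best_feature(criterion_values, rng=None):
    # Visit the features in a shuffled order (as the splitter does to
    # support max_features), but on an exact tie keep the lower feature
    # index, so the result is deterministic once every feature is visited.
    rng = rng or random.Random()
    order = list(range(len(criterion_values)))
    rng.shuffle(order)

    best_index, best_value = None, None
    for index in order:
        value = criterion_values[index]
        better = best_value is None or value < best_value
        tie_with_lower_index = value == best_value and index < best_index
        if better or tie_with_lower_index:
            best_index, best_value = index, value
    return best_index

# Features 1 and 3 tie on the best value (0.2); the lower index always wins,
# regardless of the shuffle order.
assert all(pick_best_feature([0.5, 0.2, 0.4, 0.2], random.Random(s)) == 1
           for s in range(20))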

Here you can see an example of the previous and the new one:
Previous BestSplitter is on the left and New BestSplitter2 is on the right

Screenshot-from-2018-10-12-11-31-55.png

We observe that BestSplitter2 gives priority to lower-index features.
And when we test stability:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

# Same splitter, two different random states.
dtc1 = DecisionTreeClassifier(random_state=1, splitter='best2')
dtc2 = DecisionTreeClassifier(random_state=2, splitter='best2')

rs = np.random.RandomState(1234)
itr = rs.rand(x.shape[0]) < 0.75

dtc1.fit(x[itr], y[itr])
dtc2.fit(x[itr], y[itr])

# Count the test predictions that differ between the two trees.
print((dtc1.predict(x[~itr]) != dtc2.predict(x[~itr])).sum())

We obtain 0 as expected in #12259

Any other comments?

I don't know what you think, but I believe having many splitters is not a solution either.
Personally, I would replace the current BestSplitter with the one described above, as they are essentially the same but with some improvements.

Thank you very much

Member

@jnothman jnothman left a comment


This is a bigger change than I expected. Is it not possible to reuse the existing code?

@jnothman
Member

This needs tests also

@jnothman jnothman changed the title fix decisionTree issue #12259 [WIP] priority of features in decision to splitting Oct 14, 2018
@GMarzinotto
Contributor Author

Actually, there are that many lines because I created a new splitter type.
But the new splitter type is basically BestSplitter with just one line of code changed.

I updated the pull request; now you can see what I changed right here:
https://github.com/scikit-learn/scikit-learn/pull/12364/files

Basically, I said that the best feature to split on can be updated if two features have the same criterion value and the new feature has a lower index than the previous best feature.

I put my code in a new splitter in case you'd want to test both at the same time.
But I agree there is no need for two 'BestSplitters'.

@GMarzinotto
Contributor Author

GMarzinotto commented Oct 27, 2018

I apologize for the time I've taken with this pull request.
Currently my modifications are failing one test on Travis, in one of the virtual environments:

DISTRIB="ubuntu" PYTHON_VERSION="2.7" CYTHON_VERSION="0.23.5" COVERAGE=true

The test is

____________________ test_gradient_boosting_early_stopping _____________________

def test_gradient_boosting_early_stopping():
    X, y = make_classification(n_samples=1000, random_state=0)

    gbc = GradientBoostingClassifier(n_estimators=1000,
                                     n_iter_no_change=10,
                                     learning_rate=0.1, max_depth=3,
                                     random_state=42)

    gbr = GradientBoostingRegressor(n_estimators=1000, n_iter_no_change=10,
                                    learning_rate=0.1, max_depth=3,
                                    random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=42)
    # Check if early_stopping works as expected
    for est, tol, early_stop_n_estimators in ((gbc, 1e-1, 24), (gbr, 1e-1, 13),
                                              (gbc, 1e-3, 36),
                                              (gbr, 1e-3, 28)):
        est.set_params(tol=tol)
        est.fit(X_train, y_train)
        assert_equal(est.n_estimators_, early_stop_n_estimators)

E AssertionError: 39 != 36

I have had a hard time trying to debug this, because I'm not able to install scikit-learn using Cython==0.23.5 as Travis does.

I get the following error

cc1: some warnings being treated as errors
error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/home/gabriel/TEST_SKLEARN_FORK/local/lib/python2.7/site-packages/numpy/core/include -I/home/gabriel/TEST_SKLEARN_FORK/local/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -I/usr/include/python2.7 -c sklearn/neighbors/quad_tree.c -o build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o -MMD -MF build/temp.linux-x86_64-2.7/sklearn/neighbors/quad_tree.o.d" failed with exit status 1

Would you have any idea of why this could be happening?

My Python version is 2.7.12
pip freeze gives:

atomicwrites==1.2.1
attrs==18.2.0
certifi==2018.10.15
chardet==3.0.4
codecov==2.0.15
coverage==4.5.1
Cython==0.23.5
funcsigs==1.0.2
idna==2.7
more-itertools==4.3.0
numpy==1.16.0.dev0+c0d4c74
pandas==0.24.0.dev0+824.g437f31cb6
pathlib2==2.3.2
pkg-resources==0.0.0
pluggy==0.8.0
py==1.7.0
pytest==3.9.2
pytest-cov==2.6.0
python-dateutil==2.7.4
pytz==2018.6
requests==2.20.0
scandir==1.9.0
scipy==1.2.0.dev0+3516199
six==1.11.0
urllib3==1.24

As I haven't been able to debug this, I would like to ask if you have any insight into why this test could be failing here when it passes in all the other virtual environments.

@amueller
Member

amueller commented Oct 29, 2018

This is quite weird. Please add a regression test for your change. The test that's failing seems to be hard-coding results; we should check where they came from. Having different results on different architectures is not impossible, but it is a bit surprising.

@Superhzf

Superhzf commented Oct 19, 2020

Hi @GMarzinotto, any progress on this PR?

@WDee

WDee commented Oct 29, 2020

Interested to hear if this issue has been resolved; it would be very useful for my research.

@GMarzinotto
Contributor Author

Hello!
So I have updated the problematic test.
Actually, in the meantime, the hard-coded test that was failing had already been modified,
so I just changed the random seed and hardcoded some new values.

I also updated a few parts of the automatically generated documentation.
Performance never degrades, but the structures of the trees do change (this is normal and works as expected).

Please let me know if there are any questions or remarks!

@GMarzinotto
Contributor Author

I would like to add a suggestion for the test

test_gradient_boosting_early_stopping

I believe that, instead of hardcoding the results, one should:

  • check that the number of estimators increases as the tolerance decreases,
  • check that the number of estimators is smaller than the n_estimators parameter,
  • check that performance remains above a threshold.

That way the test will continue to validate early stopping and be robust to changes in random states/platforms. A rough sketch of such a test follows below.
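
For illustration, here is a minimal sketch of what such a hardcoding-free test could look like (only an illustration of the suggestion above, not the actual scikit-learn test; the accuracy threshold of 0.7 is an assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

n_estimators_seen = []
for tol in (1e-1, 1e-3):  # larger tolerance first
    gbc = GradientBoostingClassifier(n_estimators=1000, n_iter_no_change=10,
                                     learning_rate=0.1, max_depth=3,
                                     random_state=42, tol=tol)
    gbc.fit(X_train, y_train)
    # Early stopping must kick in before the n_estimators budget is exhausted.
    assert gbc.n_estimators_ < 1000
    # Performance should stay above a (loose, assumed) threshold.
    assert gbc.score(X_test, y_test) > 0.7
    n_estimators_seen.append(gbc.n_estimators_)

# A smaller tolerance should not stop earlier than a larger one.
assert n_estimators_seen[0] <= n_estimators_seen[1]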

@WDee

WDee commented Nov 2, 2020

Thanks. I am super excited to test this out; it might be very useful for some applied genomic research that I am doing. This might be a stupid or beginner's question, as I have never done this before, but what is the easiest way to install this changed version? Should I follow e.g. https://scikit-learn.org/stable/developers/advanced_installation.html ? Many thanks in advance!

@GMarzinotto GMarzinotto requested a review from jnothman December 14, 2020 17:36
Base automatically changed from master to main January 22, 2021 10:50
@jmradecka

Hi, what's the current status of this PR? Is it going to be merged?

Development

Successfully merging this pull request may close these issues.

DecisionTreeClassifier behaviour when there are 2 or more best splitter (a tie among splitters)
8 participants