[WIP] priority of features in decision to splitting #12364
Conversation
This is a bigger change than I expected. Is it not possible to reuse the existing code?
This also needs tests.
Actually there are that many lines because I created a new splitter type. I updated the pull request, so now you can see exactly what I changed. Basically, the best feature to split on can be updated if two features have the same criterion value and the new feature has a lower index than the previous best feature. I put my code in a new splitter in case you'd want to test both at the same time.
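As a rough illustration of the rule just described, here is a minimal plain-Python sketch (not the actual Cython splitter code); `pick_best_feature` and `candidate_splits` are hypothetical names introduced only for this example:

```python
def pick_best_feature(candidate_splits):
    """candidate_splits: iterable of (feature_index, criterion_improvement)
    pairs, visited in an arbitrary (possibly shuffled) order."""
    best_feature, best_value = None, float("-inf")
    for feature, value in candidate_splits:
        improves = value > best_value
        # Tie-break: same criterion value but a lower feature index wins.
        breaks_tie = value == best_value and (best_feature is None or feature < best_feature)
        if improves or breaks_tie:
            best_feature, best_value = feature, value
    return best_feature


# With a tie between features 2 and 0, the lower index is chosen
# regardless of the visiting order.
print(pick_best_feature([(2, 0.5), (1, 0.3), (0, 0.5)]))  # -> 0
```

With this rule the outcome no longer depends on the order in which tied features are visited, which is what makes the shuffled loop deterministic again.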
I apologize for the time I've taken with this pull request.
The test is
I have had a hard time trying to debug this because I'm not able to install scikit-learn with Cython=0.23.5 as Travis does. I get the following error
Would you have any idea why this could be happening? My Python version is 2.7.12.
As I haven't been able to debug it, I would like to ask whether you have any insight into why the test could be failing here when it passes in all the other virtual environments.
This is quite weird. Please add a regression test for your change. The test that's failing seems to hard-code results; we should check where they came from. Having different results on different architectures is not impossible, but it is a bit surprising.
Hi @GMarzinotto, any progress on this PR?
I'm interested to hear whether this issue has been resolved; it would be very useful for my research.
Hello! I also updated a few parts of the automatically generated documentation. Please let me know if you have any questions or remarks!
I would like to add a suggestion for the test test_gradient_boosting_early_stopping. I believe that, instead of hardcoding the results, one should:
That way the test will continue to validate early stopping and be robust to changes in the random states/platforms (one possible version is sketched below).
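The details of the suggestion are not shown above, so purely as a guess at its spirit, here is a hedged sketch of a check that avoids hardcoded numbers by comparing a fit with early stopping against one without it (the dataset, sizes and thresholds are arbitrary choices of mine, not the original proposal):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Same budget of estimators, with and without early stopping.
with_stop = GradientBoostingClassifier(
    n_estimators=200, n_iter_no_change=5, validation_fraction=0.1, random_state=0
).fit(X, y)
without_stop = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X, y)

# Early stopping should use fewer estimators than the full budget,
# whatever the platform or random state; no exact counts are hardcoded.
assert with_stop.n_estimators_ < without_stop.n_estimators_ == 200
```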
Thanks, I am super excited to test this out. It might be very useful for some applied genomics research that I am doing. This might be a stupid or beginner's question, as I have never done this before, but what is the easiest way to install this changed version? Should I follow e.g. https://scikit-learn.org/stable/developers/advanced_installation.html ? Many thanks in advance!
Hi, what's the current status of this PR? Is it going to be merged?
Reference Issues/PRs
Fixes #12259. See also issues #2386, #8443 and #12188.
What does this implement/fix? Explain your changes.
Basically, scikit-learn implements decision trees with two different types of splitters, a RandomSplitter and a BestSplitter. I focus on BestSplitter, which currently has two flaws:
We have a supplementary constraint, which is the shuffle constraint: there is currently a max_features parameter that lets us avoid testing all the features at each node when there are many. This is a nice feature, and I believe we should always shuffle. The problem is that shuffling means non-deterministic behavior...
So we have:
- For me the ideal is to have a prior on the features (sorted, where a lower index means the feature is preferred).
- We need to shuffle, or at least loop, in order to support max_features, which is often useful and is necessary for sparse cases.
Then it should be deterministic in the default configuration, so if we want this we have the following possibility: with max_features=n_features the result is stable regardless of the random_state, because we will test all the features. So I implemented this BestSplitter2, which shuffles, is deterministic when max_features=n_features, and gives priority to the lower-index feature when there is a tie. Here you can see an example of the previous splitter and the new one:
The previous BestSplitter is on the left and the new BestSplitter2 is on the right.
We observe that BestSplitter2 gives priority to lower-index features.
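The screenshots from the original comment are not reproduced here; as a hedged sketch of the kind of comparison they showed, one can duplicate a column so that two features tie exactly and then look at which feature the root node uses. With the released BestSplitter the choice depends on random_state; the proposed BestSplitter2 would consistently pick column 0:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
base = rng.rand(200, 1)
X = np.hstack([base, base])              # feature 1 is an exact copy of feature 0
y = (base[:, 0] > 0.5).astype(int)

for seed in range(5):
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    # Index of the feature used at the root; with the current splitter it
    # can be 0 or 1 depending on the seed, despite the exact tie.
    print(seed, tree.tree_.feature[0])
```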
And when we test stability, we obtain 0, as expected in #12259.
Any other comments?
I don't know what you think, but I don't believe having many splitters is a solution either.
Personally, I would replace the current BestSplitter with the one described above, as they are essentially the same but with some improvements.
Thank you very much.