Feature Request: Hellinger split criterion for classification trees #9947
Currently tree classifiers like sklearn.tree.DecisionTreeClassifier have two options for the split criterion, "gini" and "entropy". These are sensitive to imbalanced datasets. I would like to request the addition of the Hellinger split criterion, since it is insensitive to imbalanced datasets. The documentation and motivation are outlined in the following papers:

https://www.researchgate.net/publication/220451886_Hellinger_distance_decision_trees_are_robust_and_skew-insensitive
https://www.researchgate.net/publication/262225473_Hellinger_Distance_Trees_for_Imbalanced_Streams

Example code (non-Python) is contained within, and there are implementations in other languages as well, for example:

https://www3.nd.edu/~dial//software.html

We already have a class_weight parameter, but Hellinger trees have been shown to be more useful than reweighting, so it makes sense to add this criterion.

This would require an update to

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pxd
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

as well as adding the option to each of the classifiers. These include:

sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.ensemble.RandomForestClassifier

I would be willing to offer what help I can, but I have yet to develop in Cython, so I am unfamiliar with the coding standards.
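To make the proposal concrete, here is a minimal NumPy sketch of the split score described in the first paper, restricted to the binary-class, binary-split case it covers; the function name and signature are illustrative, not from any library:

```python
import numpy as np

def hellinger_split_value(y, left_mask):
    """Hellinger distance of a candidate binary split (binary labels).

    The score compares how each class distributes across the two
    children using within-class proportions, so the class prior
    cancels out; this is what makes the criterion skew-insensitive.
    """
    y = np.asarray(y)
    left_mask = np.asarray(left_mask, dtype=bool)
    pos, neg = (y == 1), (y == 0)
    n_pos, n_neg = pos.sum(), neg.sum()
    if n_pos == 0 or n_neg == 0:
        return 0.0  # a pure node gains nothing from splitting
    # Fraction of each class routed to the left child.
    pos_left = (pos & left_mask).sum() / n_pos
    neg_left = (neg & left_mask).sum() / n_neg
    # Hellinger distance between the two within-class distributions
    # over the children {left, right}.
    return np.sqrt(
        (np.sqrt(pos_left) - np.sqrt(neg_left)) ** 2
        + (np.sqrt(1 - pos_left) - np.sqrt(1 - neg_left)) ** 2
    )

# A tree would evaluate every candidate split and keep the one with
# the largest value; a perfect separation scores sqrt(2):
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
perfect = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
print(hellinger_split_value(y, perfect))  # ~1.414
```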
The papers you reference are not highly cited. Are there alternatives that
have garnered greater interest?
Perhaps it is possible for you to implement these criteria and publish them
in a package external to scikit-learn, as we try to stick to popular
algorithms that have stood the test of time, in order to avoid excessive
maintenance burden.
Yes, this has not gathered a lot of citations in absolute terms, but a typical amount for a good paper in this subfield. There are other solutions to the issue of imbalanced data. The class_weight approach, which is already implemented, is a traditional but inferior solution. Other methods that achieve comparable results are mostly ensemble methods that rely on balancing at the time of bagging; good examples are Random Balance and RUSBoost. Similar methods are implemented in imbalanced-learn, and you could argue that split-criterion-based solutions to imbalanced data should live in that package. I do not know where to draw the line, since this minor feature addition has been shown to outperform the bulk of the methods in that package.

There is one other method comparable to the Hellinger criterion: another splitting criterion called Class Confidence Proportion, which does an equivalent yet different thing. I would be happy with this instead of the Hellinger method.

I am not sure that using citations as a metric for the usefulness of a feature is best. This is a very specific problem that occurs quite frequently in practice. Many users of scikit-learn are happy to use traditional methods or the ensemble methods in the imbalanced-learn package. It is easy to see there is general interest in solving the problem this way: https://www.reddit.com/r/MachineLearning/comments/2m72hv/hellinger_distance_decision_trees/

In any case, please leave the ticket open until some criterion-based solution is available to the community, even if this ticket needs to be passed to imbalanced-learn. Thanks!
I think that we could integrate it in imbalanced-learn if it shows some improvement. The criteria in scikit-learn's trees are classes; therefore I think that we could implement our own criterion on our side. We would just need scikit-learn to allow for duck typing. @jnothman Do you think this is reasonable?
I think it's very reasonable if we consider the Criterion API public...
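To illustrate the duck-typing idea in schematic form: the tree only calls Criterion methods on whatever object it is given, so a compiled criterion shipped outside scikit-learn could be passed in place of the "gini"/"entropy" strings. In the sketch below, the package and class names are hypothetical; only the criterion parameter is real:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical import: a Cython class compiled in an external package
# (e.g. imbalanced-learn) against scikit-learn's Criterion interface.
from some_external_package import HellingerCriterion  # hypothetical

# Duck typing: the tree does not check the object's type, only that it
# behaves like a Criterion, so an external instance would just work.
clf = DecisionTreeClassifier(criterion=HellingerCriterion())  # hypothetical constructor
```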
That is a great solution. It would allow us to have all the imbalanced-classification tools in one package, and it would seem the change to scikit-learn is relatively small. Could a ticket for this change be made to supplant this one, and this one be passed to imbalanced-learn? What is the release schedule like? Would getting this into 0.19.1 be possible?
It would not be appropriate for a bug-fix release, and 0.19.1 should be released today.
@DrEhrfurchtgebietend @glemaitre did this get implemented? I can't find it in imbalanced-learn. I'm trying to subclass (…)
It is on my pile, but I have not looked at it yet.
I've implemented Hellinger distance as a split criterion for sklearn DecisionTreeClassifier and RandomForestClassifier.
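For reference, usage along these lines appears in the README of @EvgeniDubov's repo (linked in a later comment). The import path matches what later comments in this thread quote; the constructor arguments (the number of outputs followed by an array with the number of classes per output) are from memory and may differ, so treat this as a sketch and check the repo:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Requires the compiled extension from
# https://github.com/EvgeniDubov/hellinger-distance-criterion
from hellinger_distance_criterion import HellingerDistanceCriterion

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Constructor arguments as recalled from the README: 1 output, 2 classes.
hdc = HellingerDistanceCriterion(1, np.array([2], dtype='int64'))
clf = RandomForestClassifier(criterion=hdc, n_estimators=100)
clf.fit(X, y)
```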
This is great, @EvgeniDubov. Unless I misunderstood your documentation, the necessary change to scikit-learn is already implemented. I am running 0.19.1, which does not have it yet; is there an idea of when it will be added to a release? The criterion itself then lives in an independent package. It would make a lot of sense for this to be added to imbalanced-learn with the rest of the imbalanced-data handling methods. @glemaitre, do you have concerns other than the dependency on scikit-learn?
Thanks @DrEhrfurchtgebietend.
Hi @EvgeniDubov. Are there many other split criteria already available? I am only aware of one other that has been discussed but not implemented, Class Confidence Proportion. CCP is similar to Hellinger in that it is designed for imbalanced data. Clearly both can be used for problems that are not imbalanced, but I am not aware that they have shown utility over Gini or entropy there. imbalanced-learn covers all common imbalanced-classification solutions (undersampling, oversampling, algorithmic, SMOTE) except split-criterion methods, so it makes sense to add this one for completeness. However, if there are enough other split criteria to justify a separate package, that makes sense too. My major interest is when sklearn will put the #10325 update into a standard release.
#10325 is merged in master and we are heading for a release pretty soon (in the next month).
We are actually looking at diversifying the algorithms, provided that they still address the class-imbalance issue. Lately I have had less time since I focused on scikit-learn and side projects, but we will make a release after the scikit-learn one, and we are happy to include new methods and tools there.
@DrEhrfurchtgebietend @glemaitre thanks a lot for your input, guys.
Hello guys, I copied your whole repository to ~\site-packages\sklearn\tree and ran python setup.py build_ext --inplace. There was an error that the command 'build_ext' does not exist, so I tried python setup.py build_clib, which actually worked. But back in my Jupyter notebook, from hellinger_distance_criterion import HellingerDistanceCriterion didn't work, so I'm stuck at this point. I hope this is the right place to post this issue, or should I rather take it to Stack Overflow?
You might be better off waiting a month or two if you are not in a rush. sklearn 0.20.0 should come out very shortly, and it will enable the use of external split criteria. The Hellinger criterion is being developed in imbalanced-learn; have a look here: scikit-learn-contrib/imbalanced-learn#437. Releases of imbalanced-learn normally come out shortly after the new sklearn version they are built on.
Actually I'd like to use this for my thesis, but it does not seem feasible for me right now, so it seems I have to wait and maybe use it in later projects.
@woizaterminator I have a repo with a Hellinger criterion implementation, and I'm now working on adding it to imbalanced-learn: scikit-learn-contrib/imbalanced-learn#437.
Following the comment of @EvgeniDubov, I want it to be in the next release of imbalanced-learn, which will follow the release of scikit-learn.
@EvgeniDubov thanks for the implementation! It failed to compile for me in a Windows environment because the Visual Studio build tools didn't work; compiling it in a Linux environment worked for me.
@EvgeniDubov, I have followed your instructions at https://github.com/EvgeniDubov/hellinger-distance-criterion and successfully installed the Hellinger distance criterion. However, when I try to run the example in PyCharm or Python IDLE, it shows "no module named hellinger_distance_criterion"; when I tried the code from the "Example" section in the Python console of my terminal, it works. I don't know if I have missed something; looking forward to your help.
The implementation by @EvgeniDubov seems to work consistently for binary classification, but for multi-class classification I always get an error. Any help in this regard would be very much appreciated.
When will it be added?
See discussion in #16478 (comment). |