
Feature Request: Hellinger split criterion for classification trees #9947


Closed

Gitman-code opened this issue Oct 17, 2017 · 25 comments

@Gitman-code

Gitman-code commented Oct 17, 2017

Currently, tree classifiers like sklearn.tree.DecisionTreeClassifier have two options for the split criterion, "gini" and "entropy". Both are sensitive to imbalanced datasets. I would like to request the addition of the Hellinger split criterion, since it is insensitive to class imbalance. The motivation and details are outlined in the following papers:

https://www.researchgate.net/publication/220451886_Hellinger_distance_decision_trees_are_robust_and_skew-insensitive
https://www.researchgate.net/publication/262225473_Hellinger_Distance_Trees_for_Imbalanced_Streams

Example code (non-Python) is contained within those papers. There are implementations in other languages as well, for example:
https://www3.nd.edu/~dial//software.html

We already have a "class_weight" parameter, but the use of Hellinger trees has been shown to be more effective than reweighting, so it makes sense to add this criterion.
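For reference, in the two-class case the criterion from the first paper reduces to the following (a pure-Python sketch for illustration only; a real implementation would need to be a Cython Criterion subclass):

```python
import numpy as np

def hellinger_split_value(n_left_pos, n_left_neg, n_pos, n_neg):
    """Hellinger distance for a candidate binary split (two classes).

    It compares the *within-class* fractions of samples sent to each
    child, so the value does not depend on the class priors -- that is
    what makes the criterion skew-insensitive.
    """
    n_right_pos = n_pos - n_left_pos
    n_right_neg = n_neg - n_left_neg
    return np.sqrt(
        (np.sqrt(n_left_pos / n_pos) - np.sqrt(n_left_neg / n_neg)) ** 2
        + (np.sqrt(n_right_pos / n_pos) - np.sqrt(n_right_neg / n_neg)) ** 2
    )

# A split that isolates most positives scores high even at 100:1 imbalance:
print(hellinger_split_value(n_left_pos=90, n_left_neg=5, n_pos=100, n_neg=10000))
```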

This would require an update to
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pxd
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

as well as adding the option to each of the classifiers. These include:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.ensemble.RandomForestClassifier

@jnothman
Member

jnothman commented Oct 17, 2017 via email

@Gitman-code
Author

Gitman-code commented Oct 19, 2017

Yes, the paper has not gathered a huge number of citations overall, but a typical amount for a good paper in this subfield.

There are other solutions to the problem of imbalanced data. The "class_weight" method, which is already implemented, is a traditional but inferior solution. Other methods that achieve comparable results are mostly ensemble methods that rely on rebalancing at bagging time. Good examples are Random Balance

https://www.researchgate.net/publication/276152298_Random_Balance_Ensembles_of_variable_priors_classifiers_for_imbalanced_data

or RUSBoost
http://ieeexplore.ieee.org/document/5299216/

Similar methods are implemented in
http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html#module-imblearn.ensemble

You could argue that split-criterion-based solutions to imbalanced data belong in that package. I do not know where to draw the line, since this minor feature addition has been shown to outperform the bulk of the methods in that package. There is one other method comparable to the Hellinger approach:
https://www3.nd.edu/~nchawla/papers/SDM10.pdf

It is another splitting criterion, called Class Confidence Proportion, which addresses the same problem in a different way. I would be happy with this instead of the Hellinger method.

I am not sure citation counts are the best metric for the usefulness of a feature. This is a very specific problem that occurs quite frequently in practice. Many users of scikit-learn are happy with traditional methods or the ensemble methods in the imbalanced-learn package, but it is easy to see there is general interest in solving the problem this way:

https://www.reddit.com/r/MachineLearning/comments/2m72hv/hellinger_distance_decision_trees/

In any case, please leave the ticket open until some criterion-based solution is available to the community, even if this ticket needs to be passed to imbalanced-learn. Thanks!

@glemaitre
Member

I think that we could integrate it in imbalanced-learn if it shows some improvement.

The tree criteria in scikit-learn are classes; therefore I think that we could implement our own criterion on our side. We would just need scikit-learn to allow duck-typing.
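Roughly what I have in mind, as an untested sketch (the HellingerCriterion class and the imblearn module path here are hypothetical; the only requirement on the scikit-learn side is accepting a pre-built Criterion instance instead of only the name strings):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical: a Cython subclass of
# sklearn.tree._criterion.ClassificationCriterion compiled on our side.
from imblearn.tree_criterion import HellingerCriterion  # does not exist (yet)

# ClassificationCriterion constructors take (n_outputs, n_classes per output).
criterion = HellingerCriterion(1, np.array([2], dtype=np.intp))
clf = DecisionTreeClassifier(criterion=criterion)
```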

@jnothman Do you think this is something reasonable?

@jnothman
Member

jnothman commented Oct 21, 2017 via email

@Gitman-code
Author

That is a great solution. It would allow all the imbalanced classification tools to live in one package, and the change to scikit-learn seems relatively small. Could a ticket for that change be made to supplant this one, with this one passed to imbalanced-learn? What is the release schedule like? Would getting this into 0.19.1 be possible?

@jnothman
Member

jnothman commented Oct 22, 2017 via email

@camilstaps
Contributor

@DrEhrfurchtgebietend @glemaitre did this get implemented? I can't find it in imbalanced-learn. I'm trying to subclass (Classification)Criterion myself and am looking for an example, so if you know of anything like that, that would be very much appreciated.

@glemaitre
Member

glemaitre commented Dec 4, 2017 via email

@EvgeniDubov

EvgeniDubov commented Jul 9, 2018

I've implemented Hellinger Distance as a split criterion for sklearn DecisionTreeClassifier and RandomForestClassifier.
It performs very well in my imbalanced-classification use cases, beating RandomForestClassifier with gini as well as XGBClassifier.
You are welcome to check it out on https://github.com/EvgeniDubov/hellinger-distance-criterion
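Usage looks roughly like this (a sketch; it assumes the class follows scikit-learn's ClassificationCriterion constructor signature, so see the repo README for the exact details):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from hellinger_distance_criterion import HellingerDistanceCriterion

# A 95:5 imbalanced toy problem.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# (n_outputs, array with the number of classes per output)
hdc = HellingerDistanceCriterion(1, np.array([2], dtype='int64'))
clf = RandomForestClassifier(criterion=hdc, n_estimators=100, random_state=0)
clf.fit(X, y)
```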

@Gitman-code
Author

This is great, @EvgeniDubov. Unless I misunderstood your documentation, the necessary change to scikit-learn is already implemented. I am running 0.19.1, which does not have it yet; is there an idea of when it will be added to a production release?

The criterion itself then lives in an independent package. It would make a lot of sense for it to be added to imbalanced-learn with the rest of the imbalanced-data handling methods. @glemaitre, do you have concerns other than the dependency on scikit-learn?

@EvgeniDubov

Thanks @DrEhrfurchtgebietend.
Hellinger Distance helps in imbalanced data classification but also works with balanced data sets.
Looks like imbalanced-learn is mainly focused on data sampling.
It might be a good idea to open a new sklearn-contrib repository where current and future Cython implementations of both ClassificationCriterion and RegressionCriterion will be placed.
I'm willing to open such a project and submit it for your review and feedback. What do you say, @glemaitre?

@Gitman-code
Author

Hi @EvgeniDubov. Are there many other split criteria already available? I am only aware of one other that has been discussed but not implemented, Class Confidence Proportion. CCP is similar to Hellinger in that it is designed for imbalanced data. Clearly both can be used for problems that are not imbalanced, but I am not aware that they have shown any advantage over gini or entropy there.

imbalanced-learn covers all the common imbalanced-classification solutions (undersampling, oversampling, SMOTE, algorithmic approaches) except split-criterion methods, so adding this one makes sense for completeness. However, if there are enough other split criteria to justify a separate package, that makes sense too.

My main interest is when sklearn will include the #10325 change in a standard release.
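For reference, what #10325 enables is passing a Criterion instance instead of a string; with a built-in criterion it would look like this (a sketch; note that _criterion is a private module, not a stable public API):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._criterion import Gini  # private module, may change

X, y = load_breast_cancer(return_X_y=True)

# One output, two classes -- this mirrors what the string "gini" builds,
# but any ClassificationCriterion subclass could be passed the same way.
criterion = Gini(1, np.array([2], dtype=np.intp))
clf = DecisionTreeClassifier(criterion=criterion).fit(X, y)
```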

@glemaitre
Member

#10325 is merged in master and we are heading for a release pretty soon (in the next month).

> Looks like imbalanced-learn is mainly focused on data sampling.

We are actually looking at diversifying the algorithms, as long as they still address imbalance problems. Lately I have had less time since I was focused on scikit-learn and side projects, but we will make a release after the scikit-learn one, and we are happy to include new methods and tools there.

@EvgeniDubov

@DrEhrfurchtgebietend @glemaitre thanks a lot for your input, guys.
I've submitted a pull request to scikit-learn-contrib/imbalanced-learn

@woizaterminator

Hello guys,
that's exactly what I'm looking for, but I can't get it to import.

I copied your whole repository to ~\site-packages\sklearn\tree
and then tried to compile the setup.py with

python setup.py build_ext --inplace

There was an error that the command 'build_ext' does not exist, so I tried

python setup.py build_clib

which actually worked, but back in my Jupyter notebook the import

from hellinger_distance_criterion import HellingerDistanceCriterion

didn't work, so I'm stuck at this point.

I hope this is the right place to post this issue, or should I rather take it to Stack Overflow?
I've been trying for about 5 hours now without getting anywhere, so I hope somebody can resolve this issue.

@Gitman-code
Author

Gitman-code commented Aug 18, 2018

You might be better off waiting a month or two if you are not in a rush. sklearn 0.20.0 should come out very shortly, and it will enable the use of external split criteria. The Hellinger criterion is being developed in imbalanced-learn; have a look here: scikit-learn-contrib/imbalanced-learn#437. Releases of imbalanced-learn normally come out shortly after the new sklearn version they are built on.

@woizaterminator

Actually I'd like to use this for my thesis, but it does not seem feasible for me right now, so it looks like I will have to wait for it and maybe use it in later projects.

@EvgeniDubov

@woizaterminator I have a repo with a Hellinger criterion implementation, and I'm now working on adding it to imbalanced-learn: scikit-learn-contrib/imbalanced-learn#437.
So in case you are in a rush and need it in your current project, you can find it here: EvgeniDubov/hellinger-distance-criterion

@glemaitre
Member

Following the comment of @EvgeniDubov, I want it to be in the next release of imbalanced-learn, which will follow the release of scikit-learn.
We just need to figure out the generalization of the Hellinger distance criterion for multi-class problems. Any input would be appreciated ;)
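One possible direction, offered only as a suggestion and not taken from the papers: decompose the multi-class problem into all one-vs-one class pairs, apply the binary Hellinger distance to each pair, and aggregate, e.g.:

```python
import numpy as np
from itertools import combinations

def multiclass_hellinger(left_counts, parent_counts):
    """Average the binary Hellinger split value over all class pairs.

    left_counts / parent_counts: per-class sample counts in the left
    child and in the parent node (1-D arrays of equal length).
    """
    left_counts = np.asarray(left_counts, dtype=float)
    parent_counts = np.asarray(parent_counts, dtype=float)
    right_counts = parent_counts - left_counts
    values = []
    for i, j in combinations(range(len(parent_counts)), 2):
        d = np.sqrt(
            (np.sqrt(left_counts[i] / parent_counts[i])
             - np.sqrt(left_counts[j] / parent_counts[j])) ** 2
            + (np.sqrt(right_counts[i] / parent_counts[i])
               - np.sqrt(right_counts[j] / parent_counts[j])) ** 2
        )
        values.append(d)
    return np.mean(values)
```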

@woizaterminator

@EvgeniDubov thanks for the implementation! It failed to compile for me in a Windows environment because the Visual Studio build tools didn't work for me; compiling it in a Linux environment worked.

@Gitman-code
Author

@niuwan1

niuwan1 commented Jul 22, 2019

@EvgeniDubov, I have followed your instructions at https://github.com/EvgeniDubov/hellinger-distance-criterion and successfully installed the Hellinger distance criterion. However, when I try to run the example in PyCharm or Python IDLE, it shows "no module named hellinger_distance_criterion". When I run the code from the "Example" section in Python from my terminal console, it works. I don't know if I have missed something; looking forward to your help.

@neelanjan00

> Following the comment of @EvgeniDubov, I want it to be in the next release of imbalanced-learn, which will follow the release of scikit-learn.
> We just need to figure out the generalization of the Hellinger distance criterion for multi-class problems. Any input would be appreciated ;)

The implementation by @EvgeniDubov seems to work consistently for binary classification, but in multi-class classification I always get an error. Any help in this regard would be very much appreciated.

@Sandy4321

When will it be added?

lorentzenchr reopened this May 7, 2023
@lorentzenchr
Member

See discussion in #16478 (comment).
