
ENH add criterion log_loss as alternative to entropy in trees and forests #23047


Merged — 9 commits, Apr 14, 2022

Conversation

@lorentzenchr (Member) commented Apr 4, 2022

Reference Issues/PRs

Partially addresses #18248

What does this implement/fix? Explain your changes.

This PR adds criterion="log_loss" as an alternative to criterion="entropy" in DecisionTreeClassifier, ExtraTreeClassifier, RandomForestClassifier, and ExtraTreesClassifier.

Any other comments?

@lorentzenchr (Member, Author)

I'm not sure whether the deprecated criterion "entropy" should still be listed in the documented options. #19310 let deprecated options stay in the docs, while #21314 removed them before the end of the deprecation cycle.

@ogrisel (Member) commented Apr 6, 2022

I don't think "log_loss" is a good name for a splitting criterion, as explained in #18248 (comment):

log loss is a cross-entropy (the KL divergence between an empirical distribution and the distribution induced by the model's predictions), while the entropy criterion is a quantity computed on the PMF of the samples at a given node in the tree. The criterion maximizes the change in entropy before and after a split; it never explicitly computes a cross-entropy.
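To illustrate the distinction, here is a small NumPy sketch of what the entropy criterion actually computes (the helper names are hypothetical, not scikit-learn's internal implementation): the Shannon entropy of a node's class PMF, and a split's information gain as the weighted decrease in that entropy — no cross-entropy between predictions and labels appears anywhere:

```python
import numpy as np

def node_entropy(y):
    """Shannon entropy (in bits) of the class PMF of the samples in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y_parent, y_left, y_right):
    """Weighted entropy decrease achieved by splitting parent into left/right."""
    n = len(y_parent)
    return node_entropy(y_parent) - (
        len(y_left) / n * node_entropy(y_left)
        + len(y_right) / n * node_entropy(y_right)
    )

# A perfectly class-separating split removes all entropy: gain == 1 bit here.
y_parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gain = information_gain(y_parent, y_parent[:4], y_parent[4:])
```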

@lorentzenchr (Member, Author)

Following #18248 (comment), when we have both names, "log_loss" and "entropy", which one should be the default?

@ogrisel (Member) commented Apr 8, 2022

> Following #18248 (comment), when we have both names, "log_loss" and "entropy", which one should be the default?

I am fine with using "log_loss" as the preferred name, for consistency with the new project-wide convention for classifiers in scikit-learn. Gini should stay the default criterion.

@lorentzenchr (Member, Author)

I should have read more carefully: criterion="gini" is and stays the default, of course.

@ogrisel (Member) left a comment:

LGTM!

@lorentzenchr lorentzenchr linked an issue Apr 11, 2022 that may be closed by this pull request
@thomasjpfan (Member) left a comment:

LGTM

@lorentzenchr lorentzenchr changed the title DEP criterion entropy in favor of log_loss in trees and forests ENH add criterion log_loss as alternative to entropy in trees and forests Apr 13, 2022
@lorentzenchr (Member, Author)

I'll add a what's new entry. Mentioning the new equivalent criterion "log_loss" seems the right thing to do.

@ogrisel ogrisel merged commit 6ce5c0a into scikit-learn:main Apr 14, 2022
@ogrisel (Member) commented Apr 14, 2022

Merged!

@lorentzenchr lorentzenchr deleted the call_it_log_loss_tree branch April 14, 2022 06:29
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
eddiebergman added a commit to automl/auto-sklearn that referenced this pull request Nov 15, 2022
eddiebergman added a commit to automl/auto-sklearn that referenced this pull request Nov 15, 2022

Successfully merging this pull request may close these issues.

RFC Consistent options/names for loss and criterion
5 participants