
ENH Hellinger distance split criterion for classification trees #16478

Closed
368 commits
9bc7978
changed impurity_improvement function signature in _criterion.pyx to …
EvgeniDubov Oct 23, 2021
a14b889
lint issues fix
EvgeniDubov Oct 23, 2021
e2c2bc0
black issue fix
EvgeniDubov Oct 23, 2021
e08fc4f
Using Cython deref operator to be more semantically accurate in _crit…
EvgeniDubov Jan 10, 2022
8d539c6
added dereference import to _criterion.pyx
EvgeniDubov Jan 10, 2022
c3d5951
rolling back the deref usage in _criterion.pyx
EvgeniDubov Jan 10, 2022
7412351
resetting count_k1 and count_k2 between each output inspection in hel…
EvgeniDubov Jan 10, 2022
4419414
added hellinger distance doc-string
EvgeniDubov Jan 10, 2022
f21ea29
Updated doc/whats_new/v0.23.rst
EvgeniDubov Jan 10, 2022
2013677
removed redundant 'from numpy.math cimport INFINITY' from _criterion.pyx
EvgeniDubov Jan 11, 2022
a85e92e
- added INF def
EvgeniDubov Feb 20, 2020
760d2a4
- more documentation
EvgeniDubov Apr 14, 2020
b92fe7b
lint issues fix
EvgeniDubov Oct 23, 2021
4cd790a
lint issues fix
EvgeniDubov Oct 23, 2021
5c57f77
black issue fix
EvgeniDubov Oct 23, 2021
9ce5426
Using Cython deref operator to be more semantically accurate in _crit…
EvgeniDubov Jan 10, 2022
78ea842
added dereference import to _criterion.pyx
EvgeniDubov Jan 10, 2022
ac2dea7
rolling back the deref usage in _criterion.pyx
EvgeniDubov Jan 10, 2022
8d6bbb2
removed "hellinger" from test_importances to make sure CI green other…
EvgeniDubov Jan 11, 2022
3b3f053
hellinger to use Criterion implementation of proxy_impurity_improveme…
EvgeniDubov Jan 14, 2022
a4add22
hellinger node_impurity return sum of children impurities
EvgeniDubov Jan 14, 2022
cf54101
hellinger changed child impurity to be lower is better
EvgeniDubov Jan 14, 2022
514178b
hellinger removed handling special case of perfect split, it is cover…
EvgeniDubov Jan 14, 2022
e5fbef8
removed hellinger criterion from test_forest to make sure the CI is g…
EvgeniDubov Jan 14, 2022
22ae6e2
added imbalanced classification dataset for hellinger testing
EvgeniDubov Jan 24, 2022
b29de9b
added imbalanced criterions testing
EvgeniDubov Jan 24, 2022
fff1a23
test typo fix
EvgeniDubov Jan 24, 2022
24a773d
test typo fix
EvgeniDubov Jan 24, 2022
7b4a6cd
test typo fix
EvgeniDubov Jan 24, 2022
388ed43
test code reformat
EvgeniDubov Jan 24, 2022
bbad3ac
test code reformat
EvgeniDubov Jan 24, 2022
3c23707
fixed pytest fixture
EvgeniDubov Jan 25, 2022
f264f8e
added imbalanced minority class ratio
EvgeniDubov Jan 25, 2022
30dfd47
hellinger score to compate to imbalanced minority class ratio
EvgeniDubov Jan 25, 2022
43cab10
testing hellinger on all forest classifiers
EvgeniDubov Jan 25, 2022
4d1fdf5
test forest check_importances to assert max top feature importance la…
EvgeniDubov Jan 25, 2022
3653687
forest test test_importances to define X,y according to criterion, pa…
EvgeniDubov Jan 25, 2022
6197177
blocking test_importances in test forest to see if my change failed CI
EvgeniDubov Jan 25, 2022
9635b7e
adding back test_importances in test forest
EvgeniDubov Jan 25, 2022
8455304
added imbalanced classification dataset to test_tree
EvgeniDubov Feb 3, 2022
59fbb7f
added binary classification target to test_tree
EvgeniDubov Feb 3, 2022
46c960e
added datasets with binary target to test_tree
EvgeniDubov Feb 3, 2022
963c177
test_iris refactoring to use pytest fixtures for criterion
EvgeniDubov Feb 3, 2022
ea41708
added test_imbalanced_criterions for hellinger
EvgeniDubov Feb 3, 2022
dfc73e0
fixed test_imbalanced_criterions
EvgeniDubov Feb 3, 2022
f8d06a1
test_sparse refactoring to use fixtures and support hellinger
EvgeniDubov Feb 4, 2022
7133b84
defined the criterions explicitly in tests fixture to separate cases …
EvgeniDubov Feb 4, 2022
0bacb62
fixed lint issues
EvgeniDubov Feb 4, 2022
b9528a9
fixed flake8 issues
EvgeniDubov Feb 4, 2022
be106b7
applied black reformat
EvgeniDubov Feb 4, 2022
cefe596
test_iris comment fix in test_tree.py
EvgeniDubov Feb 4, 2022
31bc121
added my contact in _criterion.pyx
EvgeniDubov Feb 4, 2022
3e8767d
hellinger formula explanation in tree.rst
EvgeniDubov Feb 6, 2022
3016c81
updated hellinger doc in sklearn/ensemble/_forest.py
EvgeniDubov Feb 6, 2022
470e65f
updated hellinger doc in sklearn/tree/_classes.py
EvgeniDubov Feb 6, 2022
05c0605
updated hellinger doc in tree/ _classes.py
EvgeniDubov Feb 6, 2022
0c362c2
expanded hellinger documentation in tree.rst
EvgeniDubov Feb 6, 2022
335f977
removed redundant minority class specification in imbalanced dataset …
EvgeniDubov Feb 7, 2022
2a3df93
check_importances refactoring in test_forest.py
EvgeniDubov Feb 7, 2022
2bcfd3d
reverting check_importances refactoring in test_forest.py
EvgeniDubov Feb 7, 2022
413d589
test_tree.py refactoring
EvgeniDubov Feb 7, 2022
0cca18c
applying sqrt on hellinger criterion score
EvgeniDubov Mar 20, 2022
c436c10
removed sqrt from hellinger criterion score as it fails the feature i…
EvgeniDubov Mar 20, 2022
09b2015
main rebase
EvgeniDubov Apr 18, 2022
287a049
Merge remote-tracking branch 'origin/hellinger_distance_criterion' in…
EvgeniDubov Apr 18, 2022
e495b99
reducing hellinger distance score calculation code
EvgeniDubov Apr 18, 2022
0816ef9
fixed pointers
EvgeniDubov Apr 18, 2022
dfd9f21
black fixes
EvgeniDubov Apr 18, 2022
a1258bc
moved hellinger feature announcement from 0.23 to 1.1 release notes
EvgeniDubov Apr 19, 2022
937f763
- added documentation and release notes
EvgeniDubov Apr 14, 2020
47ac904
- more documentation
EvgeniDubov Apr 14, 2020
b7f82a7
Using Cython deref operator to be more semantically accurate in _crit…
EvgeniDubov Jan 10, 2022
5395f9b
added dereference import to _criterion.pyx
EvgeniDubov Jan 10, 2022
a49b794
rolling back the deref usage in _criterion.pyx
EvgeniDubov Jan 10, 2022
559e6e5
lint issues fix
EvgeniDubov Oct 23, 2021
b9f134e
black issue fix
EvgeniDubov Oct 23, 2021
25a10ae
Using Cython deref operator to be more semantically accurate in _crit…
EvgeniDubov Jan 10, 2022
8038ec7
added dereference import to _criterion.pyx
EvgeniDubov Jan 10, 2022
ce696e6
rolling back the deref usage in _criterion.pyx
EvgeniDubov Jan 10, 2022
6a55dee
resetting count_k1 and count_k2 between each output inspection in hel…
EvgeniDubov Jan 10, 2022
1debc3b
added hellinger distance doc-string
EvgeniDubov Jan 10, 2022
46cea9b
removed blank line from doc/modules/tree.rst
EvgeniDubov Jan 10, 2022
082975e
made hellinger formula consistent with previous notations in doc/modu…
EvgeniDubov Jan 10, 2022
33f3e47
removed redundant explanation in hellinger doc
EvgeniDubov Jan 10, 2022
4492d92
removed "hellinger" from test_importances to make sure CI green other…
EvgeniDubov Jan 11, 2022
d361a77
hellinger to use Criterion implementation of proxy_impurity_improveme…
EvgeniDubov Jan 14, 2022
129ba2b
hellinger node_impurity return sum of children impurities
EvgeniDubov Jan 14, 2022
6cb8b49
hellinger changed child impurity to be lower is better
EvgeniDubov Jan 14, 2022
3249e1b
hellinger removed handling special case of perfect split, it is cover…
EvgeniDubov Jan 14, 2022
fd7d480
removed hellinger criterion from test_forest to make sure the CI is g…
EvgeniDubov Jan 14, 2022
0b650d1
removed hellinger criterion from test_tree to make sure the CI is gre…
EvgeniDubov Jan 15, 2022
8b96387
added imbalanced classification dataset for hellinger testing
EvgeniDubov Jan 24, 2022
fef29a9
added imbalanced criterions testing
EvgeniDubov Jan 24, 2022
9c91d9c
test typo fix
EvgeniDubov Jan 24, 2022
3599979
test typo fix
EvgeniDubov Jan 24, 2022
53b7fe5
test typo fix
EvgeniDubov Jan 24, 2022
18b950d
test code reformat
EvgeniDubov Jan 24, 2022
510b681
test code reformat
EvgeniDubov Jan 24, 2022
5d6b2ce
fixed pytest fixture
EvgeniDubov Jan 25, 2022
dca1e0d
added imbalanced minority class ratio
EvgeniDubov Jan 25, 2022
d11baf8
hellinger score to compate to imbalanced minority class ratio
EvgeniDubov Jan 25, 2022
d2f2dee
testing hellinger on all forest classifiers
EvgeniDubov Jan 25, 2022
b06b3f5
test forest check_importances to assert max top feature importance la…
EvgeniDubov Jan 25, 2022
61cb61c
forest test test_importances to define X,y according to criterion, pa…
EvgeniDubov Jan 25, 2022
c1510cd
blocking test_importances in test forest to see if my change failed CI
EvgeniDubov Jan 25, 2022
049e9c9
adding back test_importances in test forest
EvgeniDubov Jan 25, 2022
e0f7e8c
Update sklearn/tree/_criterion.pyx
EvgeniDubov Jan 30, 2022
fa90e25
added imbalanced classification dataset to test_tree
EvgeniDubov Feb 3, 2022
3a5dda0
added binary classification target to test_tree
EvgeniDubov Feb 3, 2022
e6b6be8
added datasets with binary target to test_tree
EvgeniDubov Feb 3, 2022
eb14cbb
test_iris refactoring to use pytest fixtures for criterion
EvgeniDubov Feb 3, 2022
120e373
added test_imbalanced_criterions for hellinger
EvgeniDubov Feb 3, 2022
44b65e3
fixed test_imbalanced_criterions
EvgeniDubov Feb 3, 2022
85fc45d
test_sparse refactoring to use fixtures and support hellinger
EvgeniDubov Feb 4, 2022
ff98b54
defined the criterions explicitly in tests fixture to separate cases …
EvgeniDubov Feb 4, 2022
d18e191
removed CLF_CRITERIONS as it is not used after refactoring
EvgeniDubov Feb 4, 2022
43d2a5d
fixed lint issues
EvgeniDubov Feb 4, 2022
c82b847
fixed flake8 issues
EvgeniDubov Feb 4, 2022
7a2f49b
applied black reformat
EvgeniDubov Feb 4, 2022
82d0ba2
test_iris comment fix in test_tree.py
EvgeniDubov Feb 4, 2022
1f178ad
added my contact in _criterion.pyx
EvgeniDubov Feb 4, 2022
557386f
hellinger formula explanation in tree.rst
EvgeniDubov Feb 6, 2022
926e787
updated hellinger doc in sklearn/ensemble/_forest.py
EvgeniDubov Feb 6, 2022
828a16c
updated hellinger doc in sklearn/ensemble/_forest.py
EvgeniDubov Feb 6, 2022
810a5c0
updated hellinger doc in sklearn/tree/_classes.py
EvgeniDubov Feb 6, 2022
06cb1f2
updated hellinger doc in tree/ _classes.py
EvgeniDubov Feb 6, 2022
1a0450a
expanded hellinger documentation in tree.rst
EvgeniDubov Feb 6, 2022
10353d1
removed redundant minority class specification in imbalanced dataset …
EvgeniDubov Feb 7, 2022
b249c3f
check_importances refactoring in test_forest.py
EvgeniDubov Feb 7, 2022
917687c
reverting check_importances refactoring in test_forest.py
EvgeniDubov Feb 7, 2022
cc6bd2e
test_tree.py refactoring
EvgeniDubov Feb 7, 2022
28b243f
applying sqrt on hellinger criterion score
EvgeniDubov Mar 20, 2022
36c0658
removed sqrt from hellinger criterion score as it fails the feature i…
EvgeniDubov Mar 20, 2022
81852c5
reducing hellinger distance score calculation code
EvgeniDubov Apr 18, 2022
24a54c5
fixed pointers
EvgeniDubov Apr 18, 2022
f9135a3
black fixes
EvgeniDubov Apr 18, 2022
b02cf78
moved hellinger feature announcement from 0.23 to 1.1 release notes
EvgeniDubov Apr 19, 2022
8a50998
Merge remote-tracking branch 'origin/hellinger_distance_criterion' in…
EvgeniDubov Apr 19, 2022
7c374de
Merge remote-tracking branch 'origin/master' into hellinger_distance_…
EvgeniDubov Apr 19, 2022
08cd53e
Merge remote-tracking branch 'upstream/main' into hellinger_distance_…
EvgeniDubov Apr 19, 2022
625a84f
fixed merge with main
EvgeniDubov Apr 19, 2022
27a5552
fixed lint
EvgeniDubov Apr 19, 2022
d86cb20
changed hellinger score to -score in instead of 1-score in order to m…
EvgeniDubov Apr 20, 2022
a34f5bc
stating hellinger supports single label in the documentation
EvgeniDubov Apr 20, 2022
a0354f6
back to hellinger impurity '1-score' to comply to other criteria scale
EvgeniDubov Apr 20, 2022
41ff7e2
added more documentation to hellinger distance criterion cython code
EvgeniDubov Apr 21, 2022
a6b7fac
added reference to hellinger relevant paper in tree documentation
EvgeniDubov Apr 21, 2022
769e604
main merge fix
EvgeniDubov Apr 21, 2022
d86ae83
Merge remote-tracking branch 'upstream/main' into hellinger_distance_…
EvgeniDubov Apr 22, 2022
f9da824
added sqrt on hellinger score
EvgeniDubov Apr 22, 2022
81c7404
lowered feature importance threshold in forest test check_importances…
EvgeniDubov Apr 22, 2022
30797b7
added a test for hellinger score range
EvgeniDubov Apr 22, 2022
43fcd99
fixed typo
EvgeniDubov Apr 24, 2022
a5312d9
added visual example of imbalanced data classification difference bet…
EvgeniDubov Apr 24, 2022
44c2d2a
added entropy to visual example of imbalanced data classification
EvgeniDubov May 14, 2022
e5ceb41
fixed black issue
EvgeniDubov May 14, 2022
a6f26d4
Update doc/modules/tree.rst
EvgeniDubov Jun 13, 2022
46a68f8
Update doc/modules/tree.rst
EvgeniDubov Jun 13, 2022
b62d82f
Update doc/modules/tree.rst
EvgeniDubov Jun 13, 2022
d29a78d
Update doc/modules/tree.rst
EvgeniDubov Jun 13, 2022
c3aa70b
Update doc/whats_new/v1.1.rst
EvgeniDubov Jun 13, 2022
885c30a
black refactoring
EvgeniDubov Jun 13, 2022
c039271
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jun 13, 2022
f194cf6
moved hellinger feature description from v1.1 to v1.2
EvgeniDubov Jun 13, 2022
eb66a90
fixed redundant splits by hellinger tree by making it provide score o…
EvgeniDubov Jul 16, 2022
ecdacaf
update forest tests for hellinger
EvgeniDubov Jul 16, 2022
972d7e8
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jul 20, 2022
3a77957
added hellinger to tree parameter constraints
EvgeniDubov Jul 20, 2022
bc870db
black fix
EvgeniDubov Jul 20, 2022
fa85e6d
forest test debug print
EvgeniDubov Jul 20, 2022
0d15d6f
forest importances test threshold change for hellinger
EvgeniDubov Jul 20, 2022
c9e510e
forest importances test threshold change for hellinger
EvgeniDubov Jul 20, 2022
187ca9f
forest importances test threshold change for hellinger
EvgeniDubov Jul 20, 2022
f2cddbc
forest importances test dataset change for hellinger
EvgeniDubov Jul 21, 2022
e480b09
deleted debug prints
EvgeniDubov Jul 21, 2022
d69e109
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 10, 2023
20d4ca9
naming convention refactoring
EvgeniDubov Jan 10, 2023
05e94ef
test_tree CLF_CRITERIONS refactoring
EvgeniDubov Jan 10, 2023
a17f176
fixed test_tree
EvgeniDubov Jan 10, 2023
9afbfac
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 11, 2023
23edd32
tree.rst changing to list format
EvgeniDubov Jan 12, 2023
e255c59
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 12, 2023
d0a62ae
added label in tree.rst for referencing
EvgeniDubov Jan 12, 2023
1a1595a
reformated to list the classification tree split criteria docstring
EvgeniDubov Jan 12, 2023
6869d8e
moved hellinger feature to v1.3 whats new
EvgeniDubov Jan 12, 2023
1320eea
fixed docstring long line
EvgeniDubov Jan 13, 2023
c001573
tree.rst doc rephrasing
EvgeniDubov Jan 13, 2023
15452ae
DOC Revert changes to doc/whats_new/v1.2.rst
EvgeniDubov Jan 13, 2023
ad981b0
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 13, 2023
b51605a
tree split criterion doc rephrasing
EvgeniDubov Jan 13, 2023
1cb0437
tree split criterion doc rephrasing
EvgeniDubov Jan 13, 2023
433fd59
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 13, 2023
413252c
fixed forest doc typo
EvgeniDubov Jan 14, 2023
9b0426b
hellinger doc reformat in _criterion.cyx
EvgeniDubov Jan 14, 2023
19705dc
variables names convention change _criterion.pyx
EvgeniDubov Jan 14, 2023
05720ce
hellnger doc rephrase
EvgeniDubov Jan 14, 2023
ea63f5f
hellinger code refactoring
EvgeniDubov Jan 14, 2023
e0d5b9a
doc typo fix
EvgeniDubov Jan 14, 2023
fd3606c
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 14, 2023
7ba096a
hellinger doc change
EvgeniDubov Jan 14, 2023
65ae007
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 14, 2023
7189aed
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 14, 2023
e5c975a
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 14, 2023
ca38862
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 16, 2023
5d7c57e
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 16, 2023
56c2667
Update sklearn/ensemble/_forest.py
EvgeniDubov Jan 22, 2023
826fb68
Update sklearn/tree/_criterion.pyx
EvgeniDubov Jan 22, 2023
342004f
Update sklearn/tree/_criterion.pyx
EvgeniDubov Jan 22, 2023
4aef69a
Update doc/modules/tree.rst
EvgeniDubov Jan 22, 2023
0be10a1
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 22, 2023
0e11ce1
capitalized comments plot_imbalanced_data_classification
EvgeniDubov Jan 22, 2023
7439c8f
split criteria docstring refactoring
EvgeniDubov Jan 22, 2023
bf5e981
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 25, 2023
0adbb57
Update sklearn/tree/_criterion.pyx
EvgeniDubov Jan 25, 2023
a2c8519
Update sklearn/tree/_criterion.pyx
EvgeniDubov Jan 25, 2023
265158a
Update examples/tree/plot_imbalanced_data_classification.py
EvgeniDubov Jan 25, 2023
a16003d
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 27, 2023
7bee937
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 28, 2023
2269b0c
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Jan 30, 2023
dca75a0
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 1, 2023
4c8f4ac
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 3, 2023
187bf58
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 4, 2023
f93d708
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 5, 2023
d87e5ec
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 7, 2023
e2abd47
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 7, 2023
5d9585d
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 8, 2023
cc2b53c
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 8, 2023
fee65eb
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 8, 2023
1717396
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 9, 2023
25073c1
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 10, 2023
e30ea17
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 10, 2023
f9db7fa
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 12, 2023
269eaa0
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 13, 2023
b37a0c8
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 13, 2023
d9271bd
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 17, 2023
aee810a
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 18, 2023
e8d59b5
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 19, 2023
2207ba1
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 20, 2023
9a69223
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 22, 2023
a17cdb1
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 22, 2023
50430a4
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Feb 25, 2023
edf52f7
Merge branch 'main' into hellinger_distance_criterion
EvgeniDubov Mar 5, 2023
93b00e0
Update sklearn/tree/_criterion.pyx
EvgeniDubov May 3, 2023
89618a4
Update sklearn/tree/_criterion.pyx
EvgeniDubov May 3, 2023
cd4f782
Update sklearn/tree/_criterion.pyx
EvgeniDubov May 3, 2023
0b38f85
fixed hellinger formula in _criterion.pyx doc string
EvgeniDubov May 3, 2023
4dcf57f
removed redundant check in _criterion.pyx
EvgeniDubov May 3, 2023
e175cda
Merge remote-tracking branch 'upstream/main' into hellinger_distance_…
EvgeniDubov May 3, 2023
438feb8
ran black and fixed lint
EvgeniDubov May 3, 2023
d66f78c
fixed lint
EvgeniDubov May 3, 2023
30 changes: 29 additions & 1 deletion doc/modules/tree.rst
@@ -84,7 +84,8 @@ The disadvantages of decision trees include:

- Decision tree learners create biased trees if some classes dominate.
It is therefore recommended to balance the dataset prior to fitting
with the decision tree.
with the decision tree. For binary classification problems, using the
Hellinger distance as the split criterion can alleviate this bias.


.. _tree_classification:
@@ -394,6 +395,9 @@ Tips on practical use
matrix input compared to a dense matrix when features have zero values in
most of the samples.

* If the target is binary and imbalanced, consider using the Hellinger
  distance criterion, `criterion="hellinger"`.


.. _tree_algorithms:

@@ -531,6 +535,30 @@ Log Loss or Entropy:

\mathrm{LL}(D, T) = \sum_{m \in T} \frac{n_m}{n} H(Q_m)

**Hellinger distance**

The Hellinger distance is implemented for the binary case with a single
output only.

The score is computed for a candidate split rather than for a single
node's class distribution:

.. math::

H(Q_m) = \sqrt{\left(\sqrt{\frac{N_{m,+}^{left}}{N_{m,+}}}-\sqrt{\frac{N_{m,-}^{left}}{N_{m,-}}}\right)^2+\left(\sqrt{\frac{N_{m,+}^{right}}{N_{m,+}}}-\sqrt{\frac{N_{m,-}^{right}}{N_{m,-}}}\right)^2 }

where:

- :math:`N_{m,+}` is the number of positive samples at node :math:`m`
- :math:`N_{m,-}` is the number of negative samples at node :math:`m`
- :math:`N_{m,+}^{left}` and :math:`N_{m,+}^{right}` are the numbers of positive
  samples in the left and right children of the split at node :math:`m`,
  respectively
- :math:`N_{m,-}^{left}` and :math:`N_{m,-}^{right}` are the numbers of negative
  samples in the left and right children of the split at node :math:`m`,
  respectively
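The formula above can be checked with a small NumPy sketch; `hellinger_split_score` is a hypothetical helper written for illustration, not a scikit-learn function:

```python
import numpy as np


def hellinger_split_score(n_pos_left, n_neg_left, n_pos, n_neg):
    # Hellinger distance of a candidate binary split, following the
    # formula above: counts of positive/negative samples sent to the
    # left child, plus the node's total positive/negative counts.
    n_pos_right = n_pos - n_pos_left
    n_neg_right = n_neg - n_neg_left
    return np.sqrt(
        (np.sqrt(n_pos_left / n_pos) - np.sqrt(n_neg_left / n_neg)) ** 2
        + (np.sqrt(n_pos_right / n_pos) - np.sqrt(n_neg_right / n_neg)) ** 2
    )


# A perfect split (all positives left, all negatives right) reaches the
# maximum score sqrt(2), however imbalanced the classes are.
print(hellinger_split_score(10, 0, 10, 990))   # → 1.4142135623730951

# A split that merely mirrors the class ratio carries no information.
print(hellinger_split_score(5, 495, 10, 990))  # → 0.0
```

Because the score depends only on the fraction of each class sent to either side, not on the absolute class frequencies, it is skew-insensitive, which is the property this PR targets.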



.. topic:: References:

* :doi:`Cieslak, D. A., Hoens, T. R., Chawla, N. V., & Kegelmeyer, W. P., Hellinger distance decision trees are robust and skew-insensitive, 2012 <10.1007/s10618-011-0222-1>`

Regression criteria
-------------------

5 changes: 5 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -485,6 +485,11 @@ Changelog

:mod:`sklearn.tree`
...................
- |Feature| :class:`tree.DecisionTreeClassifier`, :class:`tree.ExtraTreeClassifier`,
:class:`ensemble.RandomForestClassifier` and :class:`ensemble.ExtraTreesClassifier`
now support the Hellinger distance as a split criterion for
binary classification problems.
:pr:`16478` by :user:`Evgeni Dubov <EvgeniDubov>`.

- |Enhancement| Adds a `class_names` parameter to
:func:`tree.export_text`. This allows specifying the parameter `class_names`
97 changes: 97 additions & 0 deletions examples/tree/plot_imbalanced_data_classification.py
@@ -0,0 +1,97 @@
"""
===================================================================
Decision Tree imbalanced data classification
===================================================================

In this example we demonstrate the effect of different split criteria
on decision tree classifier predictions on imbalanced data.

You can see that the Gini index is biased towards the majority class while
the Hellinger distance is sensitive to the ratio between the classes.

"""

# Import the necessary modules and libraries
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Create imbalanced dataset
minority_class_ratio = 0.001
n_classes = 2
X, y = datasets.make_classification(
n_samples=1000,
n_features=2,
n_informative=2,
n_redundant=0,
n_repeated=0,
n_classes=n_classes,
n_clusters_per_class=1,
weights=[1 - minority_class_ratio],
shuffle=False,
random_state=0,
)

# Criteria to compare
criterions = ["gini", "entropy", "hellinger"]

# Plot parameters
fig = plt.figure(figsize=(10, 6))
max_subplots = len(criterions) * 2 - 1
plot_colors = ["darkgrey", "yellow"]
target_names = ["majority", "minority"]
markers = [".", "o"]

# Create mesh grid on feature space to draw classifier predictions
x_min, x_max = X[:, 0].min() - 0.2, X[:, 0].max() + 0.2
y_min, y_max = X[:, 1].min() - 2, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.01))

# Create plot per criterion
for sub_plot_idx, criterion in enumerate(criterions):
# Add subplot to figure
fig.add_subplot(1, max_subplots, sub_plot_idx * 2 + 1)

# Train classifier
clf = DecisionTreeClassifier(criterion=criterion)
clf.fit(X, y)

# Draw tree classifier prediction probability for minority class on feature space
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.rainbow)
plt.colorbar(label="minority class prediction probability")
# Set the range for probability on the colorbar.
plt.clim(0, 1)

# Draw axis labels
plt.xlabel("feature_1")
plt.ylabel("feature_2")

# Draw training points
for i, color, marker in zip(range(n_classes), plot_colors, markers):
idx = np.where(y == i)
plt.scatter(
X[:, 0][idx],
X[:, 1][idx],
c=color,
label=target_names[i],
cmap=plt.cm.RdYlBu,
marker=marker,
edgecolor="black",
)

# Draw subplot legend and title
plt.legend(
title="classes",
handletextpad=0,
loc="lower right",
borderpad=0,
scatterpoints=1,
labelspacing=1,
)
plt.title(criterion)

plt.plot()

plt.show()
20 changes: 12 additions & 8 deletions sklearn/ensemble/_forest.py
@@ -1105,11 +1105,13 @@ class RandomForestClassifier(ForestClassifier):
The default value of ``n_estimators`` changed from 10 to 100
in 0.22.

criterion : {"gini", "entropy", "log_loss"}, default="gini"
criterion : {"gini", "entropy", "log_loss", "hellinger"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "log_loss" and "entropy" both for the
Shannon information gain, see :ref:`tree_mathematical_formulation`.
Note: This parameter is tree-specific.
- "gini" for the Gini impurity
- "log_loss" and "entropy" both for the Shannon information gain
- "hellinger" for the Hellinger distance

Note: `criterion` is tree-specific, see :ref:`tree_mathematical_formulation`.

max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until
@@ -1794,11 +1796,13 @@ class ExtraTreesClassifier(ForestClassifier):
The default value of ``n_estimators`` changed from 10 to 100
in 0.22.

criterion : {"gini", "entropy", "log_loss"}, default="gini"
criterion : {"gini", "entropy", "log_loss", "hellinger"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "log_loss" and "entropy" both for the
Shannon information gain, see :ref:`tree_mathematical_formulation`.
Note: This parameter is tree-specific.
- "gini" for the Gini impurity
- "log_loss" and "entropy" both for the Shannon information gain
- "hellinger" for the Hellinger distance

Note: `criterion` is tree-specific, see :ref:`tree_mathematical_formulation`.

max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until
54 changes: 47 additions & 7 deletions sklearn/ensemble/tests/test_forest.py
@@ -76,6 +76,19 @@
random_state=0,
)

# Imbalanced classification sample used for testing imbalanced criteria
IMBL_MINORITY_CLASS_RATIO = 0.05
X_large_imbl, y_large_imbl = datasets.make_classification(
n_samples=500,
n_features=10,
n_informative=3,
n_redundant=0,
n_repeated=0,
weights=[IMBL_MINORITY_CLASS_RATIO],
shuffle=False,
random_state=42,
)

# also load the iris dataset
# and randomly permute it
iris = datasets.load_iris()
@@ -165,6 +178,27 @@ def test_iris(name, criterion):
check_iris_criterion(name, criterion)


def check_imbalanced_criterion(name, criterion):
ForestClassifier = FOREST_CLASSIFIERS[name]

clf = ForestClassifier(n_estimators=10, criterion=criterion, random_state=1)
clf.fit(X_large_imbl, y_large_imbl)

# score is a mean of minority class predict_proba
score = clf.predict_proba(X_large_imbl)[:, 1].mean()

assert score > IMBL_MINORITY_CLASS_RATIO, (
f"Failed with imbalanced criterion {criterion}, score = {score}, minority class"
f" ratio = {IMBL_MINORITY_CLASS_RATIO}"
)


@pytest.mark.parametrize("name", FOREST_CLASSIFIERS)
@pytest.mark.parametrize("criterion", ["hellinger"])
def test_imbalanced_criteria(name, criterion):
check_imbalanced_criterion(name, criterion)


def check_regression_criterion(name, criterion):
# Check consistency on regression dataset.
ForestRegressor = FOREST_REGRESSORS[name]
@@ -308,11 +342,7 @@ def test_probability(name):
check_probability(name)


def check_importances(name, criterion, dtype, tolerance):
# cast as dtype
X = X_large.astype(dtype, copy=False)
y = y_large.astype(dtype, copy=False)

def check_importances(name, criterion, tolerance, X, y):
ForestEstimator = FOREST_ESTIMATORS[name]

est = ForestEstimator(n_estimators=10, criterion=criterion, random_state=0)
@@ -350,15 +380,25 @@ def check_importances(name, criterion, dtype, tolerance):
@pytest.mark.parametrize(
"name, criterion",
itertools.chain(
product(FOREST_CLASSIFIERS, ["gini", "log_loss"]),
product(FOREST_CLASSIFIERS, ["gini", "log_loss", "hellinger"]),
product(FOREST_REGRESSORS, ["squared_error", "friedman_mse", "absolute_error"]),
),
)
def test_importances(dtype, name, criterion):
tolerance = 0.01
if name in FOREST_REGRESSORS and criterion == "absolute_error":
tolerance = 0.05
check_importances(name, criterion, dtype, tolerance)

# cast as dtype
# use imbalanced data for testing imbalanced criterion
if criterion == "hellinger":
X = X_large_imbl.astype(dtype, copy=False)
y = y_large_imbl.astype(dtype, copy=False)
else:
X = X_large.astype(dtype, copy=False)
y = y_large.astype(dtype, copy=False)

check_importances(name, criterion, tolerance, X, y)


def test_importances_asymptotic():
24 changes: 17 additions & 7 deletions sklearn/tree/_classes.py
@@ -68,6 +68,7 @@
"gini": _criterion.Gini,
"log_loss": _criterion.Entropy,
"entropy": _criterion.Entropy,
"hellinger": _criterion.HellingerDistance,
}
CRITERIA_REG = {
"squared_error": _criterion.MSE,
@@ -604,10 +605,13 @@ class DecisionTreeClassifier(ClassifierMixin, BaseDecisionTree):

Parameters
----------
criterion : {"gini", "entropy", "log_loss"}, default="gini"
criterion : {"gini", "entropy", "log_loss", "hellinger"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "log_loss" and "entropy" both for the
Shannon information gain, see :ref:`tree_mathematical_formulation`.
- "gini" for the Gini impurity
- "log_loss" and "entropy" both for the Shannon information gain
- "hellinger" for the Hellinger distance

Note: `criterion` is tree-specific, see :ref:`tree_mathematical_formulation`.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported
@@ -821,7 +825,10 @@ class DecisionTreeClassifier(ClassifierMixin, BaseDecisionTree):

_parameter_constraints: dict = {
**BaseDecisionTree._parameter_constraints,
"criterion": [StrOptions({"gini", "entropy", "log_loss"}), Hidden(Criterion)],
"criterion": [
StrOptions({"gini", "entropy", "log_loss", "hellinger"}),
Hidden(Criterion),
],
"class_weight": [dict, list, StrOptions({"balanced"}), None],
}

@@ -1291,10 +1298,13 @@ class ExtraTreeClassifier(DecisionTreeClassifier):

Parameters
----------
criterion : {"gini", "entropy", "log_loss"}, default="gini"
criterion : {"gini", "entropy", "log_loss", "hellinger"}, default="gini"
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "log_loss" and "entropy" both for the
Shannon information gain, see :ref:`tree_mathematical_formulation`.
- "gini" for the Gini impurity
- "log_loss" and "entropy" both for the Shannon information gain
- "hellinger" for the Hellinger distance

Note: `criterion` is tree-specific, see :ref:`tree_mathematical_formulation`.

splitter : {"random", "best"}, default="random"
The strategy used to choose the split at each node. Supported