[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

agamemnonc · 2019-08-14T15:35:36Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

predict_proba and decision_function of ClassifierChain retuns now a list of n_outputs arrays, each of shape (n_samples, n_classes). This format is consistent with that of MultiOutputClassifier.

Any other comments?

multiouput_only tag has now been removed. I wonder whether we should keep the _skip_test tag. Tests pass even when this tag is removed.

This change will break backward compatibility, since the functions previously returned arrays rather than lists. Any ideas on how to handle this?

agamemnonc · 2019-08-19T18:05:37Z

@jnothman @amueller just pinging you as you were the core developers mostly involved in the discussion of the issues this PR is aiming to fix.

agamemnonc · 2019-11-15T12:03:56Z

@jnothman @glemaitre @amueller
Is this something we are interested in or shall I go ahead and close it?

glemaitre

It seems that this is relevant. However, I am not really comfortable reviewing thing with multiouptut multiclass since I never used it. You could remove _skip_test to check which common tests are failing.
If @jnothman @amueller could have a look, it could be great (I will learn from it:))

doc/whats_new/v0.22.rst

sklearn/multioutput.py

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

jnothman · 2020-01-09T08:16:28Z

Tests are failing @agamemnonc. Let us know if we can assist.

agamemnonc · 2020-01-09T11:00:38Z

Thanks @jnothman.

I think I might actually need some help here as I am not really sure why tests are failing and the reports seem a bit obscure to me.

I have run tests locally us, which pass, so I am not really sure what the issue is?

Thanks again!

doc/whats_new/v0.22.rst

agamemnonc · 2020-01-10T09:49:01Z

@jnothman can you please have a look again?

Any suggestions on how to keep code coverage above the required threshold also welcome!

cmarmo · 2020-05-15T08:14:24Z

Hi @agamemnonc, are you still interested in finalizing this PR? If yes, could you please resolve the conflicts with master? I've checked the codecov report and couldn't find any problem related to your changes, a sync will clarify if there is still work to do in testing. Thanks for your patience!

agamemnonc · 2020-05-22T19:40:27Z

Hi @agamemnonc, are you still interested in finalizing this PR? If yes, could you please resolve the conflicts with master? I've checked the codecov report and couldn't find any problem related to your changes, a sync will clarify if there is still work to do in testing. Thanks for your patience!

Sure, I can do this when I find some time -- hopefully next week.

This reverts commit 95c4bf8.

agamemnonc · 2021-01-31T09:58:54Z

@cmarmo Conflicts with main have now been resolved (apologies for the delay in doing so), so this is now ready to merge as far as I am concerned. You can remove the "stalled" label if you wish.

@thomasjpfan @glemaitre pinging you again to have another look as there has been an update on the docstring example.

cmarmo · 2021-02-01T08:23:43Z

Hi @agamemnonc , thanks for your work! Version 0.24 has been released before Christmas, could you please put the changelog entry in v1.0.rst? Thanks!
Also, the failing check is not related with your pull request (see #19312).

…nto main

agamemnonc · 2021-02-01T17:57:11Z

@cmarmo thanks for looking at this.
I have now moved the changelog to v1.0.rst as per your request -- apologies for the oversight on my side.

Please note that this PR breaks backward compatibility (predict_proba previously returned a np array, now returns a list of arrays). Just noting in case you want to include it in the v1.0 Milestone.

glemaitre · 2021-04-10T13:07:38Z

I have a couple of questions regarding supporting this use case indeed. It would be nice to have some insights if anyone has some.

By adding support to multioutput, I have the impression that we don't solve the same problem as stated in the documentation, we create a link between the output and not between class labels (I also commented there: #19853 (comment)). But this is not really clear to me what this involves (I never played with these classification problems). If there are differences, we should change the documentation as well.

Please note that this PR breaks backward compatibility (predict_proba previously returned a np array, now returns a list of arrays)

I think that we should be able to be back-compatible. We could detect beforehand which type of target we have and we can output a (n_samples, n_classes) for multilabel and list of (n_samples, n_classes) for mutioutput.

agamemnonc · 2021-04-10T19:33:59Z

@glemaitre am not sure I am exactly following your points, but this is likely due to confusion regarding the terminology. To avoid any further confusion, I will follow here the convention outlined in our user guide, that is, multi-output classification problems with binary outputs --> "multilabel" and multi-output problems with number of classes > 2 --> "multi-class multioutput" or simply "multioutput".

we create a link between the output and not between class labels

In multi-output problems (either multi-label or multi-class multi-output), Classifier Chains indeed create links between the outputs (binary or multi-class, respectively). For example you predict the second output by adding the first output in the feature space. I am afraid I am not sure what you mean by links between "class labels" in this case?

You are right in that in the original paper Classifier Chains were proposed for multi-label (binary) problems. However this needs not be restrictive, and indeed it is straightforward to extend them to multi-class multi-output. Personally, I think it would be great to support this case too -- in my own research, I have come across multi-class multi-output classification problems where there is structure between the outputs, and in such cases Classifier Chains came in handy. If we decide to provide such support, I would think that ClassifierChain and MultiOutputClassifier should have exactly the same behaviour (as it would be reasonable to think that in that case ClassifierChain would be just a special case of the more general MultiOutputClassifier). The reason I opened #9245 and this PR was that although ClassifierChain seemed to support multi-class multi-output (as you correctly point out in #19853) the predict_proba methods of MultiOutputClassifier and ClassifierChain did not behave consistently: MultiOutputClassifier returns a list of (n_samples, n_classes) (len=n_outputs), whereas ClassifierChain returns an array of shape (n_samples, n_outputs) where each element (i,j) corresponds to the probability of the positive class for the j-th output (and i-th sample). If ClassifierChain was constrained to multilabel-indicator (as per your suggestion in #19853) then this should be OK, but if multi-class multi-output is supported then it becomes problematic as there is no way of knowing to which classes the probabilities correspond to.

I think that we should be able to be back-compatible. We could detect beforehand which type of target we have and we can output a (n_samples, n_classes) for multilabel and list of (n_samples, n_classes) for mutioutput.

I think what you are suggesting is indeed feasible, however I just wanted to double-check that in the multi-label case the output should be array of shape (n_samples, n_outputs)? In the multi-output multi-class case, I agree with you that the output should be a list (len=n_outputs) of arrays with shape (n_samples, n_classes).
In any case, one problem I can see with this approach would be that as I said earlier MultiOutputClassifier and ClassifierChain would not be consistent: MultiOutputClassifier predict_proba always returns a list (len=n_outputs) of (n_samples, n_classes) arrays, even in the case where n_classes=2.
To me that would be really problematic: image the scenario where one works on a multi-class multi-output problem and wants to investigate whether taking into account output structure improves performance or not. What they would be probably want to do would be to compare the performance of MultiOutputClassifier (naive approach) and ClassifierChain where indeed the output structure is considered. In this case, as a user I would expect to be able to seamlessly switch between the two methods without having to change other bits of the code, depending on which method I use.

Happy to further work on this PR once a consensus has been reached regarding whether multi-class multi-output should be supported for ClassifierChain or not. Thanks!

glemaitre · 2021-04-10T20:21:40Z

I think what you are suggesting is indeed feasible, however I just wanted to double-check that in the multi-label case the output should be array of shape (n_samples, n_outputs)?

You are right and I think that we are going to the rabbit hole: #2451.
I think that we should start to make think consistent and a list of ndarray will be fine. It would not be a problem if our scorer can handle this format as well.

However, we should need to have some deprecation. I will start to write some common tests to check what are the current status in scikit-learn.

My proposal will be to implement the multiclass-multioutput as you propose and keep the array format for multilabel-indicator. A separate PR should take care to introduce a common test to check the consistency of the array format.

agamemnonc · 2021-04-10T20:24:02Z

I just had another thought regarding the support of multi-class multi-output for ClassifierChain (this comment is perhaps more relevant to #19853). In the case of multi-output multi-class (and assuming that the output is encoded using a LabelEncoder), when the output values are added to the feature space, these will take the form of a variable encoded with an OrdinalEncoder (vs. for example OneHotEncoder). This could be problematic for some families of base classifiers, e.g., LogisticRegression, LinearDiscriminantAnalysis etc. as in those cases the order of the classes in the output label encoding scheme (which have now become inputs encoded with OrdinalEncoder) would affect their behaviour..
It might be possible to address this issue by detecting such cases and converting output features to one-hot-encoding, but I am not really sure...

glemaitre · 2021-04-10T20:33:46Z

In multi-output problems (either multi-label or multi-class multi-output), Classifier Chains indeed create links between the outputs (binary or multi-class, respectively). For example you predict the second output by adding the first output in the feature space. I am afraid I am not sure what you mean by links between "class labels" in this case?

My bad, I confused multilabel and one-hot encoded multiclass (multilabel with a single 1 per line). So the extension seems correct.

glemaitre · 2021-04-10T20:37:40Z

It might be possible to address this issue by detecting such cases and converting output features to one-hot-encoding, but I am not really sure...

Yes this is right. In this case, we might want to have a keyword specifying the encoding strategy; for instance encode that could be on of {"ordinal", "onehot", "onehot-sparse"}.

creatorrr · 2022-12-09T05:49:32Z

Has this lagged behind main too much?

creatorrr · 2022-12-10T04:50:39Z

this would be very useful, I would be happy to pitch in effort!

agamemnonc · 2023-01-06T22:52:49Z

@creatorrr Please go ahead and take this forward, I am afraid I do not currently have time to commit to this

WestedCrean · 2023-09-27T12:55:48Z

Hello, happy to contribute if needed. From what I've read it is still needed since #9245, #13338, #13339 are open? @agamemnonc

agamemnonc · 2023-09-27T13:35:11Z

@WestedCrean please feel free to pick it up. I submitted a PR a couple of times but review was either delayed or not made, unfortunately I don't have time to spend on this atm

adrinjalali · 2024-03-06T09:37:00Z

Closing as needs a new PR.

agamemnonc added 6 commits August 14, 2019 08:27

fix predict_proba and decision_function

8fe643a

fix tests

86b263e

Merge branch 'master' into classifier_chain_multiclass

32965fa

remove multioutput_only tag

b4babca

update whats_new

d6b35a7

Merge branch 'master' into classifier_chain_multiclass

cc632f3

Merge branch 'master' into classifier_chain_multiclass

eb3276a

agamemnonc changed the title ~~[Fix] Classifier chain multiclass predict_proba and decision_function~~ [MRG] Classifier chain multiclass predict_proba and decision_function Sep 29, 2019

glemaitre reviewed Nov 15, 2019

View reviewed changes

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved

sklearn/multioutput.py Outdated Show resolved Hide resolved

sklearn/multioutput.py Outdated Show resolved Hide resolved

agamemnonc and others added 2 commits November 15, 2019 14:48

merge and resolve conflicts

22a856d

nitpicks

dccbf6d

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

agamemnonc closed this Jan 3, 2020

agamemnonc reopened this Jan 3, 2020

Merge branch 'master' into classifier_chain_multiclass

11f09eb

thomasjpfan reviewed Jan 9, 2020

View reviewed changes

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved

agamemnonc added 2 commits January 10, 2020 08:53

update docs

408d595

fix whatsnew

95c4bf8

amueller mentioned this pull request May 29, 2020

Bad error messages in ClassifierChain on multioutput multiclass #13339

Open

agamemnonc added 3 commits August 20, 2020 17:03

Merge scikit-learn:master

52a559d

Merge branch 'master' into classifier_chain_multiclass

c271cf4

Revert "fix whatsnew"

6d8cf2a

This reverts commit 95c4bf8.

cmarmo added the Stalled label Dec 29, 2020

Base automatically changed from master to main January 22, 2021 10:51

agamemnonc added 2 commits January 31, 2021 10:21

Fix whatsnew formatting issue

d561756

Update in 0.24 vs. 1.0

f019810

cmarmo removed the Stalled label Jan 31, 2021

agamemnonc added 3 commits February 1, 2021 17:48

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

54c3fdd

…nto main

Merge branch 'main' into classifier_chain_multiclass

67f4986

Move changelog from v0.24.rst to v1.0.rst

e49c043

cmarmo added the Waiting for Reviewer label Feb 1, 2021

cmarmo added the Breaking Change Issue resolution would not be easily handled by the usual deprecation cycle. label Feb 1, 2021

jnothman mentioned this pull request Apr 10, 2021

ClassifierChain should only accept multilabel-indicator #19853

Open

glemaitre changed the title ~~[MRG] Classifier chain multiclass predict_proba and decision_function~~ [MRG] ENH add support for multiclass-multioutput to ClassifierChain Apr 12, 2021

cmarmo added the module:multioutput label Jan 16, 2022

cmarmo added Needs Decision Requires decision and removed Waiting for Reviewer labels Feb 22, 2022

glemaitre removed their request for review January 11, 2023 13:33

adrinjalali closed this Mar 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

agamemnonc commented Aug 14, 2019 •

edited

Loading

agamemnonc commented Aug 19, 2019

agamemnonc commented Nov 15, 2019

glemaitre left a comment

jnothman commented Jan 9, 2020

agamemnonc commented Jan 9, 2020

agamemnonc commented Jan 10, 2020

cmarmo commented May 15, 2020

agamemnonc commented May 22, 2020

agamemnonc commented Jan 31, 2021 •

edited

Loading

cmarmo commented Feb 1, 2021

agamemnonc commented Feb 1, 2021 •

edited

Loading

glemaitre commented Apr 10, 2021

agamemnonc commented Apr 10, 2021 •

edited

Loading

glemaitre commented Apr 10, 2021

agamemnonc commented Apr 10, 2021 •

edited

Loading

glemaitre commented Apr 10, 2021

glemaitre commented Apr 10, 2021 •

edited

Loading

creatorrr commented Dec 9, 2022

creatorrr commented Dec 10, 2022

agamemnonc commented Jan 6, 2023

WestedCrean commented Sep 27, 2023

agamemnonc commented Sep 27, 2023

adrinjalali commented Mar 6, 2024

[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

Conversation

agamemnonc commented Aug 14, 2019 • edited Loading

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

agamemnonc commented Aug 19, 2019

agamemnonc commented Nov 15, 2019

glemaitre left a comment

Choose a reason for hiding this comment

jnothman commented Jan 9, 2020

agamemnonc commented Jan 9, 2020

agamemnonc commented Jan 10, 2020

cmarmo commented May 15, 2020

agamemnonc commented May 22, 2020

agamemnonc commented Jan 31, 2021 • edited Loading

cmarmo commented Feb 1, 2021

agamemnonc commented Feb 1, 2021 • edited Loading

glemaitre commented Apr 10, 2021

agamemnonc commented Apr 10, 2021 • edited Loading

glemaitre commented Apr 10, 2021

agamemnonc commented Apr 10, 2021 • edited Loading

glemaitre commented Apr 10, 2021

glemaitre commented Apr 10, 2021 • edited Loading

creatorrr commented Dec 9, 2022

creatorrr commented Dec 10, 2022

agamemnonc commented Jan 6, 2023

WestedCrean commented Sep 27, 2023

agamemnonc commented Sep 27, 2023

adrinjalali commented Mar 6, 2024

agamemnonc commented Aug 14, 2019 •

edited

Loading

agamemnonc commented Jan 31, 2021 •

edited

Loading

agamemnonc commented Feb 1, 2021 •

edited

Loading

agamemnonc commented Apr 10, 2021 •

edited

Loading

agamemnonc commented Apr 10, 2021 •

edited

Loading

glemaitre commented Apr 10, 2021 •

edited

Loading