[MRG] ENH add support for multiclass-multioutput to ClassifierChain #14654

Closed

Conversation

agamemnonc
Contributor

@agamemnonc agamemnonc commented Aug 14, 2019

Reference Issues/PRs

Fixes #9245
Fixes #13338
Fixes #13339

What does this implement/fix? Explain your changes.

predict_proba and decision_function of ClassifierChain now return a list of n_outputs arrays, each of shape (n_samples, n_classes). This format is consistent with that of MultiOutputClassifier.
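
For illustration, a minimal sketch (not from the PR itself) of the target format, using MultiOutputClassifier, whose predict_proba already returns a list of per-output arrays; the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.randn(21, 4)
# multiclass-multioutput target: 2 outputs, 3 classes each
Y = np.column_stack([np.arange(21) % 3, (np.arange(21) + 1) % 3])

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
proba = clf.predict_proba(X)

# a list of n_outputs arrays, each of shape (n_samples, n_classes)
print(type(proba).__name__, len(proba), proba[0].shape)
```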

Any other comments?

The multioutput_only tag has now been removed. I wonder whether we should keep the _skip_test tag; tests pass even when this tag is removed.

This change will break backward compatibility, since the functions previously returned arrays rather than lists. Any ideas on how to handle this?

@agamemnonc
Contributor Author

@jnothman @amueller just pinging you as you were the core developers mostly involved in the discussion of the issues this PR is aiming to fix.

@agamemnonc agamemnonc changed the title [Fix] Classifier chain multiclass predict_proba and decision_function [MRG] Classifier chain multiclass predict_proba and decision_function Sep 29, 2019
@agamemnonc
Contributor Author

@jnothman @glemaitre @amueller
Is this something we are interested in or shall I go ahead and close it?

Member

@glemaitre glemaitre left a comment

It seems that this is relevant. However, I am not really comfortable reviewing things involving multioutput-multiclass since I have never used it. You could remove _skip_test to check which common tests are failing.
If @jnothman @amueller could have a look, that would be great (I will learn from it :))

agamemnonc and others added 2 commits November 15, 2019 14:48
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@agamemnonc agamemnonc closed this Jan 3, 2020
@agamemnonc agamemnonc reopened this Jan 3, 2020
@jnothman
Member

jnothman commented Jan 9, 2020

Tests are failing @agamemnonc. Let us know if we can assist.

@agamemnonc
Contributor Author

Thanks @jnothman.

I think I might actually need some help here as I am not really sure why tests are failing and the reports seem a bit obscure to me.

I have run the tests locally and they pass, so I am not really sure what the issue is.

Thanks again!

@agamemnonc
Contributor Author

@jnothman can you please have a look again?

Any suggestions on how to keep code coverage above the required threshold also welcome!

@cmarmo
Contributor

cmarmo commented May 15, 2020

Hi @agamemnonc, are you still interested in finalizing this PR? If yes, could you please resolve the conflicts with master? I've checked the codecov report and couldn't find any problem related to your changes, a sync will clarify if there is still work to do in testing. Thanks for your patience!

@agamemnonc
Contributor Author

Sure, I can do this when I find some time -- hopefully next week.

@cmarmo cmarmo added the Stalled label Dec 29, 2020
Base automatically changed from master to main January 22, 2021 10:51
@agamemnonc
Contributor Author

agamemnonc commented Jan 31, 2021

@cmarmo Conflicts with main have now been resolved (apologies for the delay in doing so), so this is now ready to merge as far as I am concerned. You can remove the "stalled" label if you wish.

@thomasjpfan @glemaitre pinging you again to have another look as there has been an update on the docstring example.

@cmarmo cmarmo removed the Stalled label Jan 31, 2021
@cmarmo
Contributor

cmarmo commented Feb 1, 2021

Hi @agamemnonc, thanks for your work! Version 0.24 was released before Christmas; could you please put the changelog entry in v1.0.rst? Thanks!
Also, the failing check is not related to your pull request (see #19312).

@agamemnonc
Contributor Author

agamemnonc commented Feb 1, 2021

@cmarmo thanks for looking at this.
I have now moved the changelog to v1.0.rst as per your request -- apologies for the oversight on my side.

Please note that this PR breaks backward compatibility (predict_proba previously returned a np array, now returns a list of arrays). Just noting in case you want to include it in the v1.0 Milestone.

@cmarmo cmarmo added the Breaking Change Issue resolution would not be easily handled by the usual deprecation cycle. label Feb 1, 2021
@glemaitre
Member

Indeed, I have a couple of questions regarding supporting this use case. It would be nice to have some insights if anyone has any.

By adding support for multioutput, I have the impression that we don't solve the same problem as stated in the documentation: we create a link between the output and not between class labels (I also commented there: #19853 (comment)). But it is not really clear to me what this involves (I have never played with these classification problems). If there are differences, we should change the documentation as well.

Please note that this PR breaks backward compatibility (predict_proba previously returned a np array, now returns a list of arrays)

I think that we should be able to stay backward-compatible. We could detect beforehand which type of target we have, and output an array of shape (n_samples, n_classes) for multilabel and a list of (n_samples, n_classes) arrays for multioutput.
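
A sketch of how that detection could look, using the existing type_of_target helper; the branching on its result is only an assumption about how the PR might be reworked:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

Y_multilabel = np.array([[0, 1], [1, 0], [1, 1]])    # binary indicator matrix
Y_multioutput = np.array([[0, 2], [1, 0], [2, 1]])   # more than 2 classes per output

kind_ml = type_of_target(Y_multilabel)    # 'multilabel-indicator'
kind_mo = type_of_target(Y_multioutput)   # 'multiclass-multioutput'
print(kind_ml, kind_mo)
```

predict_proba could then return the legacy array for 'multilabel-indicator' targets and a list of arrays for 'multiclass-multioutput' ones.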

@agamemnonc
Contributor Author

agamemnonc commented Apr 10, 2021

@glemaitre I am not sure I am exactly following your points, but this is likely due to confusion over terminology. To avoid any further confusion, I will follow here the convention outlined in our user guide: multi-output classification problems with binary outputs --> "multilabel", and multi-output problems with number of classes > 2 --> "multi-class multioutput" or simply "multioutput".

we create a link between the output and not between class labels

In multi-output problems (either multi-label or multi-class multi-output), Classifier Chains indeed create links between the outputs (binary or multi-class, respectively). For example, you predict the second output by adding the first output to the feature space. I am afraid I am not sure what you mean by links between "class labels" in this case?
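
The chaining mechanism on a multilabel target, with today's API; the base estimator, chain order, and synthetic data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
Y = (rng.randn(30, 3) > 0).astype(int)   # multilabel-indicator target

# the estimator for output 1 sees X augmented with output 0,
# the estimator for output 2 sees X augmented with outputs 0 and 1, ...
chain = ClassifierChain(LogisticRegression(), order=[0, 1, 2], random_state=0)
chain.fit(X, Y)
pred = chain.predict(X)
print(pred.shape)
```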

You are right that in the original paper Classifier Chains were proposed for multi-label (binary) problems. However, this need not be restrictive, and indeed it is straightforward to extend them to multi-class multi-output. Personally, I think it would be great to support this case too -- in my own research, I have come across multi-class multi-output classification problems where there is structure between the outputs, and in such cases Classifier Chains came in handy. If we decide to provide such support, I would expect ClassifierChain and MultiOutputClassifier to have exactly the same behaviour (as it would be reasonable to view ClassifierChain as just a special case of the more general MultiOutputClassifier).

The reason I opened #9245 and this PR was that although ClassifierChain seemed to support multi-class multi-output (as you correctly point out in #19853), the predict_proba methods of MultiOutputClassifier and ClassifierChain did not behave consistently: MultiOutputClassifier returns a list (len=n_outputs) of (n_samples, n_classes) arrays, whereas ClassifierChain returns an array of shape (n_samples, n_outputs) where each element (i, j) corresponds to the probability of the positive class for the j-th output (and i-th sample). If ClassifierChain were constrained to multilabel-indicator (as per your suggestion in #19853) this would be OK, but if multi-class multi-output is supported it becomes problematic, as there is no way of knowing which classes the probabilities correspond to.

I think that we should be able to be back-compatible. We could detect beforehand which type of target we have and we can output a (n_samples, n_classes) for multilabel and list of (n_samples, n_classes) for mutioutput.

I think what you are suggesting is indeed feasible; however, I just wanted to double-check that in the multi-label case the output should be an array of shape (n_samples, n_outputs)? In the multi-output multi-class case, I agree with you that the output should be a list (len=n_outputs) of arrays with shape (n_samples, n_classes).
In any case, one problem I can see with this approach is that, as I said earlier, MultiOutputClassifier and ClassifierChain would not be consistent: MultiOutputClassifier's predict_proba always returns a list (len=n_outputs) of (n_samples, n_classes) arrays, even in the case where n_classes=2.
To me that would be really problematic: imagine the scenario where one works on a multi-class multi-output problem and wants to investigate whether taking output structure into account improves performance. What they would probably want to do is compare the performance of MultiOutputClassifier (the naive approach) and ClassifierChain, where the output structure is indeed considered. In this case, as a user I would expect to be able to switch seamlessly between the two methods, depending on which one I use, without having to change other parts of the code.

Happy to further work on this PR once a consensus has been reached regarding whether multi-class multi-output should be supported for ClassifierChain or not. Thanks!

@glemaitre
Member

I think what you are suggesting is indeed feasible; however, I just wanted to double-check that in the multi-label case the output should be an array of shape (n_samples, n_outputs)?

You are right, and I think that we are going down the rabbit hole: #2451.
I think that we should start making things consistent, and a list of ndarrays would be fine. It would not be a problem provided our scorers can handle this format as well.

However, we would need some deprecation. I will start writing some common tests to check the current status in scikit-learn.

My proposal would be to implement multiclass-multioutput support as you propose and keep the array format for multilabel-indicator. A separate PR should introduce a common test to check the consistency of the array format.

@agamemnonc
Contributor Author

agamemnonc commented Apr 10, 2021

I just had another thought regarding the support of multi-class multi-output for ClassifierChain (this comment is perhaps more relevant to #19853). In the multi-output multi-class case (and assuming that the output is encoded using a LabelEncoder), when the output values are added to the feature space they take the form of a variable encoded with an OrdinalEncoder (vs., for example, a OneHotEncoder). This could be problematic for some families of base classifiers, e.g., LogisticRegression, LinearDiscriminantAnalysis, etc., as in those cases the order of the classes in the output label encoding scheme (which have now become inputs encoded with an OrdinalEncoder) would affect their behaviour.
It might be possible to address this issue by detecting such cases and converting output features to one-hot-encoding, but I am not really sure...

@glemaitre
Member

In multi-output problems (either multi-label or multi-class multi-output), Classifier Chains indeed create links between the outputs (binary or multi-class, respectively). For example you predict the second output by adding the first output in the feature space. I am afraid I am not sure what you mean by links between "class labels" in this case?

My bad, I confused multilabel and one-hot encoded multiclass (multilabel with a single 1 per row). So the extension seems correct.

@glemaitre
Member

glemaitre commented Apr 10, 2021

It might be possible to address this issue by detecting such cases and converting output features to one-hot-encoding, but I am not really sure...

Yes, this is right. In this case, we might want to have a keyword specifying the encoding strategy; for instance, encode, which could be one of {"ordinal", "onehot", "onehot-sparse"}.

@glemaitre glemaitre changed the title [MRG] Classifier chain multiclass predict_proba and decision_function [MRG] ENH add support for multiclass-multioutput to ClassifierChain Apr 12, 2021
@cmarmo cmarmo added Needs Decision Requires decision and removed Waiting for Reviewer labels Feb 22, 2022
@creatorrr

Has this lagged behind main too much?

@creatorrr

This would be very useful; I would be happy to pitch in!

@agamemnonc
Contributor Author

@creatorrr Please go ahead and take this forward; I am afraid I do not currently have time to commit to this.

@glemaitre glemaitre removed their request for review January 11, 2023 13:33
@WestedCrean

Hello, happy to contribute if needed. From what I've read, this is still needed since #9245, #13338, and #13339 are open? @agamemnonc

@agamemnonc
Contributor Author

@WestedCrean please feel free to pick it up. I submitted a PR a couple of times, but review was either delayed or never completed; unfortunately I don't have time to spend on this at the moment.

@adrinjalali
Member

Closing as needs a new PR.

@adrinjalali adrinjalali closed this Mar 6, 2024