[WIP] Classifier Chain for multi-label problems #3727
Conversation
Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets.
See http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published; open-source implementations compatible with the scikit-learn API are welcome, just not in this repository.
What may be needed is more examples of interesting multilabel classification problems in the examples collection.
It seems the Mulan datasets are not covered by mldata or mlcomp. I think it may help to include at least one classical multi-label dataset, since multi-class datasets are not sufficient.
OVR fits several classifiers and uses them to "vote". When OVR is applied to multi-label data, it treats each label independently, which means the problem is handled as a collection of separate binary problems. However, a classifier chain trains the first classifier on the original features and each subsequent classifier on the features augmented with the preceding labels, so correlations between labels can be exploited.
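To make the difference concrete, here is a minimal sketch of the chaining idea, assuming a scikit-learn-style binary base estimator; `chain_fit` and `chain_predict` are hypothetical helper names, not part of any API:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def chain_fit(X, Y, base_estimator):
    """Fit one classifier per label column; classifier i sees X plus
    the true values of labels 0..i-1 as extra features."""
    chain, X_aug = [], X
    for i in range(Y.shape[1]):
        chain.append(clone(base_estimator).fit(X_aug, Y[:, i]))
        X_aug = np.hstack([X_aug, Y[:, [i]]])  # append true label column
    return chain

def chain_predict(X, chain):
    """At prediction time, earlier predictions feed the later classifiers."""
    X_aug, preds = X, []
    for clf in chain:
        p = clf.predict(X_aug)
        preds.append(p)
        X_aug = np.hstack([X_aug, p.reshape(-1, 1)])
    return np.column_stack(preds)

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
Y = (X[:, :3] > 0).astype(int)            # 3 binary labels
chain = chain_fit(X, Y, LogisticRegression())
Y_pred = chain_predict(X, chain)          # one column of 0/1 per label
```

In contrast, OvR would fit each of the three classifiers on `X` alone.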
I understand we should always implement "well-established" algorithms in sklearn so they can be used widely, so I will not implement fringe algorithms. However, in my opinion, the classifier chain is relatively classical in the multi-label context, even though its citation count is only about 300. Also, I'd love to help build some general multi-label training/fitting examples :-) Thanks.
There's already a fetch function for the Reuters corpus. The emotions corpus you are contributing here -- even were it gzipped -- is almost 3x the size of the next biggest dataset stored in the repository. This is a repository for code, not data except in special cases as I outlined above.
See http://scikit-learn.org/stable/modules/multiclass.html#multilabel-learning
Ah. Now I see how this differs. It's not easy to understand the algorithm from reading your code. This may be appropriate for inclusion. I'm not sure whether its citation count is sufficient, but @arjoly may have a better sense of the algorithm's importance. I don't think you should include the emotions dataset, at least not in the same PR. Could you start a new PR for classifier chains alone, or fix this one so that there is no history of the emotions data in the git branch?
from ..base import BaseEstimator


class CC(BaseEstimator):
Give this a more complete name
It's already possible to get
I have mixed feelings toward the classifier chain approach. Should it use predict, predict_proba or decision_function to make the chain? Note also that an estimator chain could naturally support multioutput classification/regression. I think a good example is needed at this stage. Note that @jakevdp has a convincing example of multi-target regression using a chain.
What do you want to implement? I think there are several possibilities, among others: ML-kNN or Bayesian kNN (600 citations), BoosTexter aka AdaBoost.MH/.RK (1714 citations), and the label powerset (see this PR), which could lead to RAkEL for free if we have multi-output bagging. A multilabel neural network (see this PR) is underway.
OK, I can try to implement a Mulan fetcher. I think this makes more sense than committing datasets into the git repo, sorry for that 😢
Randomizing the label order to build an ensemble is a brilliant idea. I can try to work on it.
I can take a look at these algorithms after I finish this work on the classifier chain. Thank you so much for so many valuable suggestions @jnothman @arjoly. By the way, nosetests works fine on my laptop. However, Travis CI complains:
I'm not sure what I missed. Could you give me a hint?
I have removed this commit from this PR. I will try to implement a fetcher for Mulan if possible.
That's something the ensemble of classifier chains does.
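For reference, the ensemble idea can be sketched with the `ClassifierChain` that was eventually merged in #7602; its `order='random'` option gives each chain its own label ordering, and averaging the per-chain probabilities washes out the effect of any single (possibly unlucky) order:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=100, n_classes=4,
                                      random_state=0)
# Five chains, each fitted with a different random label order.
chains = [ClassifierChain(LogisticRegression(max_iter=1000),
                          order='random', random_state=s).fit(X, Y)
          for s in range(5)]
# Average the probability estimates across chains, then threshold.
Y_prob = np.mean([c.predict_proba(X) for c in chains], axis=0)
Y_pred = (Y_prob >= 0.5).astype(int)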
The extension to non-binary multioutput problems (including regression) is completely reasonable, and I think it would be very nice to experiment with this. But it's hard to say what the gatekeepers will think of this level of novelty in a package that collects well-loved algorithms.
see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L469
If this is now a PR for multilabel classifier chains (or more generic multioutput daisy chaining), please update the title and description to reflect that. Thanks.
clf.fit(X, y)
self.classifiers_.append(clf)

X = self._predict_and_chain(clf, X)
I think for training there is no need to predict the labels, since the true labels are given in the training set.
This blog article might be helpful.
Actually, we do need it, because we are doing a "classifier chain", and that's exactly what the algorithm needs to do.
Also, I wonder what the relation is between the classifier chain and the blog article?
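The two training strategies being debated can be put side by side. This is a sketch with synthetic data; the variable names are mine, not the PR's:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(60, 5)
Y = (rng.rand(60, 3) > 0.5).astype(int)   # 3 binary labels
base = LogisticRegression()

# Variant A: augment the training features with the TRUE previous label,
# as the original classifier chain paper (Read et al., 2009) does.
X_true = np.hstack([X, Y[:, [0]]])
clf_a = clone(base).fit(X_true, Y[:, 1])

# Variant B: augment with the previous classifier's PREDICTIONS, which
# propagates training-time errors but matches what predict() will see.
clf_0 = clone(base).fit(X, Y[:, 0])
X_pred = np.hstack([X, clf_0.predict(X).reshape(-1, 1)])
clf_b = clone(base).fit(X_pred, Y[:, 1])
```

Both variants produce a valid chain; they differ only in how the second classifier's extra feature column is filled in during training.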
Sorry this stalled for so long. I think it is a good addition. Can you fix the tests, add an example comparing with OvR, and add tests and documentation?
def __init__(self, base_estimator):
    self.base_estimator = base_estimator
    self.classifiers_ = []
this should be in fit
I'm not sure about the placement in the multilabel module, as @jnothman said. Should we put this in the multiclass module, or should we rename multiclass to multilabel, or should we make multilabel an alias for multiclass?
Hi @amueller thanks for replying!
I'm not really sure we should compare this to OvR, given that OvR is a multi-class algorithm while the classifier chain is for multi-label. If we really want some baseline comparison, then the proper competitor is binary relevance.
As mentioned previously, multi-label is different from multi-class. For multi-class, each sample belongs to exactly one of several mutually exclusive classes. For multi-label, each sample may be associated with any subset of the labels. If we are using a label indicator matrix, each row is a binary vector marking which labels apply to that sample.
Multi-label is especially useful in object detection. For example, we might want to know whether there is a "banana", a "monkey" or an "apple" in an image. In such a case, each image is associated with several labels: maybe there are banana + monkey in the image, maybe apple + banana, or maybe only monkey.
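The banana/monkey/apple example translates directly into a label indicator matrix, for instance with `MultiLabelBinarizer`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each image may carry any subset of the labels.
images = [{"banana", "monkey"}, {"apple", "banana"}, {"monkey"}]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(images)
print(mlb.classes_)  # ['apple' 'banana' 'monkey']
print(Y)             # one row per image, one 0/1 column per label
```

Each row of `Y` is the binary vector described above, which is exactly the format the per-label classifiers consume.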
The OvR classifier also does binary relevance in scikit-learn. This could be a bit confusing.
@arjoly any thoughts on the modules? Having OvR do binary relevance is even less obvious when it lives in the multiclass module ;)
Maybe @mblondel and @GaelVaroquaux have opinions.
OK, so the follow-up action items are:
Further actions (will be in another new PR):
What do you think? \cc @arjoly @amueller Thanks
I'm not entirely sure we should separate BR, but it would help with the issue of the different module names.
It looks like the code fails the tests on Windows... I don't really have any experience with Python on Windows; could you give me a hint how to pass the tests? @amueller
I'm slightly surprised. It seems to be some difference in how |
@amueller wrote an age ago:
It looks rather like something was wrong with the AppVeyor installation process. There is no line "creating ... creating build\lib.win-amd64-2.7\sklearn\multi_label" or copying of the files... I hope that if we rebase and run tests again now, it'll be fine.
Closing since #7602 has been merged. |
Hi,
My research project mainly focuses on multi-label classification, but I found there is only limited support for multi-label classification in scikit-learn (there is a multiclass module, but multi-label has some different perspectives and thus can use some different algorithms).
Therefore, I'd like to help implement some novel algorithms for MLC in sklearn. I plan to
base.py
What do you think?
Should we merge this PR first, or should we keep it open until I finish all my work?
Thanks.