
[WIP] Classifier Chain for multi-label problems #3727


Closed
wants to merge 8 commits into scikit-learn:master from lazywei:multi-label

Conversation

@lazywei (Contributor) commented Oct 1, 2014

Hi,

My research project mainly focuses on multi-label classification, but I found there is only limited support for multi-label classification in scikit-learn (there is a multiclass module, but multi-label classification has some different perspectives and thus can use some different algorithms).

Therefore, I'd like to help implement some novel algorithms for MLC in sklearn. I plan to

  1. Add some multi-label classification datasets to sklearn, e.g. mulan's MLC datasets. (I have added the "emotions" dataset this time, but I wonder whether I should create a new sub-module for these multi-label datasets, or place them in base.py?)
  2. Implement some basic MLC algorithms
  3. Implement some novel multi-label cost-sensitive classification algorithms, as people in my lab (CL Lab) have some contributions in this area

What do you think?
Should we merge this PR first, or should we keep it until I finish all my work?

Thanks.

@jnothman (Member) commented Oct 1, 2014

Add some multi-label classification datasets to sklearn, e.g. mulan's mlc datasets.

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

Implement some basic MLC algorithms

Your ClassifierChain looks like it is treating the multilabel problem as a series of binary problems, which sklearn.multiclass.OneVsRestClassifier does.

Implement some novel multi-label cost-sensitive classification algorithms

See http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published; although open-source implementations compatible with the scikit-learn API are welcome, just not in this repository.

@jnothman (Member) commented Oct 1, 2014

What may be needed is more examples of interesting multilabel classification issues in the examples collection

@lazywei (Contributor, Author) commented Oct 1, 2014

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

Your ClassifierChain looks like it is treating the multilabel problem as a series of binary problems, which sklearn.multiclass.OneVsRestClassifier does.

sklearn.multiclass.OneVsRestClassifier is used to deal with multi-class classification problems. In general, multi-class means y is an int belonging to {0, 1, 2, ..., k}. On the other hand, a multi-label problem means Y is a subset of {0, 1, 2, ..., k}, e.g. Y = [0, 3, 5], and we can transform it to the indicator form [1, 0, 0, 1, 0, 1].

OvR fits several classifiers and uses them to "vote". When OvR is applied to multi-label problems, each label is treated independently, which also means it can be treated somewhat like a multi-class problem.

However, classifier chain trains the first classifier on X and appends predict(X) to X. It then uses this "new X" to train the second classifier, and so on. In this way, we can leverage the dependencies among labels.
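The training loop described above can be sketched as follows. This is a minimal illustration, not the PR's code: fit_chain and predict_chain are hypothetical names, LogisticRegression is an arbitrary base estimator, and it appends predict(X) during training exactly as described here (the usual formulation of classifier chains appends the true label column instead).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y, base_estimator=None):
    """Fit one binary classifier per label column of Y, in order.

    After each classifier is fit, its predictions on the (augmented) X
    are appended as an extra feature column for the next classifier.
    """
    base = base_estimator if base_estimator is not None else LogisticRegression()
    classifiers, X_aug = [], X
    for j in range(Y.shape[1]):
        clf = clone(base).fit(X_aug, Y[:, j])
        classifiers.append(clf)
        X_aug = np.hstack([X_aug, clf.predict(X_aug).reshape(-1, 1)])
    return classifiers

def predict_chain(classifiers, X):
    """Predict labels one at a time, chaining each prediction into X."""
    X_aug, preds = X, []
    for clf in classifiers:
        p = clf.predict(X_aug).reshape(-1, 1)
        preds.append(p)
        X_aug = np.hstack([X_aug, p])
    return np.hstack(preds)
```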

See http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published; although open-source implementations compatible with the scikit-learn API are welcome, just not in this repository.
What may be needed is more examples of interesting multilabel classification issues in the examples collection

I understand we should always implement "well-established" algorithms in sklearn so that they can be used widely. Therefore, I will not implement fringe algorithms. However, in my opinion, classifier chain is relatively classical in the multi-label context, although its citation count is about 300.

Also, I'd love to help build some general multi-label training / fitting examples :-)

Thanks.

@jnothman (Member) commented Oct 1, 2014

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

There's already a fetch function for the Reuters corpus. The emotions corpus you are contributing here -- even were it gzipped -- is almost 3x the size of the next biggest dataset stored in the repository. This is a repository for code, not data except in special cases as I outlined above.

sklearn.multiclass.OneVsRestClassifier is used to deal with multi-class classification problems.

See http://scikit-learn.org/stable/modules/multiclass.html#multilabel-learning

classifier chain trains the first classifier on X and appends predict(X) to X. It then uses this "new X" to train the second classifier, and so on. In this way, we can leverage the dependencies among labels

Ah. Now I see how this differs. Your code is not very easy to read, which makes the algorithm hard to follow. This may be appropriate for inclusion. I'm not sure whether its citation count is sufficient, but @arjoly may have a better sense of the algorithm's importance.

I don't think you should include the emotions dataset, at least not in the same PR. Could you start a new PR for classifier chains alone, or fix this one such that there is no history of the emotions data in the git branch?

from ..base import BaseEstimator


class CC(BaseEstimator):
Member


Give this a more complete name

@arjoly (Member) commented Oct 1, 2014

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

It's already possible to get yeast, scene-classification and siam-competition2007 from mldata. It would be nice to upload all the other mulan datasets to this platform. Unfortunately, I haven't been able to figure out how to upload those to mldata. A fetcher for mulan could also be a possibility.

classifier chain train first classifier on X, and append predict(X) to X. And then use this "new X" to train second classifier, and so on. In this way, we can leverage the dependencies among labels

Ah. Now I see how this differs. Your code is not very easy to read, which makes the algorithm hard to follow. This may be appropriate for inclusion. I'm not sure whether its citation count is sufficient, but @arjoly may have a better sense of the algorithm's importance.

I have mixed feelings toward the classifier chain approach. I think this is a very basic and intuitive idea for multi-label classification. It naturally handles label correlation through the chain. However, the chain order is still an open question. Here you use the order of the labels to make the chain. One option is to randomize the chain order so as to make an ensemble of classifier chains (multi-output bagging + a classifier chain).

Should it use predict, predict_proba or decision_function to make the chain? Note also that an estimator chain could naturally support multioutput classification/regression.

I think that a good example is needed at this stage. Note that @jakevdp has a convincing example for multi-target regression using a chain.
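For reference, the ClassifierChain that eventually landed in sklearn.multioutput (via the PR that superseded this one) supports exactly this randomized ordering through order="random". A minimal ensemble sketch, in which the choice of ten chains, the logistic-regression base estimator, and the 0.5 threshold are all arbitrary:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=100, n_classes=4, random_state=0)

# Ten chains, each fit with a different random label order.
chains = [ClassifierChain(LogisticRegression(), order="random", random_state=i)
          for i in range(10)]
for chain in chains:
    chain.fit(X, Y)

# Average the per-chain probability estimates and threshold at 0.5.
Y_proba = np.mean([chain.predict_proba(X) for chain in chains], axis=0)
Y_pred = (Y_proba >= 0.5).astype(int)
```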

Implement some basic MLC algorithms

What do you want to implement? I think that there are some possibilities, among others: ML-kNN or Bayesian kNN (600 citations), BoosTexter aka AdaBoost MH/RK (1714 citations), and label powerset (see this pr), which could lead to RAkEL for free if we have multi-output bagging. A multilabel neural network (see this pr) is underway.

@lazywei (Contributor, Author) commented Oct 2, 2014

It's already possible to get yeast, scene-classification and siam-competition2007 from mldata. It would be nice to upload all the other mulan datasets to this platform. Unfortunately, I haven't been able to figure out how to upload those to mldata. A fetcher for mulan could also be a possibility.

OK, I can try to implement a mulan fetcher. I think this makes more sense than committing datasets into the git repo; sorry for that 😢

I have mixed feelings toward the classifier chain approach. I think this is a very basic and intuitive idea for multi-label classification. It naturally handles label correlation through the chain. However, the chain order is still an open question. Here you use the order of the labels to make the chain. One option is to randomize the chain order so as to make an ensemble of classifier chains (multi-output bagging + a classifier chain).

Should it use predict, predict_proba or decision_function to make the chain? Note also that an estimator chain could naturally support multioutput classification/regression.

I think that a good example is needed at this stage. Note that @jakevdp has a convincing example for multi-target regression using a chain.

Randomizing the label order to build an ensemble is a brilliant idea. I can try to work on it.
As for regression, I think the extension is intuitive, but we might need a plan for the structure of the multi_label module?

What do you want to implement? I think that there are some possibilities among others for ML-knn or bayesian knn (600 citations), boostexter aka adaboost mh/rk (1714 citations), label power set (see this pr which could lead to rakel for free if we have multi-output bagging. Multilabel neural network (see this pr) is underway.

I can take a look at these algorithms after I finish this work on Classifier Chain.

Thank you so much for so many valuable suggestions @jnothman @arjoly


By the way, nosetests works fine on my laptop. However, Travis CI complains:

ERROR: sklearn.tests.test_common.test_all_estimators('CC', <class 'sklearn.multi_label.classifier_chain.CC'>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7_with_system_site_packages/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 891, in check_parameters_default_constructible
    estimator = Estimator()
TypeError: __init__() takes exactly 2 arguments (1 given)

I'm not sure what I missed. Could you give me some hint?
Again, thank you so much!

@lazywei (Contributor, Author) commented Oct 2, 2014

I don't think you should include the emotions dataset, at least not in the same PR. Could you start a new PR for classifier chains alone, or fix this one such that there is no history of the emotions data in the git branch?

I have removed this commit from this PR. I will try to implement a fetcher for mulan if possible.
Thanks.

@jnothman (Member) commented Oct 2, 2014

Randomizing the label order to build an ensemble is a brilliant idea.

That's something the ensemble of classifier chains does.

As for regression, I think the extension is intuitive, but we might need a plan for the structure of the multi_label module?

There's a multiclass module that already contains multiclass and multilabel algorithms. This should not be separate, although that might be expanded and renamed to fit other generic approaches to such problems.

The extension to non-binary multioutput problems (including regression) is completely reasonable, and I think it would be very nice to experiment with this. But it's hard to say what the gatekeepers will think of this level of novelty in a package that collects together well-loved algorithms.

I'm not sure what I missed. Could you give me some hint?

see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L469
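For context, check_parameters_default_constructible instantiates every discovered estimator with no arguments, so a required base_estimator argument fails it. One possible way around that, assuming the class isn't instead registered among the meta-estimator exceptions that the linked testing.py maintains, is to give the parameter a default:

```python
from sklearn.base import BaseEstimator

class ClassifierChain(BaseEstimator):
    # With a default value, ClassifierChain() is constructible without
    # arguments, which the common test requires; fit() can still raise
    # if no base estimator was actually supplied.
    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator
```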

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 16a960e on lazywei:multi-label into c1eb8f9 on scikit-learn:master.

@jnothman (Member) commented Oct 5, 2014

If this is now a PR for multilabel classifier chains (or more generic multioutput daisy chaining), please update the title and description to reflect that. Thanks.

@lazywei lazywei changed the title Multi label related algorithms Classifier Chain for multi-label problems Oct 6, 2014
@lazywei lazywei changed the title Classifier Chain for multi-label problems [WIP] Classifier Chain for multi-label problems Oct 6, 2014
clf.fit(X, y)
self.classifiers_.append(clf)

X = self._predict_and_chain(clf, X)


I think for training, there is no need to predict the labels, as the labels are given in the training set.

This blog article might be helpful.

Contributor Author


Actually, we do need it because we are doing "classifier chain", and that's exactly what the algorithm needs to do.
Also, I wonder what the relation is between classifier chain and the blog article?

@amueller (Member):

Sorry this stalled for so long. I think it is a good addition. Can you fix the tests, add an example comparing with OvR and add tests and documentation?


def __init__(self, base_estimator):
    self.base_estimator = base_estimator
    self.classifiers_ = []
Member


this should be in fit
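The convention the comment refers to: __init__ should only store the constructor parameters unchanged (so that get_params/clone round-trip), while everything learned gets a trailing underscore and is created in fit. A sketch of the reshuffled skeleton, with the chain-fitting loop itself elided:

```python
from sklearn.base import BaseEstimator

class CC(BaseEstimator):
    def __init__(self, base_estimator=None):
        # Store parameters only; no mutable learned state here.
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Fitted state is created per call, so refitting or cloning
        # never carries over classifiers from a previous fit.
        self.classifiers_ = []
        ...  # chain-fitting loop goes here
        return self
```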

@amueller (Member):

I'm not sure about the placement in the multilabel module, as @jnothman said. Should we put this in the multiclass module, or should we rename the multiclass to multilabel or should we make multilabel an alias for multiclass?

@lazywei (Contributor, Author) commented May 28, 2015

Hi @amueller thanks for replying!

add an example comparing with OvR

I'm not really sure whether we should compare this to OvR, given that OvR is a multi-class algorithm while Classifier Chain is for multi-label. If we really want some baseline comparison, then the proper competitor is Binary Relevance.

I'm not sure about the placement in the multilabel module, as @jnothman said. Should we put this in the multiclass module, or should we rename the multiclass to multilabel or should we make multilabel an alias for multiclass?

As mentioned previously, multi-label is different from multi-class.

For Multi-Class:

  • Every example x is associated with an integer in {1, 2, 3, ... K}

For Multi-Label:

  • Every example x is associated with several integers in {1, 2, 3, ... K}

If we are using label indicator,

  • multi-class: y = [0, 0, 0, 1] (belongs to class 4), y = [0, 1, 0, 0] (belongs to class 2). But not y = [0, 1, 1, 0] (it can't belong to several classes at the same time)
  • multi-label: y = [0, 1, 1, 0] (belongs to class 2 and class 3), or y = [1, 0, 1, 0] (belongs to class 1 and class 3).

Multi-label is especially useful in object detection. For example, we might want to know whether there is a "banana" or "monkey" or "apple" in an image. In that case, each image is associated with several labels -- maybe there are banana + monkey in the image, or apple + banana, or only a monkey.
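The two indicator conventions above map directly onto scikit-learn's binarizers; a small illustration:

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multi-class: one class per sample, so exactly one 1 per indicator row.
Y_mc = LabelBinarizer().fit_transform([4, 2, 1])  # classes seen: 1, 2, 4

# Multi-label: a *set* of classes per sample, so any number of 1s per row.
Y_ml = MultiLabelBinarizer().fit_transform([{2, 3}, {1, 3}, {3}])
```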

@arjoly (Member) commented May 28, 2015

I'm not really sure whether we should compare this to OvR, given that OvR is a multi-class algorithm while Classifier Chain is for multi-label. If we really want some baseline comparison, then the proper competitor is Binary Relevance.

the OvR classifier also does binary relevance in scikit-learn. This could be a bit confusing.
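Concretely: handed a label-indicator matrix, OneVsRestClassifier fits one independent binary classifier per column, which is precisely binary relevance. A minimal sketch (the logistic-regression base estimator and the synthetic data are arbitrary choices):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=4, random_state=0)

# One binary problem per label column, fit independently: no modeling
# of label dependencies, unlike a classifier chain.
br = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
Y_pred = br.predict(X)
```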

@amueller (Member):

@arjoly any thoughts on the modules? Having OvR do binary relevance is even less obvious when it lives in the multiclass module ;)

@amueller (Member):

maybe @mblondel and @GaelVaroquaux have opinions.

@arjoly (Member) commented May 28, 2015

@arjoly any thoughts on the modules? Having OvR do binary relevance is even less obvious when it lives in the multiclass module ;)

Let's add a new module and solve #2451 at the same time.

@lazywei (Contributor, Author) commented May 31, 2015

OK, so the follow-up action items are:

  • Add new module for multi-label
  • Fix testing
  • Add docs / examples
  • Add comparison to BR

Further actions (will be in another new PR):

What do you think? \cc @arjoly @amueller

Thanks

@amueller (Member) commented Jun 1, 2015

I'm not entirely sure we should separate BR, but it would help with the issue of the different module names.

@lazywei (Contributor, Author) commented Jul 3, 2015

It looks like the code fails the tests on Windows ... I don't really have any experience with Python on Windows; could you give me a hint on how to pass the tests? @amueller
Thanks.

@amueller (Member):

I'm slightly surprised. It seems to be some difference in how __all__ works. @ogrisel do you have any idea maybe?

@jnothman (Member):

@amueller wrote an age ago:

I'm slightly surprised. It seems to be some difference in how __all__ works.

It looks rather like something was wrong with the AppVeyor installation process. There is no line "creating ... creating build\lib.win-amd64-2.7\sklearn\multi_label" or copying of the files... I hope that if we rebase and run tests again now, it'll be fine.

@lesteve (Member) commented Jun 29, 2017

Closing since #7602 has been merged.

@lesteve lesteve closed this Jun 29, 2017
7 participants