
[WIP] Classifier Chain for multi-label problems #3727


Closed
wants to merge 8 commits into scikit-learn:master from lazywei:multi-label

Conversation

@lazywei (Contributor) commented Oct 1, 2014

Hi,

My research project mainly focuses on multi-label classification, but I found there is only limited support for multi-label classification in scikit-learn (there is a multiclass module, but multi-label classification has some different perspectives and thus can use some different algorithms).

Therefore, I'd like to help implement some novel algorithms for MLC in sklearn. I plan to

  1. Add some multi-label classification datasets to sklearn, e.g. mulan's MLC datasets. (I have added the "emotions" dataset this time, but I wonder whether I should create a new sub-module for these multi-label datasets, or place them in base.py?)
  2. Implement some basic MLC algorithms
  3. Implement some novel multi-label cost-sensitive classification algorithms, as people in my lab (CL Lab) have some contributions in this area

What do you think?
Should we merge this PR first, or should we keep it until I finish all my work?

Thanks.

@jnothman (Member) commented Oct 1, 2014

Add some multi-label classification datasets to sklearn, e.g. mulan's mlc datasets.

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

Implement some basic MLC algorithms

Your ClassifierChain looks like it is treating the multilabel problem as a series of binary problems, which sklearn.multiclass.OneVsRestClassifier does.

Implement some novel multi-label cost-sensitive classification algorithms

See http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published; although open-source implementations compatible with the scikit-learn API are welcome, just not in this repository.

@jnothman (Member) commented Oct 1, 2014

What may be needed is more examples of interesting multilabel classification issues in the examples collection

@lazywei (Contributor, Author) commented Oct 1, 2014

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

Your ClassifierChain looks like it is treating the multilabel problem as a series of binary problems, which sklearn.multiclass.OneVsRestClassifier does.

sklearn.multiclass.OneVsRestClassifier is used to deal with multi-class classification problems. In general, multi-class means y is an int belonging to {0, 1, 2, ..., k}. On the other hand, a multi-label problem means Y is a subset of {0, 1, 2, ..., k}, e.g. Y = [0, 3, 5], and we can transform it to the indicator form [1, 0, 0, 1, 0, 1].

OvR fits several classifiers and uses them to "vote". When OvR is applied to multi-label problems, each label is treated independently, which also means it can be treated somewhat like a multi-class problem.

However, classifier chain trains the first classifier on X and appends predict(X) to X. It then uses this "new X" to train the second classifier, and so on. In this way, we can leverage the dependencies among labels.
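The training loop described above can be sketched as follows. This is a minimal illustration, not the PR's code: fit_chain and predict_chain are hypothetical names, LogisticRegression is an arbitrary base estimator, and it appends predict(X) during training exactly as described here (the usual formulation of classifier chains appends the true label column instead).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y, base_estimator=None):
    """Fit one binary classifier per label column of Y, in order.

    After each classifier is fit, its predictions on the (augmented) X
    are appended as an extra feature column for the next classifier.
    """
    base = base_estimator if base_estimator is not None else LogisticRegression()
    classifiers, X_aug = [], X
    for j in range(Y.shape[1]):
        clf = clone(base).fit(X_aug, Y[:, j])
        classifiers.append(clf)
        X_aug = np.hstack([X_aug, clf.predict(X_aug).reshape(-1, 1)])
    return classifiers

def predict_chain(classifiers, X):
    """Predict labels one at a time, chaining each prediction into X."""
    X_aug, preds = X, []
    for clf in classifiers:
        p = clf.predict(X_aug).reshape(-1, 1)
        preds.append(p)
        X_aug = np.hstack([X_aug, p])
    return np.hstack(preds)
```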

See http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published; although open-source implementations compatible with the scikit-learn API are welcome, just not in this repository.
What may be needed is more examples of interesting multilabel classification issues in the examples collection

I understand we should always implement "well-established" algorithms in sklearn so that they can be used widely. Therefore, I will not implement fringe algorithms. However, in my opinion, classifier chain is relatively classical in the multi-label context, although its citation count is about 300.

Also, I'd love to help build some general multi-label training / fitting examples :-)

Thanks.

@jnothman (Member) commented Oct 1, 2014

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

There's already a fetch function for the Reuters corpus. The emotions corpus you are contributing here -- even were it gzipped -- is almost 3x the size of the next biggest dataset stored in the repository. This is a repository for code, not data except in special cases as I outlined above.

sklearn.multiclass.OneVsRestClassifier is used to deal with multi-class classification problems.

See http://scikit-learn.org/stable/modules/multiclass.html#multilabel-learning

classifier chain trains the first classifier on X and appends predict(X) to X. It then uses this "new X" to train the second classifier, and so on. In this way, we can leverage the dependencies among labels

Ah. Now I see how this differs. Your code is not very easy to read, which makes the algorithm hard to follow. This may be appropriate for inclusion. I'm not sure whether its citation count is sufficient, but @arjoly may have a better sense of the algorithm's importance.

I don't think you should include the emotions dataset, at least not in the same PR. Could you start a new PR for classifier chains alone, or fix this one such that there is no history of the emotions data in the git branch?

from ..base import BaseEstimator


class CC(BaseEstimator):
Member


Give this a more complete name

@arjoly (Member) commented Oct 1, 2014

Unless these are small datasets used frequently in testing and examples (as boston, iris and digits are), you should instead provide a way to fetch other datasets (see fetch_mldata and fetch_mlcomp for example; are the mulan datasets covered by those repositories?).

It seems the mulan datasets are not covered by mldata & mlcomp. I think it may be helpful to include at least one classical multi-label dataset, as it is not sufficient to use multi-class datasets.

It's already possible to get yeast, scene-classification and siam-competition2007 from mldata. It would be nice to upload all the other mulan datasets to this platform. Unfortunately, I haven't been able to figure out how to upload those to mldata. A fetcher for mulan could also be a possibility.

classifier chain train first classifier on X, and append predict(X) to X. And then use this "new X" to train second classifier, and so on. In this way, we can leverage the dependencies among labels

Ah. Now I see how this differs. Your code is not very easy to read, which makes the algorithm hard to follow. This may be appropriate for inclusion. I'm not sure whether its citation count is sufficient, but @arjoly may have a better sense of the algorithm's importance.

I have mixed feelings toward the classifier chain approach. I think this is a very basic and intuitive idea for multi-label classification. It naturally handles label correlation through the chain. However, the chain order is still an open question. Here you use the order of the labels to make the chain. One option is to randomize the chain order so as to make an ensemble of classifier chains (multi-output bagging + a classifier chain).

Should it use predict, predict_proba or decision_function to make the chain? Note also that an estimator chain could naturally support multioutput classification/regression.

I think that a good example is needed at this stage. Note that @jakevdp has a convincing example for multi-target regression using a chain.
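For reference, the ClassifierChain that eventually landed in sklearn.multioutput (via the PR that superseded this one) supports exactly this randomized ordering through order="random". A minimal ensemble sketch, in which the choice of ten chains, the logistic-regression base estimator, and the 0.5 threshold are all arbitrary:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=100, n_classes=4, random_state=0)

# Ten chains, each fit with a different random label order.
chains = [ClassifierChain(LogisticRegression(), order="random", random_state=i)
          for i in range(10)]
for chain in chains:
    chain.fit(X, Y)

# Average the per-chain probability estimates and threshold at 0.5.
Y_proba = np.mean([chain.predict_proba(X) for chain in chains], axis=0)
Y_pred = (Y_proba >= 0.5).astype(int)
```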

Implement some basic MLC algorithms

What do you want to implement? I think that there are some possibilities, among others: ML-kNN or Bayesian kNN (600 citations), BoosTexter aka AdaBoost MH/RK (1714 citations), and label powerset (see this pr), which could lead to RAkEL for free if we have multi-output bagging. A multilabel neural network (see this pr) is underway.

@lazywei (Contributor, Author) commented Oct 2, 2014

It's already possible to get yeast, scene-classification and siam-competition2007 from mldata. It would be nice to upload all the other mulan datasets to this platform. Unfortunately, I haven't been able to figure out how to upload those to mldata. A fetcher for mulan could also be a possibility.

OK, I can try to implement a mulan fetcher. I think this makes more sense than committing datasets into the git repo; sorry for that 😢

I have mixed feelings toward the classifier chain approach. I think this is a very basic and intuitive idea for multi-label classification. It naturally handles label correlation through the chain. However, the chain order is still an open question. Here you use the order of the labels to make the chain. One option is to randomize the chain order so as to make an ensemble of classifier chains (multi-output bagging + a classifier chain).

Should it use predict, predict_proba or decision_function to make the chain? Note also that an estimator chain could naturally support multioutput classification/regression.

I think that a good example is needed at this stage. Note that @jakevdp has a convincing example for multi-target regression using a chain.

Randomizing the label order to build an ensemble is a brilliant idea. I can try to work on it.
As for regression, I think the extension is intuitive, but we might need a plan for the structure of the multi_label module?

What do you want to implement? I think that there are some possibilities among others for ML-knn or bayesian knn (600 citations), boostexter aka adaboost mh/rk (1714 citations), label power set (see this pr which could lead to rakel for free if we have multi-output bagging. Multilabel neural network (see this pr) is underway.

I can take a look at these algorithms after I finish this work on Classifier Chain.

Thank you so much for so many valuable suggestions @jnothman @arjoly


By the way, nosetests works fine on my laptop. However, Travis CI complains:

ERROR: sklearn.tests.test_common.test_all_estimators('CC', <class 'sklearn.multi_label.classifier_chain.CC'>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7_with_system_site_packages/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 891, in check_parameters_default_constructible
    estimator = Estimator()
TypeError: __init__() takes exactly 2 arguments (1 given)

I'm not sure what I missed. Could you give me some hint?
Again, thank you so much!

@lazywei (Contributor, Author) commented Oct 2, 2014

I don't think you should include the emotions dataset, at least not in the same PR. Could you start a new PR for classifier chains alone, or fix this one such that there is no history of the emotions data in the git branch?

I have removed this commit from this PR. I will try to implement a fetcher for mulan if possible.
Thanks.

@jnothman (Member) commented Oct 2, 2014

Randomizing the label order to build an ensemble is a brilliant idea.

That's something the ensemble of classifier chains does.

As for regression, I think the extension is intuitive, but we might need a plan for the structure of the multi_label module?

There's a multiclass module that already contains multiclass and multilabel algorithms. This should not be separate, although that might be expanded and renamed to fit other generic approaches to such problems.

The extension to non-binary multioutput problems (including regression) is completely reasonable, and I think it would be very nice to experiment with this. But it's hard to say what the gatekeepers will think of this level of novelty in a package that collects together well-loved algorithms.

I'm not sure what I missed. Could you give me some hint?

see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L469
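For context, check_parameters_default_constructible instantiates every discovered estimator with no arguments, so a required base_estimator argument fails it. One possible way around that, assuming the class isn't instead registered among the meta-estimator exceptions that the linked testing.py maintains, is to give the parameter a default:

```python
from sklearn.base import BaseEstimator

class ClassifierChain(BaseEstimator):
    # With a default value, ClassifierChain() is constructible without
    # arguments, which the common test requires; fit() can still raise
    # if no base estimator was actually supplied.
    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator
```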

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 16a960e on lazywei:multi-label into c1eb8f9 on scikit-learn:master.

@jnothman (Member) commented Oct 5, 2014

If this is now a PR for multilabel classifier chains (or more generic multioutput daisy chaining), please update the title and description to reflect that. Thanks.

@lazywei lazywei changed the title Multi label related algorithms Classifier Chain for multi-label problems Oct 6, 2014
@lazywei lazywei changed the title Classifier Chain for multi-label problems [WIP] Classifier Chain for multi-label problems Oct 6, 2014
clf.fit(X, y)
self.classifiers_.append(clf)

X = self._predict_and_chain(clf, X)


I think for training, there is no need to predict the labels, as the labels are given in the training set.

This blog article might be helpful.

Contributor Author


Actually, we do need it because we are doing "classifier chain", and that's exactly what the algorithm needs to do.
Also, I wonder what the relation is between classifier chain and the blog article?

@amueller (Member):

Sorry this stalled for so long. I think it is a good addition. Can you fix the tests, add an example comparing with OvR and add tests and documentation?


def __init__(self, base_estimator):
    self.base_estimator = base_estimator
    self.classifiers_ = []
Member


this should be in fit
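The convention the comment refers to: __init__ should only store the constructor parameters unchanged (so that get_params/clone round-trip), while everything learned gets a trailing underscore and is created in fit. A sketch of the reshuffled skeleton, with the chain-fitting loop itself elided:

```python
from sklearn.base import BaseEstimator

class CC(BaseEstimator):
    def __init__(self, base_estimator=None):
        # Store parameters only; no mutable learned state here.
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Fitted state is created per call, so refitting or cloning
        # never carries over classifiers from a previous fit.
        self.classifiers_ = []
        ...  # chain-fitting loop goes here
        return self
```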

@amueller (Member):

I'm not sure about the placement in the multilabel module, as @jnothman said. Should we put this in the multiclass module, or should we rename the multiclass to multilabel or should we make multilabel an alias for multiclass?

@lazywei (Contributor, Author) commented May 28, 2015

Hi @amueller thanks for replying!

add an example comparing with OvR

I'm not really sure whether we should compare this to OvR, given that OvR is a multi-class algorithm while Classifier Chain is for multi-label. If we really want some baseline comparison, then the proper competitor is Binary Relevance.

I'm not sure about the placement in the multilabel module, as @jnothman said. Should we put this in the multiclass module, or should we rename the multiclass to multilabel or should we make multilabel an alias for multiclass?

As mentioned previously, multi-label is different from multi-class.

For Multi-Class:

  • Every example x is associated with an integer in {1, 2, 3, ... K}

For Multi-Label:

  • Every example x is associated with several integers in {1, 2, 3, ... K}

If we are using label indicator,

  • multi-class: y = [0, 0, 0, 1] (belongs to class 4), y = [0, 1, 0, 0] (belongs to class 2). But not y = [0, 1, 1, 0] (it can't belong to several classes at the same time)
  • multi-label: y = [0, 1, 1, 0] (belongs to class 2 and class 3), or y = [1, 0, 1, 0] (belongs to class 1 and class 3).

Multi-label is especially useful in object detection. For example, we might want to know whether there is a "banana" or "monkey" or "apple" in an image. In that case, each image is associated with several labels -- maybe there are banana + monkey in the image, or apple + banana, or only a monkey.
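The two indicator conventions above map directly onto scikit-learn's binarizers; a small illustration:

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multi-class: one class per sample, so exactly one 1 per indicator row.
Y_mc = LabelBinarizer().fit_transform([4, 2, 1])  # classes seen: 1, 2, 4

# Multi-label: a *set* of classes per sample, so any number of 1s per row.
Y_ml = MultiLabelBinarizer().fit_transform([{2, 3}, {1, 3}, {3}])
```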

@arjoly (Member) commented May 28, 2015

I'm not really sure whether we should compare this to OvR, given that OvR is a multi-class algorithm while Classifier Chain is for multi-label. If we really want some baseline comparison, then the proper competitor is Binary Relevance.

the OvR classifier also does binary relevance in scikit-learn. This could be a bit confusing.
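Concretely: handed a label-indicator matrix, OneVsRestClassifier fits one independent binary classifier per column, which is precisely binary relevance. A minimal sketch (the logistic-regression base estimator and the synthetic data are arbitrary choices):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=4, random_state=0)

# One binary problem per label column, fit independently: no modeling
# of label dependencies, unlike a classifier chain.
br = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
Y_pred = br.predict(X)
```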

@amueller (Member):

@arjoly any thoughts on the modules? Having OvR do binary relevance is even less obvious when it lives in the multiclass module ;)

@amueller (Member):

maybe @mblondel and @GaelVaroquaux have opinions.

@arjoly (Member) commented May 28, 2015

@arjoly any thoughts on the modules? Having OvR do binary relevance is even less obvious when it lives in the multiclass module ;)

Let's add a new module and solve #2451 at the same time.

@lazywei (Contributor, Author) commented May 31, 2015

OK, so the follow-up action items are:

  • Add new module for multi-label
  • Fix testing
  • Add docs / examples
  • Add comparison to BR

Further actions (will be in another new PR):

What do you think? \cc @arjoly @amueller

Thanks

@amueller (Member) commented Jun 1, 2015

I'm not entirely sure we should separate BR, but it would help with the issue of the different module names.

@lazywei (Contributor, Author) commented Jul 3, 2015

It looks like the code fails the tests on Windows ... I don't really have any experience with Python on Windows; could you give me a hint on how to pass the tests? @amueller
Thanks.

@amueller (Member):

I'm slightly surprised. It seems to be some difference in how __all__ works. @ogrisel do you have any idea maybe?

@jnothman (Member):

@amueller wrote an age ago:

I'm slightly surprised. It seems to be some difference in how __all__ works.

It looks rather like something was wrong with the AppVeyor installation process. There is no line "creating ... creating build\lib.win-amd64-2.7\sklearn\multi_label" or copying of the files... I hope that if we rebase and run tests again now, it'll be fine.

@lesteve (Member) commented Jun 29, 2017

Closing since #7602 has been merged.

@lesteve lesteve closed this Jun 29, 2017
7 participants