Commit 87e7a3e

Pushing the docs to dev/ for branch: main, commit beaf4bb4a3a979b7509983706c74e5284591097c
1 parent 0ee4fb7 commit 87e7a3e

File tree

1,320 files changed (+6396 / -5973 lines)


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 2dc85a82fd50b8bc03c0661f70c07bfc
+config: 1a20111001535570e5c0fe3cd42d43d9
 tags: 645f666f9bcd5a90fca523b33c5a78b7

dev/_downloads/05ca8a4e90b4cc2acd69f9e24b4a1f3a/plot_classifier_chain_yeast.ipynb

Lines changed: 81 additions & 2 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Classifier Chain\nExample of using classifier chain on a multilabel dataset.\n\nFor this example we will use the [yeast](https://www.openml.org/d/40597) dataset which contains\n2417 datapoints each with 103 features and 14 possible labels. Each\ndata point has at least one label. As a baseline we first train a logistic\nregression classifier for each of the 14 labels. To evaluate the performance of\nthese classifiers we predict on a held-out test set and calculate the\n`jaccard score <jaccard_similarity_score>` for each sample.\n\nNext we create 10 classifier chains. Each classifier chain contains a\nlogistic regression model for each of the 14 labels. The models in each\nchain are ordered randomly. In addition to the 103 features in the dataset,\neach model gets the predictions of the preceding models in the chain as\nfeatures (note that by default at training time each model gets the true\nlabels as features). These additional features allow each chain to exploit\ncorrelations among the classes. The Jaccard similarity score for each chain\ntends to be greater than that of the set independent logistic models.\n\nBecause the models in each chain are arranged randomly there is significant\nvariation in performance among the chains. Presumably there is an optimal\nordering of the classes in a chain that will yield the best performance.\nHowever we do not know that ordering a priori. Instead we can construct an\nvoting ensemble of classifier chains by averaging the binary predictions of\nthe chains and apply a threshold of 0.5. The Jaccard similarity score of the\nensemble is greater than that of the independent models and tends to exceed\nthe score of each chain in the ensemble (although this is not guaranteed\nwith randomly ordered chains).\n"
+"\n# Multilabel classification using a classifier chain\nThis example shows how to use :class:`~sklearn.multioutput.ClassifierChain` to solve\na multilabel classification problem.\n\nThe most naive strategy to solve such a task is to independently train a binary\nclassifier on each label (i.e. each column of the target variable). At prediction\ntime, the ensemble of binary classifiers is used to assemble multitask prediction.\n\nThis strategy does not allow to model relationship between different tasks. The\n:class:`~sklearn.multioutput.ClassifierChain` is the meta-estimator (i.e. an estimator\ntaking an inner estimator) that implements a more advanced strategy. The ensemble\nof binary classifiers are used as a chain where the prediction of a classifier in the\nchain is used as a feature for training the next classifier on a new label. Therefore,\nthese additional features allow each chain to exploit correlations among labels.\n\nThe `Jaccard similarity <jaccard_similarity_score>` score for chain tends to be\ngreater than that of the set independent base models.\n"
 ]
 },
 {
@@ -15,7 +15,86 @@
 },
 "outputs": [],
 "source": [
-"# Author: Adam Kleczewski\n# License: BSD 3 clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.multiclass import OneVsRestClassifier\nfrom sklearn.multioutput import ClassifierChain\n\n# Load a multi-label dataset from https://www.openml.org/d/40597\nX, Y = fetch_openml(\"yeast\", version=4, return_X_y=True, parser=\"pandas\")\nY = Y == \"TRUE\"\nX_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)\n\n# Fit an independent logistic regression model for each class using the\n# OneVsRestClassifier wrapper.\nbase_lr = LogisticRegression()\novr = OneVsRestClassifier(base_lr)\novr.fit(X_train, Y_train)\nY_pred_ovr = ovr.predict(X_test)\novr_jaccard_score = jaccard_score(Y_test, Y_pred_ovr, average=\"samples\")\n\n# Fit an ensemble of logistic regression classifier chains and take the\n# take the average prediction of all the chains.\nchains = [ClassifierChain(base_lr, order=\"random\", random_state=i) for i in range(10)]\nfor chain in chains:\n chain.fit(X_train, Y_train)\n\nY_pred_chains = np.array([chain.predict(X_test) for chain in chains])\nchain_jaccard_scores = [\n jaccard_score(Y_test, Y_pred_chain >= 0.5, average=\"samples\")\n for Y_pred_chain in Y_pred_chains\n]\n\nY_pred_ensemble = Y_pred_chains.mean(axis=0)\nensemble_jaccard_score = jaccard_score(\n Y_test, Y_pred_ensemble >= 0.5, average=\"samples\"\n)\n\nmodel_scores = [ovr_jaccard_score] + chain_jaccard_scores\nmodel_scores.append(ensemble_jaccard_score)\n\nmodel_names = (\n \"Independent\",\n \"Chain 1\",\n \"Chain 2\",\n \"Chain 3\",\n \"Chain 4\",\n \"Chain 5\",\n \"Chain 6\",\n \"Chain 7\",\n \"Chain 8\",\n \"Chain 9\",\n \"Chain 10\",\n \"Ensemble\",\n)\n\nx_pos = np.arange(len(model_names))\n\n# Plot the Jaccard similarity scores for the independent model, each of the\n# chains, and the ensemble (note that the vertical axis on this plot does\n# not begin at 0).\n\nfig, ax = plt.subplots(figsize=(7, 4))\nax.grid(True)\nax.set_title(\"Classifier Chain Ensemble Performance Comparison\")\nax.set_xticks(x_pos)\nax.set_xticklabels(model_names, rotation=\"vertical\")\nax.set_ylabel(\"Jaccard Similarity Score\")\nax.set_ylim([min(model_scores) * 0.9, max(model_scores) * 1.1])\ncolors = [\"r\"] + [\"b\"] * len(chain_jaccard_scores) + [\"g\"]\nax.bar(x_pos, model_scores, alpha=0.5, color=colors)\nplt.tight_layout()\nplt.show()"
+"# Author: Adam Kleczewski\n# License: BSD 3 clause"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Loading a dataset\nFor this example, we use the [yeast](https://www.openml.org/d/40597) dataset which contains\n2,417 datapoints each with 103 features and 14 possible labels. Each\ndata point has at least one label. As a baseline we first train a logistic\nregression classifier for each of the 14 labels. To evaluate the performance of\nthese classifiers we predict on a held-out test set and calculate the\nJaccard similarity for each sample.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.model_selection import train_test_split\n\n# Load a multi-label dataset from https://www.openml.org/d/40597\nX, Y = fetch_openml(\"yeast\", version=4, return_X_y=True, parser=\"pandas\")\nY = Y == \"TRUE\"\nX_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Fit models\nWe fit :class:`~sklearn.linear_model.LogisticRegression` wrapped by\n:class:`~sklearn.multiclass.OneVsRestClassifier` and ensemble of multiple\n:class:`~sklearn.multioutput.ClassifierChain`.\n\n### LogisticRegression wrapped by OneVsRestClassifier\nSince by default :class:`~sklearn.linear_model.LogisticRegression` can't\nhandle data with multiple targets, we need to use\n:class:`~sklearn.multiclass.OneVsRestClassifier`.\nAfter fitting the model we calculate Jaccard similarity.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"from sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.multiclass import OneVsRestClassifier\n\nbase_lr = LogisticRegression()\novr = OneVsRestClassifier(base_lr)\novr.fit(X_train, Y_train)\nY_pred_ovr = ovr.predict(X_test)\novr_jaccard_score = jaccard_score(Y_test, Y_pred_ovr, average=\"samples\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Chain of binary classifiers\nBecause the models in each chain are arranged randomly there is significant\nvariation in performance among the chains. Presumably there is an optimal\nordering of the classes in a chain that will yield the best performance.\nHowever, we do not know that ordering a priori. Instead, we can build a\nvoting ensemble of classifier chains by averaging the binary predictions of\nthe chains and apply a threshold of 0.5. The Jaccard similarity score of the\nensemble is greater than that of the independent models and tends to exceed\nthe score of each chain in the ensemble (although this is not guaranteed\nwith randomly ordered chains).\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"from sklearn.multioutput import ClassifierChain\n\nchains = [ClassifierChain(base_lr, order=\"random\", random_state=i) for i in range(10)]\nfor chain in chains:\n chain.fit(X_train, Y_train)\n\nY_pred_chains = np.array([chain.predict_proba(X_test) for chain in chains])\nchain_jaccard_scores = [\n jaccard_score(Y_test, Y_pred_chain >= 0.5, average=\"samples\")\n for Y_pred_chain in Y_pred_chains\n]\n\nY_pred_ensemble = Y_pred_chains.mean(axis=0)\nensemble_jaccard_score = jaccard_score(\n Y_test, Y_pred_ensemble >= 0.5, average=\"samples\"\n)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Plot results\nPlot the Jaccard similarity scores for the independent model, each of the\nchains, and the ensemble (note that the vertical axis on this plot does\nnot begin at 0).\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"model_scores = [ovr_jaccard_score] + chain_jaccard_scores + [ensemble_jaccard_score]\n\nmodel_names = (\n \"Independent\",\n \"Chain 1\",\n \"Chain 2\",\n \"Chain 3\",\n \"Chain 4\",\n \"Chain 5\",\n \"Chain 6\",\n \"Chain 7\",\n \"Chain 8\",\n \"Chain 9\",\n \"Chain 10\",\n \"Ensemble\",\n)\n\nx_pos = np.arange(len(model_names))\n\nfig, ax = plt.subplots(figsize=(7, 4))\nax.grid(True)\nax.set_title(\"Classifier Chain Ensemble Performance Comparison\")\nax.set_xticks(x_pos)\nax.set_xticklabels(model_names, rotation=\"vertical\")\nax.set_ylabel(\"Jaccard Similarity Score\")\nax.set_ylim([min(model_scores) * 0.9, max(model_scores) * 1.1])\ncolors = [\"r\"] + [\"b\"] * len(chain_jaccard_scores) + [\"g\"]\nax.bar(x_pos, model_scores, alpha=0.5, color=colors)\nplt.tight_layout()\nplt.show()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Results interpretation\nThere are three main takeaways from this plot:\n\n- Independent model wrapped by :class:`~sklearn.multiclass.OneVsRestClassifier`\n performs worse than the ensemble of classifier chains and some of individual chains.\n This is caused by the fact that the logistic regression doesn't model relationship\n between the labels.\n- :class:`~sklearn.multioutput.ClassifierChain` takes advantage of correlation\n among labels but due to random nature of labels ordering, it could yield worse\n result than an independent model.\n- An ensemble of chains performs better because it not only captures relationship\n between labels but also does not make strong assumptions about their correct order.\n\n"
 ]
 }
 ],
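The notebook cells above describe the chaining strategy in prose. As a side note for readers of this diff, the following standalone sketch is not part of the commit: it uses a small synthetic dataset from make_multilabel_classification instead of the yeast data, and the parameter values and variable names are illustrative. It compares independent one-vs-rest classifiers against a single classifier chain, the two strategies contrasted in the new example text.

# A minimal sketch of the comparison described above, on synthetic data.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Each row of Y is a binary indicator vector over 5 labels; every sample
# has at least one label, as in the yeast dataset.
X, Y = make_multilabel_classification(
    n_samples=500, n_features=20, n_classes=5, allow_unlabeled=False, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Baseline: one independent logistic regression per label.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)

# Chain: each label's classifier also receives the previous labels as features.
chain = ClassifierChain(
    LogisticRegression(max_iter=1000), order="random", random_state=0
).fit(X_train, Y_train)

for name, model in [("independent", ovr), ("chain", chain)]:
    score = jaccard_score(Y_test, model.predict(X_test), average="samples")
    print(f"{name}: per-sample Jaccard similarity = {score:.3f}")

The relative scores will vary with the random seed and dataset; the committed example runs the same comparison on the real yeast data with an ensemble of ten chains.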
Binary file not shown.
Binary file not shown.
Lines changed: 87 additions & 47 deletions
@@ -1,71 +1,94 @@
 """
-============================
-Classifier Chain
-============================
-Example of using classifier chain on a multilabel dataset.
-
-For this example we will use the `yeast
-<https://www.openml.org/d/40597>`_ dataset which contains
-2417 datapoints each with 103 features and 14 possible labels. Each
-data point has at least one label. As a baseline we first train a logistic
-regression classifier for each of the 14 labels. To evaluate the performance of
-these classifiers we predict on a held-out test set and calculate the
-:ref:`jaccard score <jaccard_similarity_score>` for each sample.
-
-Next we create 10 classifier chains. Each classifier chain contains a
-logistic regression model for each of the 14 labels. The models in each
-chain are ordered randomly. In addition to the 103 features in the dataset,
-each model gets the predictions of the preceding models in the chain as
-features (note that by default at training time each model gets the true
-labels as features). These additional features allow each chain to exploit
-correlations among the classes. The Jaccard similarity score for each chain
-tends to be greater than that of the set independent logistic models.
-
-Because the models in each chain are arranged randomly there is significant
-variation in performance among the chains. Presumably there is an optimal
-ordering of the classes in a chain that will yield the best performance.
-However we do not know that ordering a priori. Instead we can construct an
-voting ensemble of classifier chains by averaging the binary predictions of
-the chains and apply a threshold of 0.5. The Jaccard similarity score of the
-ensemble is greater than that of the independent models and tends to exceed
-the score of each chain in the ensemble (although this is not guaranteed
-with randomly ordered chains).
-
+==================================================
+Multilabel classification using a classifier chain
+==================================================
+This example shows how to use :class:`~sklearn.multioutput.ClassifierChain` to solve
+a multilabel classification problem.
+
+The most naive strategy to solve such a task is to independently train a binary
+classifier on each label (i.e. each column of the target variable). At prediction
+time, the ensemble of binary classifiers is used to assemble multitask prediction.
+
+This strategy does not allow to model relationship between different tasks. The
+:class:`~sklearn.multioutput.ClassifierChain` is the meta-estimator (i.e. an estimator
+taking an inner estimator) that implements a more advanced strategy. The ensemble
+of binary classifiers are used as a chain where the prediction of a classifier in the
+chain is used as a feature for training the next classifier on a new label. Therefore,
+these additional features allow each chain to exploit correlations among labels.
+
+The :ref:`Jaccard similarity <jaccard_similarity_score>` score for chain tends to be
+greater than that of the set independent base models.
 """
 
 # Author: Adam Kleczewski
 # License: BSD 3 clause
 
+# %%
+# Loading a dataset
+# -----------------
+# For this example, we use the `yeast
+# <https://www.openml.org/d/40597>`_ dataset which contains
+# 2,417 datapoints each with 103 features and 14 possible labels. Each
+# data point has at least one label. As a baseline we first train a logistic
+# regression classifier for each of the 14 labels. To evaluate the performance of
+# these classifiers we predict on a held-out test set and calculate the
+# Jaccard similarity for each sample.
+
 import matplotlib.pyplot as plt
 import numpy as np
 
 from sklearn.datasets import fetch_openml
-from sklearn.linear_model import LogisticRegression
-from sklearn.metrics import jaccard_score
 from sklearn.model_selection import train_test_split
-from sklearn.multiclass import OneVsRestClassifier
-from sklearn.multioutput import ClassifierChain
 
 # Load a multi-label dataset from https://www.openml.org/d/40597
 X, Y = fetch_openml("yeast", version=4, return_X_y=True, parser="pandas")
 Y = Y == "TRUE"
 X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
 
-# Fit an independent logistic regression model for each class using the
-# OneVsRestClassifier wrapper.
+# %%
+# Fit models
+# ----------
+# We fit :class:`~sklearn.linear_model.LogisticRegression` wrapped by
+# :class:`~sklearn.multiclass.OneVsRestClassifier` and ensemble of multiple
+# :class:`~sklearn.multioutput.ClassifierChain`.
+#
+# LogisticRegression wrapped by OneVsRestClassifier
+# **************************************************
+# Since by default :class:`~sklearn.linear_model.LogisticRegression` can't
+# handle data with multiple targets, we need to use
+# :class:`~sklearn.multiclass.OneVsRestClassifier`.
+# After fitting the model we calculate Jaccard similarity.
+
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import jaccard_score
+from sklearn.multiclass import OneVsRestClassifier
+
 base_lr = LogisticRegression()
 ovr = OneVsRestClassifier(base_lr)
 ovr.fit(X_train, Y_train)
 Y_pred_ovr = ovr.predict(X_test)
 ovr_jaccard_score = jaccard_score(Y_test, Y_pred_ovr, average="samples")
 
-# Fit an ensemble of logistic regression classifier chains and take the
-# take the average prediction of all the chains.
+# %%
+# Chain of binary classifiers
+# ***************************
+# Because the models in each chain are arranged randomly there is significant
+# variation in performance among the chains. Presumably there is an optimal
+# ordering of the classes in a chain that will yield the best performance.
+# However, we do not know that ordering a priori. Instead, we can build a
+# voting ensemble of classifier chains by averaging the binary predictions of
+# the chains and apply a threshold of 0.5. The Jaccard similarity score of the
+# ensemble is greater than that of the independent models and tends to exceed
+# the score of each chain in the ensemble (although this is not guaranteed
+# with randomly ordered chains).
+
+from sklearn.multioutput import ClassifierChain
+
 chains = [ClassifierChain(base_lr, order="random", random_state=i) for i in range(10)]
 for chain in chains:
     chain.fit(X_train, Y_train)
 
-Y_pred_chains = np.array([chain.predict(X_test) for chain in chains])
+Y_pred_chains = np.array([chain.predict_proba(X_test) for chain in chains])
 chain_jaccard_scores = [
     jaccard_score(Y_test, Y_pred_chain >= 0.5, average="samples")
     for Y_pred_chain in Y_pred_chains
@@ -76,8 +99,14 @@
     Y_test, Y_pred_ensemble >= 0.5, average="samples"
 )
 
-model_scores = [ovr_jaccard_score] + chain_jaccard_scores
-model_scores.append(ensemble_jaccard_score)
+# %%
+# Plot results
+# ------------
+# Plot the Jaccard similarity scores for the independent model, each of the
+# chains, and the ensemble (note that the vertical axis on this plot does
+# not begin at 0).
+
+model_scores = [ovr_jaccard_score] + chain_jaccard_scores + [ensemble_jaccard_score]
 
 model_names = (
     "Independent",
@@ -96,10 +125,6 @@
 
 x_pos = np.arange(len(model_names))
 
-# Plot the Jaccard similarity scores for the independent model, each of the
-# chains, and the ensemble (note that the vertical axis on this plot does
-# not begin at 0).
-
 fig, ax = plt.subplots(figsize=(7, 4))
 ax.grid(True)
 ax.set_title("Classifier Chain Ensemble Performance Comparison")
@@ -111,3 +136,18 @@
 ax.bar(x_pos, model_scores, alpha=0.5, color=colors)
 plt.tight_layout()
 plt.show()
+
+# %%
+# Results interpretation
+# ----------------------
+# There are three main takeaways from this plot:
+#
+# - Independent model wrapped by :class:`~sklearn.multiclass.OneVsRestClassifier`
+# performs worse than the ensemble of classifier chains and some of individual chains.
+# This is caused by the fact that the logistic regression doesn't model relationship
+# between the labels.
+# - :class:`~sklearn.multioutput.ClassifierChain` takes advantage of correlation
+# among labels but due to random nature of labels ordering, it could yield worse
+# result than an independent model.
+# - An ensemble of chains performs better because it not only captures relationship
+# between labels but also does not make strong assumptions about their correct order.

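The main behavioural change in the Python source above is that each chain's predictions are now collected with predict_proba instead of predict before averaging. The short standalone sketch below is not part of the commit; the probability values are made up. It shows the ensemble step the example uses: average the chains' probabilities, apply the 0.5 threshold, and score with the per-sample Jaccard similarity.

# Standalone illustration of the ensemble step, with invented numbers.
import numpy as np

from sklearn.metrics import jaccard_score

# What `chain.predict_proba(X_test)` could return for three hypothetical
# chains, two test samples and three labels.
Y_pred_chains = np.array(
    [
        [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]],
        [[0.7, 0.4, 0.5], [0.2, 0.9, 0.3]],
        [[0.8, 0.1, 0.3], [0.3, 0.6, 0.6]],
    ]
)
Y_true = np.array([[1, 0, 1], [0, 1, 0]])

# Averaging soft probabilities lets a confident chain outweigh an unsure one
# before the threshold is applied.
Y_pred_ensemble = Y_pred_chains.mean(axis=0) >= 0.5

# average="samples": Jaccard index (|intersection| / |union| of the predicted
# and true label sets) computed per sample, then averaged over samples.
print(jaccard_score(Y_true, Y_pred_ensemble, average="samples"))  # 0.75 here

Averaging hard 0/1 predictions and thresholding at 0.5, as the old code did, amounts to a majority vote among the chains; averaging probabilities instead lets more confident chains carry more weight.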
dev/_downloads/scikit-learn-docs.zip (13.9 KB)
Binary file not shown.
(Further binary files changed in this commit; only their size deltas were captured, ranging from -601 bytes to +4.68 KB.)

0 commit comments
