Commit cad9bc6

Pushing the docs for revision for branch: master, commit 1054e072da3235d6befcbd0d3cd2625f1b4e6473

1 parent 1f4d00b commit cad9bc6
File tree

1,447 files changed

+5670
-5104
lines changed


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: b90c712f3add20a5f92735ef28d2c523
+config: 91e11510374afa0b84e8436171ed9db9
 tags: 645f666f9bcd5a90fca523b33c5a78b7

1020 KB: Binary file not shown.
770 KB: Binary file not shown.

dev/_downloads/bicluster_newsgroups.ipynb

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Biclustering documents with the Spectral Co-clustering algorithm\n\n\nThis example demonstrates the Spectral Co-clustering algorithm on the\ntwenty newsgroups dataset. The 'comp.os.ms-windows.misc' category is\nexcluded because it contains many posts containing nothing but data.\n\nThe TF-IDF vectorized posts form a word frequency matrix, which is\nthen biclustered using Dhillon's Spectral Co-Clustering algorithm. The\nresulting document-word biclusters indicate subsets words used more\noften in those subsets documents.\n\nFor a few of the best biclusters, its most common document categories\nand its ten most important words get printed. The best biclusters are\ndetermined by their normalized cut. The best words are determined by\ncomparing their sums inside and outside the bicluster.\n\nFor comparison, the documents are also clustered using\nMiniBatchKMeans. The document clusters derived from the biclusters\nachieve a better V-measure than clusters found by MiniBatchKMeans.\n\nOutput::\n\n Vectorizing...\n Coclustering...\n Done in 9.53s. V-measure: 0.4455\n MiniBatchKMeans...\n Done in 12.00s. V-measure: 0.3309\n\n Best biclusters:\n ----------------\n bicluster 0 : 1951 documents, 4373 words\n categories : 23% talk.politics.guns, 19% talk.politics.misc, 14% sci.med\n words : gun, guns, geb, banks, firearms, drugs, gordon, clinton, cdt, amendment\n\n bicluster 1 : 1165 documents, 3304 words\n categories : 29% talk.politics.mideast, 26% soc.religion.christian, 25% alt.atheism\n words : god, jesus, christians, atheists, kent, sin, morality, belief, resurrection, marriage\n\n bicluster 2 : 2219 documents, 2830 words\n categories : 18% comp.sys.mac.hardware, 16% comp.sys.ibm.pc.hardware, 16% comp.graphics\n words : voltage, dsp, board, receiver, circuit, shipping, packages, stereo, compression, package\n\n bicluster 3 : 1860 documents, 2745 words\n categories : 26% rec.motorcycles, 23% rec.autos, 13% misc.forsale\n words : bike, car, dod, engine, motorcycle, ride, honda, cars, bmw, bikes\n\n bicluster 4 : 12 documents, 155 words\n categories : 100% rec.sport.hockey\n words : scorer, unassisted, reichel, semak, sweeney, kovalenko, ricci, audette, momesso, nedved\n\n"
+"\n# Biclustering documents with the Spectral Co-clustering algorithm\n\n\nThis example demonstrates the Spectral Co-clustering algorithm on the\ntwenty newsgroups dataset. The 'comp.os.ms-windows.misc' category is\nexcluded because it contains many posts containing nothing but data.\n\nThe TF-IDF vectorized posts form a word frequency matrix, which is\nthen biclustered using Dhillon's Spectral Co-Clustering algorithm. The\nresulting document-word biclusters indicate subsets words used more\noften in those subsets documents.\n\nFor a few of the best biclusters, its most common document categories\nand its ten most important words get printed. The best biclusters are\ndetermined by their normalized cut. The best words are determined by\ncomparing their sums inside and outside the bicluster.\n\nFor comparison, the documents are also clustered using\nMiniBatchKMeans. The document clusters derived from the biclusters\nachieve a better V-measure than clusters found by MiniBatchKMeans.\n\nOutput::\n\n Vectorizing...\n Coclustering...\n Done in 9.53s. V-measure: 0.4455\n MiniBatchKMeans...\n Done in 12.00s. V-measure: 0.3309\n\n Best biclusters:\n ----------------\n bicluster 0 : 1951 documents, 4373 words\n categories : 23% talk.politics.guns, 19% talk.politics.misc, 14% sci.med\n words : gun, guns, geb, banks, firearms, drugs, gordon, clinton, cdt, amendment\n\n bicluster 1 : 1165 documents, 3304 words\n categories : 29% talk.politics.mideast, 26% soc.religion.christian, 25% alt.atheism\n words : god, jesus, christians, atheists, kent, sin, morality, belief, resurrection, marriage\n\n bicluster 2 : 2219 documents, 2830 words\n categories : 18% comp.sys.mac.hardware, 16% comp.sys.ibm.pc.hardware, 16% comp.graphics\n words : voltage, dsp, board, receiver, circuit, shipping, packages, stereo, compression, package\n\n bicluster 3 : 1860 documents, 2745 words\n categories : 26% rec.motorcycles, 23% rec.autos, 13% misc.forsale\n words : bike, car, dod, engine, motorcycle, ride, honda, cars, bmw, bikes\n\n bicluster 4 : 12 documents, 155 words\n categories : 100% rec.sport.hockey\n words : scorer, unassisted, reichel, semak, sweeney, kovalenko, ricci, audette, momesso, nedved\n\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/digits_classification_exercise.ipynb

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Digits Classification Exercise\n\n\nA tutorial exercise regarding the use of classification techniques on\nthe Digits dataset.\n\nThis exercise is used in the :ref:`clf_tut` part of the\n:ref:`supervised_learning_tut` section of the\n:ref:`stat_learn_tut_index`.\n"
+"\n# Digits Classification Exercise\n\n\nA tutorial exercise regarding the use of classification techniques on\nthe Digits dataset.\n\nThis exercise is used in the `clf_tut` part of the\n`supervised_learning_tut` section of the\n`stat_learn_tut_index`.\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/document_classification_20newsgroups.ipynb

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Classification of text documents using sparse features\n\n\nThis is an example showing how scikit-learn can be used to classify documents\nby topics using a bag-of-words approach. This example uses a scipy.sparse\nmatrix to store the features and demonstrates various classifiers that can\nefficiently handle sparse matrices.\n\nThe dataset used in this example is the 20 newsgroups dataset. It will be\nautomatically downloaded, then cached.\n\nThe bar plot indicates the accuracy, training time (normalized) and test time\n(normalized) of each classifier.\n\n"
+"\n# Classification of text documents using sparse features\n\n\nThis is an example showing how scikit-learn can be used to classify documents\nby topics using a bag-of-words approach. This example uses a scipy.sparse\nmatrix to store the features and demonstrates various classifiers that can\nefficiently handle sparse matrices.\n\nThe dataset used in this example is the 20 newsgroups dataset. It will be\nautomatically downloaded, then cached.\n\nThe bar plot indicates the accuracy, training time (normalized) and test time\n(normalized) of each classifier.\n\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -33,7 +33,7 @@
 },
 {
 "source": [
-"Load some categories from the training set\n"
+"Load some categories from the training set\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -51,7 +51,7 @@
 },
 {
 "source": [
-"Benchmark classifiers\n"
+"Benchmark classifiers\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/document_clustering.ipynb

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Clustering text documents using k-means\n\n\nThis is an example showing how the scikit-learn can be used to cluster\ndocuments by topics using a bag-of-words approach. This example uses\na scipy.sparse matrix to store the features instead of standard numpy arrays.\n\nTwo feature extraction methods can be used in this example:\n\n - TfidfVectorizer uses a in-memory vocabulary (a python dict) to map the most\n frequent words to features indices and hence compute a word occurrence\n frequency (sparse) matrix. The word frequencies are then reweighted using\n the Inverse Document Frequency (IDF) vector collected feature-wise over\n the corpus.\n\n - HashingVectorizer hashes word occurrences to a fixed dimensional space,\n possibly with collisions. The word count vectors are then normalized to\n each have l2-norm equal to one (projected to the euclidean unit-ball) which\n seems to be important for k-means to work in high dimensional space.\n\n HashingVectorizer does not provide IDF weighting as this is a stateless\n model (the fit method does nothing). When IDF weighting is needed it can\n be added by pipelining its output to a TfidfTransformer instance.\n\nTwo algorithms are demoed: ordinary k-means and its more scalable cousin\nminibatch k-means.\n\nAdditionally, latent semantic analysis can also be used to reduce dimensionality\nand discover latent patterns in the data. \n\nIt can be noted that k-means (and minibatch k-means) are very sensitive to\nfeature scaling and that in this case the IDF weighting helps improve the\nquality of the clustering by quite a lot as measured against the \"ground truth\"\nprovided by the class label assignments of the 20 newsgroups dataset.\n\nThis improvement is not visible in the Silhouette Coefficient which is small\nfor both as this measure seem to suffer from the phenomenon called\n\"Concentration of Measure\" or \"Curse of Dimensionality\" for high dimensional\ndatasets such as text data. Other measures such as V-measure and Adjusted Rand\nIndex are information theoretic based evaluation scores: as they are only based\non cluster assignments rather than distances, hence not affected by the curse\nof dimensionality.\n\nNote: as k-means is optimizing a non-convex objective function, it will likely\nend up in a local optimum. Several runs with independent random init might be\nnecessary to get a good convergence.\n\n"
+"\n# Clustering text documents using k-means\n\n\nThis is an example showing how the scikit-learn can be used to cluster\ndocuments by topics using a bag-of-words approach. This example uses\na scipy.sparse matrix to store the features instead of standard numpy arrays.\n\nTwo feature extraction methods can be used in this example:\n\n - TfidfVectorizer uses a in-memory vocabulary (a python dict) to map the most\n frequent words to features indices and hence compute a word occurrence\n frequency (sparse) matrix. The word frequencies are then reweighted using\n the Inverse Document Frequency (IDF) vector collected feature-wise over\n the corpus.\n\n - HashingVectorizer hashes word occurrences to a fixed dimensional space,\n possibly with collisions. The word count vectors are then normalized to\n each have l2-norm equal to one (projected to the euclidean unit-ball) which\n seems to be important for k-means to work in high dimensional space.\n\n HashingVectorizer does not provide IDF weighting as this is a stateless\n model (the fit method does nothing). When IDF weighting is needed it can\n be added by pipelining its output to a TfidfTransformer instance.\n\nTwo algorithms are demoed: ordinary k-means and its more scalable cousin\nminibatch k-means.\n\nAdditionally, latent semantic analysis can also be used to reduce dimensionality\nand discover latent patterns in the data. \n\nIt can be noted that k-means (and minibatch k-means) are very sensitive to\nfeature scaling and that in this case the IDF weighting helps improve the\nquality of the clustering by quite a lot as measured against the \"ground truth\"\nprovided by the class label assignments of the 20 newsgroups dataset.\n\nThis improvement is not visible in the Silhouette Coefficient which is small\nfor both as this measure seem to suffer from the phenomenon called\n\"Concentration of Measure\" or \"Curse of Dimensionality\" for high dimensional\ndatasets such as text data. Other measures such as V-measure and Adjusted Rand\nIndex are information theoretic based evaluation scores: as they are only based\non cluster assignments rather than distances, hence not affected by the curse\nof dimensionality.\n\nNote: as k-means is optimizing a non-convex objective function, it will likely\nend up in a local optimum. Several runs with independent random init might be\nnecessary to get a good convergence.\n\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -33,7 +33,7 @@
 },
 {
 "source": [
-"Load some categories from the training set\n"
+"Load some categories from the training set\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -51,7 7,7 @@
 },
 {
 "source": [
-"Do the actual clustering\n"
+"Do the actual clustering\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/face_recognition.ipynb

Lines changed: 7 additions & 7 deletions
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Faces recognition example using eigenfaces and SVMs\n\n\nThe dataset used in this example is a preprocessed excerpt of the\n\"Labeled Faces in the Wild\", aka LFW_:\n\n http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)\n\n.. _LFW: http://vis-www.cs.umass.edu/lfw/\n\nExpected results for the top 5 most represented people in the dataset:\n\n================== ============ ======= ========== =======\n precision recall f1-score support\n================== ============ ======= ========== =======\n Ariel Sharon 0.67 0.92 0.77 13\n Colin Powell 0.75 0.78 0.76 60\n Donald Rumsfeld 0.78 0.67 0.72 27\n George W Bush 0.86 0.86 0.86 146\nGerhard Schroeder 0.76 0.76 0.76 25\n Hugo Chavez 0.67 0.67 0.67 15\n Tony Blair 0.81 0.69 0.75 36\n\n avg / total 0.80 0.80 0.80 322\n================== ============ ======= ========== =======\n\n"
+"\n# Faces recognition example using eigenfaces and SVMs\n\n\nThe dataset used in this example is a preprocessed excerpt of the\n\"Labeled Faces in the Wild\", aka LFW_:\n\n http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)\n\n\nExpected results for the top 5 most represented people in the dataset:\n\n================== ============ ======= ========== =======\n precision recall f1-score support\n================== ============ ======= ========== =======\n Ariel Sharon 0.67 0.92 0.77 13\n Colin Powell 0.75 0.78 0.76 60\n Donald Rumsfeld 0.78 0.67 0.72 27\n George W Bush 0.86 0.86 0.86 146\nGerhard Schroeder 0.76 0.76 0.76 25\n Hugo Chavez 0.67 0.67 0.67 15\n Tony Blair 0.81 0.69 0.75 36\n\n avg / total 0.80 0.80 0.80 322\n================== ============ ======= ========== =======\n\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -33,7 +33,7 @@
 },
 {
 "source": [
-"Download the data, if not already on disk and load it as numpy arrays\n"
+"Download the data, if not already on disk and load it as numpy arrays\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -51,7 +51,7 @@
 },
 {
 "source": [
-"Split into a training set and a test set using a stratified k fold\n"
+"Split into a training set and a test set using a stratified k fold\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -69,7 +69,7 @@
 },
 {
 "source": [
-"Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled\ndataset): unsupervised feature extraction / dimensionality reduction\n"
+"Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled\ndataset): unsupervised feature extraction / dimensionality reduction\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -87,7 +87,7 @@
 },
 {
 "source": [
-"Train a SVM classification model\n"
+"Train a SVM classification model\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -105,7 +105,7 @@
 },
 {
 "source": [
-"Quantitative evaluation of the model quality on the test set\n"
+"Quantitative evaluation of the model quality on the test set\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -123,7 +123,7 @@
 },
 {
 "source": [
-"Qualitative evaluation of the predictions using matplotlib\n"
+"Qualitative evaluation of the predictions using matplotlib\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/feature_selection_pipeline.ipynb

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Pipeline Anova SVM\n\n\nSimple usage of Pipeline that runs successively a univariate\nfeature selection with anova and then a C-SVM of the selected features.\n"
+"\n# Pipeline Anova SVM\n\n\nSimple usage of Pipeline that runs successively a univariate\nfeature selection with anova and then a C-SVM of the selected features.\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/feature_stacker.ipynb

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Concatenating multiple feature extraction methods\n\n\nIn many real-world examples, there are many ways to extract features from a\ndataset. Often it is beneficial to combine several methods to obtain good\nperformance. This example shows how to use ``FeatureUnion`` to combine\nfeatures obtained by PCA and univariate selection.\n\nCombining features using this transformer has the benefit that it allows\ncross validation and grid searches over the whole process.\n\nThe combination used in this example is not particularly helpful on this\ndataset and is only used to illustrate the usage of FeatureUnion.\n"
+"\n# Concatenating multiple feature extraction methods\n\n\nIn many real-world examples, there are many ways to extract features from a\ndataset. Often it is beneficial to combine several methods to obtain good\nperformance. This example shows how to use ``FeatureUnion`` to combine\nfeatures obtained by PCA and univariate selection.\n\nCombining features using this transformer has the benefit that it allows\ncross validation and grid searches over the whole process.\n\nThe combination used in this example is not particularly helpful on this\ndataset and is only used to illustrate the usage of FeatureUnion.\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

dev/_downloads/grid_search_digits.ipynb

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 },
 {
 "source": [
-"\n# Parameter estimation using grid search with cross-validation\n\n\nThis examples shows how a classifier is optimized by cross-validation,\nwhich is done using the :class:`sklearn.model_selection.GridSearchCV` object\non a development set that comprises only half of the available labeled data.\n\nThe performance of the selected hyper-parameters and trained model is\nthen measured on a dedicated evaluation set that was not used during\nthe model selection step.\n\nMore details on tools available for model selection can be found in the\nsections on :ref:`cross_validation` and :ref:`grid_search`.\n\n"
+"\n# Parameter estimation using grid search with cross-validation\n\n\nThis examples shows how a classifier is optimized by cross-validation,\nwhich is done using the :class:`sklearn.model_selection.GridSearchCV` object\non a development set that comprises only half of the available labeled data.\n\nThe performance of the selected hyper-parameters and trained model is\nthen measured on a dedicated evaluation set that was not used during\nthe model selection step.\n\nMore details on tools available for model selection can be found in the\nsections on `cross_validation` and `grid_search`.\n\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}

0 commit comments