From fef80e5f3a131d5892a4f603cdd5acea1efa9a15 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?R=C3=A9my=20L=C3=A9one?= Date: Thu, 3 Dec 2015 00:16:40 +0100 Subject: [PATCH 1/4] Fix broken links --- AUTHORS.rst | 4 ++-- doc/about.rst | 6 +++--- doc/developers/advanced_installation.rst | 4 ++-- doc/developers/performance.rst | 2 +- doc/modules/clustering.rst | 2 +- doc/modules/decomposition.rst | 2 +- doc/modules/gaussian_process.rst | 2 +- doc/modules/kernel_approximation.rst | 6 +++--- doc/modules/linear_model.rst | 2 +- doc/modules/manifold.rst | 2 +- doc/modules/model_evaluation.rst | 2 +- doc/modules/neighbors.rst | 2 +- doc/modules/sgd.rst | 2 +- doc/presentations.rst | 4 ++-- .../statistical_inference/finding_help.rst | 8 ++++---- doc/whats_new.rst | 18 +++++++++--------- .../plot_species_distribution_modeling.py | 2 +- examples/neighbors/plot_species_kde.py | 2 +- sklearn/covariance/shrunk_covariance_.py | 4 ++-- sklearn/datasets/descr/breast_cancer.rst | 2 -- sklearn/datasets/descr/linnerud.rst | 1 - sklearn/gaussian_process/gaussian_process.py | 2 +- sklearn/linear_model/theil_sen.py | 2 +- 23 files changed, 40 insertions(+), 43 deletions(-) diff --git a/AUTHORS.rst b/AUTHORS.rst index 0f15c5b328378..a7ce35d29ddc7 100644 --- a/AUTHORS.rst +++ b/AUTHORS.rst @@ -51,7 +51,7 @@ People * Ron Weiss * `Virgile Fritsch - `_ + `_ * `Mathieu Blondel `_ @@ -90,4 +90,4 @@ People * `Kemal Eren `_ - * `Michael Becker `_ + * `Michael Becker `_ diff --git a/doc/about.rst b/doc/about.rst index 9f40772f2c7ad..f05ebeb19b20c 100644 --- a/doc/about.rst +++ b/doc/about.rst @@ -63,7 +63,7 @@ High quality PNG and SVG logos are available in the `doc/logos/ `_ actively supports this project. It has +`INRIA `_ actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler (2012-2013) and Olivier Grisel (2013-2015) to work on this project full-time. It also hosts coding sprints and other events. @@ -121,10 +121,10 @@ Donating to the project ~~~~~~~~~~~~~~~~~~~~~~~ If you are interested in donating to the project or to one of our code-sprints, you can use -the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). +the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). All donations will be handled by `NumFOCUS -`_, a non-profit-organization which is +`_, a non-profit-organization which is managed by a board of `Scipy community members `_. NumFOCUS's mission is to foster scientific computing software, in particular in Python. As a fiscal home diff --git a/doc/developers/advanced_installation.rst b/doc/developers/advanced_installation.rst index d85619c8cb24a..382de2fc51626 100644 --- a/doc/developers/advanced_installation.rst +++ b/doc/developers/advanced_installation.rst @@ -168,7 +168,7 @@ first, you need to install `numpy `_ and `scipy wheel packages (.whl files) for scikit-learn from `pypi `_ can be installed with the `pip -`_ utility. +`_ utility. open a console and type the following to install or upgrade scikit-learn to the latest stable release:: @@ -379,7 +379,7 @@ testing scikit-learn once installed ----------------------------------- testing requires having the `nose -`_ library. after +`_ library. 
after installation, the package can be tested by executing *from outside* the source directory:: diff --git a/doc/developers/performance.rst b/doc/developers/performance.rst index 3b176ebf09a97..6d738d8b3ddd1 100644 --- a/doc/developers/performance.rst +++ b/doc/developers/performance.rst @@ -401,7 +401,7 @@ project. TODO: html report, type declarations, bound checks, division by zero checks, memory alignment, direct blas calls... -- http://www.euroscipy.org/file/3696?vid=download +- https://www.youtube.com/watch?v=gMvkiQ-gOW8 - http://conference.scipy.org/proceedings/SciPy2009/paper_1/ - http://conference.scipy.org/proceedings/SciPy2009/paper_2/ diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index e2a7f97e2804d..19a08b7fb6428 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -1158,7 +1158,7 @@ calculated using a similar form to that of the adjusted Rand index: * Vinh, Epps, and Bailey, (2009). "Information theoretic measures for clusterings comparison". Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09. - `doi:10.1145/1553374.1553511 `_. + `doi:10.1145/1553374.1553511 `_. ISBN 9781605585161. * Vinh, Epps, and Bailey, (2010). Information Theoretic Measures for diff --git a/doc/modules/decomposition.rst b/doc/modules/decomposition.rst index f10e105664c8b..bc81545b8312e 100644 --- a/doc/modules/decomposition.rst +++ b/doc/modules/decomposition.rst @@ -732,7 +732,7 @@ and the regularized objective function is: .. topic:: References: * `"Learning the parts of objects by non-negative matrix factorization" - `_ + `_ D. Lee, S. Seung, 1999 * `"Non-negative Matrix Factorization with Sparseness Constraints" diff --git a/doc/modules/gaussian_process.rst b/doc/modules/gaussian_process.rst index efe9ad862eed2..44e4eec877529 100644 --- a/doc/modules/gaussian_process.rst +++ b/doc/modules/gaussian_process.rst @@ -887,7 +887,7 @@ toolbox. .. topic:: References: * `DACE, A Matlab Kriging Toolbox - `_ S Lophaven, HB Nielsen, J + `_ S Lophaven, HB Nielsen, J Sondergaard 2002, * W.J. Welch, R.J. Buck, J. Sacks, H.P. Wynn, T.J. Mitchell, and M.D. diff --git a/doc/modules/kernel_approximation.rst b/doc/modules/kernel_approximation.rst index 80da380746514..063e79a27b471 100644 --- a/doc/modules/kernel_approximation.rst +++ b/doc/modules/kernel_approximation.rst @@ -196,12 +196,12 @@ or store training examples. `_ Rahimi, A. and Recht, B. - Advances in neural information processing 2007, .. [LS2010] `"Random Fourier approximations for skewed multiplicative histogram kernels" - `_ + `_ Random Fourier approximations for skewed multiplicative histogram kernels - Lecture Notes for Computer Sciencd (DAGM) .. [VZ2010] `"Efficient additive kernels via explicit feature maps" - `_ + `_ Vedaldi, A. and Zisserman, A. - Computer Vision and Pattern Recognition 2010 .. [VVZ2010] `"Generalized RBF feature maps for Efficient Detection" - `_ + `_ Vempati, S. and Vedaldi, A. and Zisserman, A. and Jawahar, CV - 2010 diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index f827b4a220666..26efb04761a0b 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -1046,7 +1046,7 @@ considering only a random subset of all possible combinations. .. topic:: References: - .. [#f1] Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang: `Theil-Sen Estimators in a Multiple Linear Regression Model. `_ + .. 
[#f1] Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang: `Theil-Sen Estimators in a Multiple Linear Regression Model. `_ .. [#f2] T. Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. `_ diff --git a/doc/modules/manifold.rst b/doc/modules/manifold.rst index 09a3ba222ca6c..7fea314f0f6e5 100644 --- a/doc/modules/manifold.rst +++ b/doc/modules/manifold.rst @@ -604,7 +604,7 @@ the internal structure of the data. van der Maaten, L.J.P. * `"Accelerating t-SNE using Tree-Based Algorithms." - `_ + `_ L.J.P. van der Maaten. Journal of Machine Learning Research 15(Oct):3221-3245, 2014. Tips on practical use diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index d311e60c9f12f..020908d4f4d0f 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -944,7 +944,7 @@ operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in one number. For more information see the `Wikipedia article on AUC -`_. +`_. >>> import numpy as np >>> from sklearn.metrics import roc_auc_score diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst index 189f05e22df8b..1756ea1866994 100644 --- a/doc/modules/neighbors.rst +++ b/doc/modules/neighbors.rst @@ -685,6 +685,6 @@ candidates, the speedup compared to brute force search is approximately '06. 47th Annual IEEE Symposium * `“LSH Forest: Self-Tuning Indexes for Similarity Search” - `_, + `_, Bawa, M., Condie, T., Ganesan, P., WWW '05 Proceedings of the 14th international conference on World Wide Web Pages 651-660 diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst index fe5f32a180b74..862fbe914537b 100644 --- a/doc/modules/sgd.rst +++ b/doc/modules/sgd.rst @@ -212,7 +212,7 @@ Stochastic Gradient Descent for sparse data intercept. There is built-in support for sparse data given in any matrix in a format -supported by `scipy.sparse `_. For maximum efficiency, however, use the CSR +supported by `scipy.sparse `_. For maximum efficiency, however, use the CSR matrix format as defined in `scipy.sparse.csr_matrix `_. diff --git a/doc/presentations.rst b/doc/presentations.rst index 4a0c08546e436..e1e34e72cf859 100644 --- a/doc/presentations.rst +++ b/doc/presentations.rst @@ -20,7 +20,7 @@ There are several online tutorials available which are geared toward specific subject areas: - `Machine Learning for NeuroImaging in Python `_ -- `Machine Learning for Astronomical Data Analysis `_ +- `Machine Learning for Astronomical Data Analysis `_ .. _videos: @@ -50,7 +50,7 @@ Videos section :ref:`stat_learn_tut_index`. - `Statistical Learning for Text Classification with scikit-learn and NLTK - `_ + `_ (and `slides `_) by `Olivier Grisel`_ at PyCon 2011 diff --git a/doc/tutorial/statistical_inference/finding_help.rst b/doc/tutorial/statistical_inference/finding_help.rst index 96e1ebd790723..0587a19ad85ba 100644 --- a/doc/tutorial/statistical_inference/finding_help.rst +++ b/doc/tutorial/statistical_inference/finding_help.rst @@ -7,7 +7,7 @@ The project mailing list If you encounter a bug with ``scikit-learn`` or something that needs clarification in the docstring or the online documentation, please feel free to -ask on the `Mailing List `_ +ask on the `Mailing List `_ Q&A communities with Machine Learning practitioners @@ -35,8 +35,8 @@ Q&A communities with Machine Learning practitioners .. 
_`good freely available textbooks on machine learning`: http://metaoptimize.com/qa/questions/186/good-freely-available-textbooks-on-machine-learning -.. _`What are some good resources for learning about machine learning`: http://www.quora.com/What-are-some-good-resources-for-learning-about-machine-learning +.. _`How do I learn machine learning?`: https://www.quora.com/How-do-I-learn-machine-learning-1 --- _'An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford': https://www.coursera.org/course/ml +-- _'An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford': https://www.coursera.org/learn/machine-learning --- _'Another excellent free online course that takes a more general approach to Artificial Intelligence':http://www.udacity.com/overview/Course/cs271/CourseRev/1 +-- _'Another excellent free online course that takes a more general approach to Artificial Intelligence': https://www.udacity.com/course/intro-to-artificial-intelligence--cs271 diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 9df926ccb92d1..5d3a8b8772618 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -3788,27 +3788,27 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Vlad Niculae: http://vene.ro -.. _Edouard Duchesnay: http://www.lnao.fr/spip.php?rubrique30 +.. _Edouard Duchesnay: https://sites.google.com/site/duchesnay/home .. _Peter Prettenhofer: http://sites.google.com/site/peterprettenhofer/ .. _Alexandre Passos: http://atpassos.me -.. _Nicolas Pinto: http://pinto.scripts.mit.edu/ +.. _Nicolas Pinto: https://twitter.com/npinto -.. _Virgile Fritsch: http://parietal.saclay.inria.fr/Members/virgile-fritsch +.. _Virgile Fritsch: https://github.com/VirgileFritsch -.. _Bertrand Thirion: http://parietal.saclay.inria.fr/Members/bertrand-thirion +.. _Bertrand Thirion: https://team.inria.fr/parietal/bertrand-thirions-page .. _Andreas Müller: http://peekaboo-vision.blogspot.com -.. _Matthieu Perrot: http://www.lnao.fr/spip.php?rubrique19 +.. _Matthieu Perrot: http://brainvisa.info/biblio/lnao/en/Author/PERROT-M.html .. _Jake Vanderplas: http://www.astro.washington.edu/users/vanderplas/ .. _Gilles Louppe: http://www.montefiore.ulg.ac.be/~glouppe/ -.. _INRIA: http://inria.fr +.. _INRIA: http://www.inria.fr .. _Parietal Team: http://parietal.saclay.inria.fr/ @@ -3890,7 +3890,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Nikolay Mayorov: https://github.com/nmayorov -.. _Jatin Shah: http://jatinshah.org/ +.. _Jatin Shah: https://github.com/jatinshah .. _Dougal Sutherland: https://github.com/dougalsutherland @@ -3904,7 +3904,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Florian Wilhelm: https://github.com/FlorianWilhelm -.. _Fares Hedyati: https://github.com/fareshedyati +.. _Fares Hedyati: http://www.eecs.berkeley.edu/~fareshed .. _Matt Pico: https://github.com/MattpSoftware @@ -3914,7 +3914,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Clemens Brunner: https://github.com/cle1109 -.. _Martin Billinger: https://github.com/kazemakase +.. _Martin Billinger: http://tnsre.embs.org/author/martinbillinger .. 
_Matteo Visconti di Oleggio Castello: http://www.mvdoc.me diff --git a/examples/applications/plot_species_distribution_modeling.py b/examples/applications/plot_species_distribution_modeling.py index d327a086c6722..3bbc580b017c5 100644 --- a/examples/applications/plot_species_distribution_modeling.py +++ b/examples/applications/plot_species_distribution_modeling.py @@ -13,7 +13,7 @@ by the package `sklearn.svm` as our modeling tool. The dataset is provided by Phillips et. al. (2006). If available, the example uses -`basemap `_ +`basemap `_ to plot the coast lines and national boundaries of South America. The two species are: diff --git a/examples/neighbors/plot_species_kde.py b/examples/neighbors/plot_species_kde.py index 95f4417ce1bca..c582d76a9bf69 100644 --- a/examples/neighbors/plot_species_kde.py +++ b/examples/neighbors/plot_species_kde.py @@ -7,7 +7,7 @@ Haversine distance metric -- i.e. distances over points in latitude/longitude. The dataset is provided by Phillips et. al. (2006). If available, the example uses -`basemap `_ +`basemap `_ to plot the coast lines and national boundaries of South America. This example does not perform any learning over the data diff --git a/sklearn/covariance/shrunk_covariance_.py b/sklearn/covariance/shrunk_covariance_.py index 21929084c4e1a..a84ad808a1005 100644 --- a/sklearn/covariance/shrunk_covariance_.py +++ b/sklearn/covariance/shrunk_covariance_.py @@ -436,7 +436,7 @@ def oas(X, assume_centered=False): The formula we used to implement the OAS does not correspond to the one given in the article. It has been taken from the MATLAB program available from the author's webpage - (https://tbayes.eecs.umich.edu/yilun/covestimation). + (http://tbayes.eecs.umich.edu/yilun/covestimation). """ X = np.asarray(X) @@ -480,7 +480,7 @@ class OAS(EmpiricalCovariance): The formula used here does not correspond to the one given in the article. It has been taken from the Matlab program available from the - authors' webpage (https://tbayes.eecs.umich.edu/yilun/covestimation). + authors' webpage (http://tbayes.eecs.umich.edu/yilun/covestimation). Parameters ---------- diff --git a/sklearn/datasets/descr/breast_cancer.rst b/sklearn/datasets/descr/breast_cancer.rst index cb652b7f13168..8e12472941a66 100644 --- a/sklearn/datasets/descr/breast_cancer.rst +++ b/sklearn/datasets/descr/breast_cancer.rst @@ -81,8 +81,6 @@ https://goo.gl/U2Uwz2 Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. -A few of the images can be found at -http://www.cs.wisc.edu/~street/images/ Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree diff --git a/sklearn/datasets/descr/linnerud.rst b/sklearn/datasets/descr/linnerud.rst index 6e5a9b94cf6bf..d790d3c0c9086 100644 --- a/sklearn/datasets/descr/linnerud.rst +++ b/sklearn/datasets/descr/linnerud.rst @@ -18,5 +18,4 @@ The Linnerud dataset constains two small dataset: References ---------- - * http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=mixOmics:linnerud * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic. diff --git a/sklearn/gaussian_process/gaussian_process.py b/sklearn/gaussian_process/gaussian_process.py index 0c8f103a88965..b41c8a193864a 100644 --- a/sklearn/gaussian_process/gaussian_process.py +++ b/sklearn/gaussian_process/gaussian_process.py @@ -203,7 +203,7 @@ class GaussianProcess(BaseEstimator, RegressorMixin): .. 
[NLNS2002] `H.B. Nielsen, S.N. Lophaven, H. B. Nielsen and J. Sondergaard. DACE - A MATLAB Kriging Toolbox.` (2002) - http://www2.imm.dtu.dk/~hbn/dace/dace.pdf + http://imedea.uib-csic.es/master/cambioglobal/Modulo_V_cod101615/Lab/lab_maps/krigging/DACE-krigingsoft/dace/dace.pdf .. [WBSWM1992] `W.J. Welch, R.J. Buck, J. Sacks, H.P. Wynn, T.J. Mitchell, and M.D. Morris (1992). Screening, predicting, and computer diff --git a/sklearn/linear_model/theil_sen.py b/sklearn/linear_model/theil_sen.py index b4204a381974e..0764304559ddd 100644 --- a/sklearn/linear_model/theil_sen.py +++ b/sklearn/linear_model/theil_sen.py @@ -276,7 +276,7 @@ class TheilSenRegressor(LinearModel, RegressorMixin): ---------- - Theil-Sen Estimators in a Multiple Linear Regression Model, 2009 Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang - http://www.math.iupui.edu/~hpeng/MTSE_0908.pdf + http://home.olemiss.edu/~xdang/papers/MTSE.pdf """ def __init__(self, fit_intercept=True, copy_X=True, From 09672f516d8592fb82f42e5da3ee0f29210d7366 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?R=C3=A9my=20L=C3=A9one?= Date: Thu, 3 Dec 2015 13:46:15 +0100 Subject: [PATCH 2/4] Fix redirection problems --- AUTHORS.rst | 14 +++---- CONTRIBUTING.md | 4 +- doc/README | 2 +- doc/about.rst | 14 +++---- doc/datasets/twenty_newsgroups.rst | 2 +- doc/developers/advanced_installation.rst | 12 +++--- doc/developers/contributing.rst | 24 ++++++------ doc/developers/performance.rst | 12 +++--- doc/developers/utilities.rst | 2 +- doc/install.rst | 9 ++--- doc/modules/clustering.rst | 2 +- doc/modules/computational_performance.rst | 6 +-- doc/modules/cross_validation.rst | 6 +-- doc/modules/density.rst | 2 +- doc/modules/ensemble.rst | 2 +- doc/modules/feature_extraction.rst | 4 +- doc/modules/feature_selection.rst | 6 +-- doc/modules/kernel_approximation.rst | 2 +- doc/modules/label_propagation.rst | 2 +- doc/modules/learning_curve.rst | 2 +- doc/modules/linear_model.rst | 22 +++++------ doc/modules/manifold.rst | 10 ++--- doc/modules/mixture.rst | 4 +- doc/modules/model_evaluation.rst | 38 +++++++++---------- doc/modules/model_persistence.rst | 2 +- doc/modules/neural_networks_supervised.rst | 6 +-- doc/modules/preprocessing.rst | 6 +-- doc/modules/random_projection.rst | 4 +- doc/modules/sgd.rst | 4 +- doc/modules/svm.rst | 2 +- doc/modules/tree.rst | 8 ++-- doc/presentations.rst | 12 +++--- doc/related_projects.rst | 2 +- doc/testimonials/testimonials.rst | 20 +++++----- doc/tutorial/basic/tutorial.rst | 26 ++++++------- .../statistical_inference/finding_help.rst | 2 +- doc/tutorial/statistical_inference/index.rst | 6 +-- .../supervised_learning.rst | 14 +++---- .../unsupervised_learning.rst | 2 +- .../data/languages/fetch_data.py | 2 +- .../text_analytics/working_with_text_data.rst | 4 +- doc/whats_new.rst | 28 +++++++------- .../plot_species_distribution_modeling.py | 4 +- .../wikipedia_principal_eigenvector.py | 4 +- examples/calibration/plot_calibration.py | 2 +- examples/datasets/plot_iris_dataset.py | 2 +- examples/decomposition/plot_pca_iris.py | 2 +- examples/linear_model/plot_iris_logistic.py | 2 +- examples/manifold/plot_manifold_sphere.py | 2 +- examples/plot_johnson_lindenstrauss_bound.py | 2 +- sklearn/datasets/samples_generator.py | 2 +- .../feature_selection/univariate_selection.py | 2 +- sklearn/isotonic.py | 4 +- sklearn/linear_model/least_angle.py | 10 ++--- sklearn/linear_model/perceptron.py | 2 +- sklearn/linear_model/ransac.py | 2 +- sklearn/linear_model/sgd_fast.pyx | 2 +- sklearn/manifold/spectral_embedding_.py | 2 +- 
sklearn/metrics/classification.py | 20 +++++----- sklearn/metrics/cluster/supervised.py | 6 +-- sklearn/metrics/cluster/unsupervised.py | 4 +- sklearn/metrics/ranking.py | 6 +-- sklearn/metrics/regression.py | 2 +- sklearn/neighbors/classification.py | 4 +- sklearn/neighbors/regression.py | 4 +- sklearn/neighbors/unsupervised.py | 2 +- sklearn/preprocessing/data.py | 4 +- sklearn/random_projection.py | 6 +-- sklearn/tree/tree.py | 4 +- sklearn/utils/linear_assignment_.py | 2 +- 70 files changed, 229 insertions(+), 234 deletions(-) diff --git a/AUTHORS.rst b/AUTHORS.rst index a7ce35d29ddc7..79fa5eef6eaf5 100644 --- a/AUTHORS.rst +++ b/AUTHORS.rst @@ -28,15 +28,15 @@ People * `Matthieu Brucher `_ - * `Fabian Pedregosa `_ + * `Fabian Pedregosa `_ - * `Gael Varoquaux `_ + * `Gael Varoquaux `_ - * `Jake VanderPlas `_ + * `Jake VanderPlas `_ * `Alexandre Gramfort `_ - * `Olivier Grisel `_ + * `Olivier Grisel `_ * Bertrand Thirion @@ -56,7 +56,7 @@ People * `Mathieu Blondel `_ * `Peter Prettenhofer - `_ + `_ * Vincent Dubourg @@ -74,7 +74,7 @@ People * Nelle Varoquaux - * `Brian Holt `_ + * `Brian Holt `_ * Robert Layton @@ -84,7 +84,7 @@ People * `Satra Ghosh `_ - * `Wei Li `_ + * `Wei Li `_ * `Arnaud Joly `_ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1caf0099fadcf..d325032c5c80c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -12,10 +12,10 @@ How to contribute ----------------- The preferred way to contribute to scikit-learn is to fork the -[main repository](http://github.com/scikit-learn/scikit-learn/) on +[main repository](https://github.com/scikit-learn/scikit-learn) on GitHub: -1. Fork the [project repository](http://github.com/scikit-learn/scikit-learn): +1. Fork the [project repository](https://github.com/scikit-learn/scikit-learn): click on the 'Fork' button near the top of the page. This creates a copy of the code under your account on the GitHub server. diff --git a/doc/README b/doc/README index b7fca3448f64e..5d416396f81ca 100644 --- a/doc/README +++ b/doc/README @@ -70,5 +70,5 @@ to update the http://scikit-learn.org/dev tree of the website. The configuration of this server is managed at: - http://github.com/scikit-learn/sklearn-docbuilder + https://github.com/scikit-learn/sklearn-docbuilder diff --git a/doc/about.rst b/doc/about.rst index f05ebeb19b20c..69bea48faac85 100644 --- a/doc/about.rst +++ b/doc/about.rst @@ -88,9 +88,9 @@ Environment also funds several students to work on the project part-time. :width: 200pt :align: center -The following students were sponsored by `Google `_ +The following students were sponsored by `Google `_ to work on scikit-learn through the -`Google Summer of Code `_ +`Google Summer of Code `_ program. - 2007 - David Cournapeau @@ -102,14 +102,14 @@ program. It also provided funding for sprints and events around scikit-learn. If you would like to participate in the next Google Summer of code program, please see `this page -`_ +`_ The `NeuroDebian `_ project providing `Debian `_ packaging and contributions is supported by `Dr. James V. Haxby `_ (`Dartmouth -College `_). +College `_). -The `PSF `_ helped find and manage funding for our +The `PSF `_ helped find and manage funding for our 2011 Granada sprint. More information can be found `here `__ @@ -124,9 +124,9 @@ If you are interested in donating to the project or to one of our code-sprints, the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). 
All donations will be handled by `NumFOCUS -`_, a non-profit-organization which is +`_, a non-profit-organization which is managed by a board of `Scipy community members -`_. NumFOCUS's mission is to foster +`_. NumFOCUS's mission is to foster scientific computing software, in particular in Python. As a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available while in compliance with tax regulations. diff --git a/doc/datasets/twenty_newsgroups.rst b/doc/datasets/twenty_newsgroups.rst index 0a2f313934f50..01c2a53ff77e5 100644 --- a/doc/datasets/twenty_newsgroups.rst +++ b/doc/datasets/twenty_newsgroups.rst @@ -111,7 +111,7 @@ components by sample in a more than 30000-dimensional space ready-to-use tfidf features instead of file names. .. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/ -.. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf +.. _`TF-IDF`: https://en.wikipedia.org/wiki/Tf-idf Filtering text for more realistic training diff --git a/doc/developers/advanced_installation.rst b/doc/developers/advanced_installation.rst index 382de2fc51626..8b5a295675d0a 100644 --- a/doc/developers/advanced_installation.rst +++ b/doc/developers/advanced_installation.rst @@ -140,7 +140,7 @@ from source package ~~~~~~~~~~~~~~~~~~~ download the source package from -`pypi `_, +`pypi `_, , unpack the sources and cd into the source directory. this packages uses distutils, which is the default way of installing @@ -163,12 +163,12 @@ or alternatively (also from within the scikit-learn source folder):: windows ------- -first, you need to install `numpy `_ and `scipy +first, you need to install `numpy `_ and `scipy `_ from their own official installers. wheel packages (.whl files) for scikit-learn from `pypi `_ can be installed with the `pip -`_ utility. +`_ utility. open a console and type the following to install or upgrade scikit-learn to the latest stable release:: @@ -279,9 +279,9 @@ path environment variable. ------------- for 32-bit python it is possible use the standalone installers for -`microsoft visual c++ express 2008 `_ +`microsoft visual c++ express 2008 `_ for python 2 or -`microsoft visual c++ express 2010 `_ +`microsoft visual c++ express 2010 `_ or python 3. once installed you should be able to build scikit-learn without any @@ -379,7 +379,7 @@ testing scikit-learn once installed ----------------------------------- testing requires having the `nose -`_ library. after +`_ library. after installation, the package can be tested by executing *from outside* the source directory:: diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst index 604326b342982..ed89072d596a5 100644 --- a/doc/developers/contributing.rst +++ b/doc/developers/contributing.rst @@ -7,7 +7,7 @@ Contributing This project is a community effort, and everyone is welcome to contribute. -The project is hosted on http://github.com/scikit-learn/scikit-learn +The project is hosted on https://github.com/scikit-learn/scikit-learn Scikit-learn is somewhat :ref:`selective ` when it comes to adding new algorithms, and the best way to contribute and to help the project @@ -19,7 +19,7 @@ Submitting a bug report In case you experience issues using this package, do not hesitate to submit a ticket to the -`Bug Tracker `_. You are +`Bug Tracker `_. You are also welcome to post feature requests or pull requests. 
@@ -29,7 +29,7 @@ Retrieving the latest code ========================== We use `Git `_ for version control and -`GitHub `_ for hosting our main repository. +`GitHub `_ for hosting our main repository. You can check out the latest sources with the command:: @@ -82,14 +82,14 @@ How to contribute ----------------- The preferred way to contribute to scikit-learn is to fork the `main -repository `__ on GitHub, +repository `__ on GitHub, then submit a "pull request" (PR): - 1. `Create an account `_ on + 1. `Create an account `_ on GitHub if you do not already have one. 2. Fork the `project repository - `__: click on the 'Fork' + `__: click on the 'Fork' button near the top of the page. This creates a copy of the code under your account on the GitHub server. @@ -237,8 +237,8 @@ and are viewable in a web browser. See the README file in the doc/ directory for more information. For building the documentation, you will need `sphinx -`_, -`matplotlib `_ and +`_, +`matplotlib `_ and `pillow `_. **When you are writing documentation**, it is important to keep a good @@ -297,7 +297,7 @@ Finally, follow the formatting rules below to make it consistently good: Testing and improving test coverage ------------------------------------ -High-quality `unit testing `_ +High-quality `unit testing `_ is a corner-stone of the scikit-learn development process. For this purpose, we use the `nose `_ package. The tests are functions appropriately named, located in `tests` @@ -313,7 +313,7 @@ We expect code coverage of new features to be at least around 90%. .. note:: **Workflow to improve test coverage** To test code coverage, you need to install the `coverage - `_ package in addition to nose. + `_ package in addition to nose. 1. Run 'make test-coverage'. The output lists for each file the line numbers that are not tested. @@ -392,7 +392,7 @@ the review easier so new code can be integrated in less time. Uniformly formatted code makes it easier to share code ownership. The scikit-learn project tries to closely follow the official Python guidelines -detailed in `PEP8 `_ that +detailed in `PEP8 `_ that detail how code should be formatted and indented. Please read it and follow it. @@ -414,7 +414,7 @@ In addition, we add the following guidelines: * **Please don't use** ``import *`` **in any case**. It is considered harmful by the `official Python recommendations - `_. + `_. It makes the code harder to read as the origin of symbols is no longer explicitly referenced, but most important, it prevents using a static analysis tool like `pyflakes diff --git a/doc/developers/performance.rst b/doc/developers/performance.rst index 6d738d8b3ddd1..f2969e04b09c3 100644 --- a/doc/developers/performance.rst +++ b/doc/developers/performance.rst @@ -40,7 +40,7 @@ this means trying to **replace any nested for loops by calls to equivalent Numpy array methods**. The goal is to avoid the CPU wasting time in the Python interpreter rather than crunching numbers to fit your statistical model. It's generally a good idea to consider NumPy and SciPy performance tips: -http://wiki.scipy.org/PerformanceTips +http://scipy.github.io/old-wiki/pages/PerformanceTips Sometimes however an algorithm cannot be expressed efficiently in simple vectorized Numpy code. In this case, the recommended strategy is the @@ -304,7 +304,7 @@ Memory usage profiling ====================== You can analyze in detail the memory usage of any Python code with the help of -`memory_profiler `_. First, +`memory_profiler `_. 
First, install the latest version:: $ pip install -U memory_profiler @@ -421,8 +421,8 @@ Using yep and google-perftools Easy profiling without special compilation options use yep: -- http://pypi.python.org/pypi/yep -- http://fseoane.net/blog/2011/a-profiler-for-python-extensions/ +- https://pypi.python.org/pypi/yep +- http://fa.bianp.net/blog/2011/a-profiler-for-python-extensions .. note:: @@ -430,7 +430,7 @@ Easy profiling without special compilation options use yep: can be triggered with the ``--lines`` option. However this does not seem to work correctly at the time of writing. This issue can be tracked on the `project issue tracker - `_. + `_. @@ -460,7 +460,7 @@ TODO: give a simple teaser example here. Checkout the official joblib documentation: -- http://packages.python.org/joblib/ +- https://pythonhosted.org/joblib .. _warm-restarts: diff --git a/doc/developers/utilities.rst b/doc/developers/utilities.rst index 88096a1b77519..9ef9f6cd3a886 100644 --- a/doc/developers/utilities.rst +++ b/doc/developers/utilities.rst @@ -93,7 +93,7 @@ Efficient Linear Algebra & Array Operations by directly calling the BLAS ``nrm2`` function. This is more stable than ``scipy.linalg.norm``. See `Fabian's blog post - `_ for a discussion. + `_ for a discussion. - :func:`extmath.fast_logdet`: efficiently compute the log of the determinant of a matrix. diff --git a/doc/install.rst b/doc/install.rst index 7edcd72c9a4d7..0b58c0b6e28a2 100644 --- a/doc/install.rst +++ b/doc/install.rst @@ -51,8 +51,8 @@ Canopy and Anaconda for all supported platforms ----------------------------------------------- `Canopy -`_ and `Anaconda -`_ both ship a recent +`_ and `Anaconda +`_ both ship a recent version of scikit-learn, in addition to a large set of scientific python library for Windows, Mac OSX and Linux. @@ -83,9 +83,8 @@ Anaconda offers scikit-learn as part of its free distribution. Python(x,y) for Windows ----------------------- -The `Python(x,y) `_ project distributes -scikit-learn as an additional plugin, which can be found in the `Additional -plugins `_ page. +The `Python(x,y) `_ project distributes +scikit-learn as an additional plugin. For installation instructions for particular operating systems or for compiling diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 19a08b7fb6428..352c346752a18 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -998,7 +998,7 @@ random labelings by defining the adjusted Rand index as follows: .. topic:: References * `Comparing Partitions - `_ + `_ L. Hubert and P. Arabie, Journal of Classification 1985 * `Wikipedia entry for the adjusted Rand index diff --git a/doc/modules/computational_performance.rst b/doc/modules/computational_performance.rst index cc5a792a47d57..a3a488ca6ddcc 100644 --- a/doc/modules/computational_performance.rst +++ b/doc/modules/computational_performance.rst @@ -241,8 +241,8 @@ Linear algebra libraries As scikit-learn relies heavily on Numpy/Scipy and linear algebra in general it makes sense to take explicit care of the versions of these libraries. Basically, you ought to make sure that Numpy is built using an optimized `BLAS -`_ / -`LAPACK `_ library. +`_ / +`LAPACK `_ library. Not all models benefit from optimized BLAS and Lapack implementations. For instance models based on (randomized) decision trees typically do not rely on @@ -308,7 +308,7 @@ compromise between model compactness and prediction power. 
One can also further tune the ``l1_ratio`` parameter (in combination with the regularization strength ``alpha``) to control this tradeoff. -A typical `benchmark `_ +A typical `benchmark `_ on synthetic data yields a >30% decrease in latency when both the model and input are sparse (with 0.000024 and 0.027400 non-zero coefficients ratio respectively). Your mileage may vary depending on the sparsity and size of diff --git a/doc/modules/cross_validation.rst b/doc/modules/cross_validation.rst index 19aa62839161e..331f82be5303e 100644 --- a/doc/modules/cross_validation.rst +++ b/doc/modules/cross_validation.rst @@ -66,7 +66,7 @@ and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called -`cross-validation `_ +`cross-validation `_ (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. @@ -337,11 +337,11 @@ fold cross validation should be preferred to LOO. * http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html * T. Hastie, R. Tibshirani, J. Friedman, `The Elements of Statistical Learning - `_, Springer 2009 + `_, Springer 2009 * L. Breiman, P. Spector `Submodel selection and evaluation in regression: The X-random case `_, International Statistical Review 1992 * R. Kohavi, `A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection - `_, Intl. Jnt. Conf. AI + `_, Intl. Jnt. Conf. AI * R. Bharat Rao, G. Fung, R. Rosales, `On the Dangers of Cross-Validation. An Experimental Evaluation `_, SIAM 2008 * G. James, D. Witten, T. Hastie, R Tibshirani, `An Introduction to diff --git a/doc/modules/density.rst b/doc/modules/density.rst index c9f5c271f7f15..f96f4004e7323 100644 --- a/doc/modules/density.rst +++ b/doc/modules/density.rst @@ -139,7 +139,7 @@ The kernel density estimator can be used with any of the valid distance metrics (see :class:`sklearn.neighbors.DistanceMetric` for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is the -`Haversine distance `_ +`Haversine distance `_ which measures the angular distance between points on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index dc49655b1ada3..35ffd96cfc5f8 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -414,7 +414,7 @@ decision trees). Gradient Tree Boosting ====================== -`Gradient Tree Boosting `_ +`Gradient Tree Boosting `_ or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst index 8052c06ee28f4..32550b3c488e6 100644 --- a/doc/modules/feature_extraction.rst +++ b/doc/modules/feature_extraction.rst @@ -552,7 +552,7 @@ For an introduction to Unicode and character encodings in general, see Joel Spolsky's `Absolute Minimum Every Software Developer Must Know About Unicode `_. -.. _`ftfy`: http://github.com/LuminosoInsight/python-ftfy +.. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy Applications and examples @@ -748,7 +748,7 @@ An interesting development of using a :class:`HashingVectorizer` is the ability to perform `out-of-core`_ scaling. 
This means that we can learn from data that does not fit into the computer's main memory. -.. _out-of-core: http://en.wikipedia.org/wiki/Out-of-core_algorithm +.. _out-of-core: https://en.wikipedia.org/wiki/Out-of-core_algorithm A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using :class:`HashingVectorizer` diff --git a/doc/modules/feature_selection.rst b/doc/modules/feature_selection.rst index 60e4d0a38f7c8..79c644e4097b4 100644 --- a/doc/modules/feature_selection.rst +++ b/doc/modules/feature_selection.rst @@ -216,7 +216,7 @@ alpha parameter, the fewer features selected. **Reference** Richard G. Baraniuk "Compressive Sensing", IEEE Signal Processing Magazine [120] July 2007 - http://dsp.rice.edu/files/cs/baraniukCSlecture07.pdf + http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/baraniukCSlecture07.pdf .. _randomized_l1: @@ -256,10 +256,10 @@ of features non zero. * N. Meinshausen, P. Buhlmann, "Stability selection", Journal of the Royal Statistical Society, 72 (2010) - http://arxiv.org/pdf/0809.2932 + http://arxiv.org/pdf/0809.2932.pdf * F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap" - http://hal.inria.fr/hal-00354771/ + https://hal.inria.fr/hal-00354771/ Tree-based feature selection ---------------------------- diff --git a/doc/modules/kernel_approximation.rst b/doc/modules/kernel_approximation.rst index 063e79a27b471..fd0fe7be0b1d8 100644 --- a/doc/modules/kernel_approximation.rst +++ b/doc/modules/kernel_approximation.rst @@ -13,7 +13,7 @@ algorithms. .. currentmodule:: sklearn.linear_model The advantage of using approximate explicit feature maps compared to the -`kernel trick `_, +`kernel trick `_, which makes use of feature maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the cost of learning with very large datasets. diff --git a/doc/modules/label_propagation.rst b/doc/modules/label_propagation.rst index 80f865f01c4d4..31b598971358f 100644 --- a/doc/modules/label_propagation.rst +++ b/doc/modules/label_propagation.rst @@ -7,7 +7,7 @@ Semi-Supervised .. currentmodule:: sklearn.semi_supervised `Semi-supervised learning -`_ is a situation +`_ is a situation in which in your training data some of the samples are not labeled. The semi-supervised estimators in :mod:`sklearn.semi_supervised` are able to make use of this additional unlabeled data to better capture the shape of diff --git a/doc/modules/learning_curve.rst b/doc/modules/learning_curve.rst index 8708ef8c7acdf..39ecbcbe76a58 100644 --- a/doc/modules/learning_curve.rst +++ b/doc/modules/learning_curve.rst @@ -29,7 +29,7 @@ very well, i.e. it is very sensitive to varying training data (high variance). Bias and variance are inherent properties of estimators and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see `Bias-variance dilemma -`_). Another way to reduce +`_). Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated by an estimator with a lower variance. diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index 26efb04761a0b..295eb158da457 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -474,7 +474,7 @@ column is always zero. .. 
topic:: References: * Original Algorithm is detailed in the paper `Least Angle Regression - `_ + `_ by Hastie et al. @@ -530,7 +530,7 @@ parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand. This can be done by introducing `uninformative priors -`__ +`__ over the hyper parameters of the model. The :math:`\ell_{2}` regularization used in `Ridge Regression`_ is equivalent to finding a maximum a-postiori solution under a Gaussian prior over the @@ -579,7 +579,7 @@ The prior for the parameter :math:`w` is given by a spherical Gaussian: \mathcal{N}(w|0,\lambda^{-1}\bold{I_{p}}) The priors over :math:`\alpha` and :math:`\lambda` are chosen to be `gamma -distributions `__, the +distributions `__, the conjugate prior for the precision of the Gaussian. The resulting model is called *Bayesian Ridge Regression*, and is similar to the @@ -674,7 +674,7 @@ hyperparameters :math:`\lambda_1` and :math:`\lambda_2`. .. [1] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1 - .. [2] David Wipf and Srikantan Nagarajan: `A new view of automatic relevance determination. `_ + .. [2] David Wipf and Srikantan Nagarajan: `A new view of automatic relevance determination. `_ .. _Logistic_regression: @@ -683,10 +683,8 @@ Logistic regression Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as -logit regression, maximum-entropy classification (MaxEnt) or the log-linear -classifier. In this model, the probabilities describing the possible outcomes -of a single trial are modeled using a `logistic function -`_. +logit regression, maximum-entropy classification (MaxEnt) +or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a `logistic function `_. The implementation of logistic regression in scikit-learn can be accessed from class :class:`LogisticRegression`. This implementation can fit binary, One-vs- @@ -778,9 +776,7 @@ entropy loss. .. topic:: References: - .. [3] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4 - - .. [4] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. `_ + .. [3] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. `_ Stochastic Gradient Descent - SGD ================================= @@ -978,7 +974,7 @@ performance. .. topic:: References: - * http://en.wikipedia.org/wiki/RANSAC + * https://en.wikipedia.org/wiki/RANSAC * `"Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography" `_ @@ -1005,7 +1001,7 @@ better than an ordinary least squares in high dimension. .. topic:: References: - * http://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator + * https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator Theoretical considerations ^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/doc/modules/manifold.rst b/doc/modules/manifold.rst index 7fea314f0f6e5..b1b0aac40e769 100644 --- a/doc/modules/manifold.rst +++ b/doc/modules/manifold.rst @@ -343,7 +343,7 @@ The overall complexity of spectral embedding is * `"Laplacian Eigenmaps for Dimensionality Reduction and Data Representation" - `_ + `_ M. Belkin, P. 
Niyogi, Neural Computation, June 2003; 15 (6):1373-1396 @@ -397,7 +397,7 @@ The overall complexity of standard LTSA is Multi-dimensional Scaling (MDS) =============================== -`Multidimensional scaling `_ +`Multidimensional scaling `_ (:class:`MDS`) seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space. @@ -461,15 +461,15 @@ order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized. .. topic:: References: * `"Modern Multidimensional Scaling - Theory and Applications" - `_ + `_ Borg, I.; Groenen P. Springer Series in Statistics (1997) * `"Nonmetric multidimensional scaling: a numerical method" - `_ + `_ Kruskal, J. Psychometrika, 29 (1964) * `"Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis" - `_ + `_ Kruskal, J. Psychometrika, 29, (1964) .. _t_sne: diff --git a/doc/modules/mixture.rst b/doc/modules/mixture.rst index 6970e0b9e8e95..774013ac7e8da 100644 --- a/doc/modules/mixture.rst +++ b/doc/modules/mixture.rst @@ -122,7 +122,7 @@ data is that it is one usually doesn't know which points came from which latent component (if one has access to this information it gets very easy to fit a separate Gaussian distribution to each set of points). `Expectation-maximization -`_ +`_ is a well-founded statistical algorithm to get around this problem by an iterative process. First one assumes random components (randomly centered on data points, @@ -287,7 +287,7 @@ An important question is how can the Dirichlet process use an infinite, unbounded number of clusters and still be consistent. While a full explanation doesn't fit this manual, one can think of its `chinese restaurant process -`_ +`_ analogy to help understanding it. The chinese restaurant process is a generative story for the Dirichlet process. Imagine a chinese restaurant with an infinite number of diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index 020908d4f4d0f..5bfdb3c5b936d 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -314,7 +314,7 @@ Accuracy score -------------- The :func:`accuracy_score` function computes the -`accuracy `_, either the fraction +`accuracy `_, either the fraction (default) or the count (normalize=False) of correct predictions. @@ -332,7 +332,7 @@ defined as \texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i) where :math:`1(x)` is the `indicator function -`_. +`_. >>> import numpy as np >>> from sklearn.metrics import accuracy_score @@ -378,7 +378,7 @@ Confusion matrix The :func:`confusion_matrix` function evaluates classification accuracy by computing the `confusion matrix -`_. +`_. By definition, entry :math:`i, j` in a confusion matrix is the number of observations actually in group :math:`i`, but @@ -457,7 +457,7 @@ Hamming loss ------------- The :func:`hamming_loss` computes the average Hamming loss or `Hamming -distance `_ between two sets +distance `_ between two sets of samples. If :math:`\hat{y}_j` is the predicted value for the :math:`j`-th label of @@ -470,7 +470,7 @@ Hamming loss :math:`L_{Hamming}` between two samples is defined as: L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \not= y_j) where :math:`1(x)` is the `indicator function -`_. :: +`_. 
:: >>> from sklearn.metrics import hamming_loss >>> y_pred = [1, 2, 3, 4] @@ -501,7 +501,7 @@ Jaccard similarity coefficient score The :func:`jaccard_similarity_score` function computes the average (default) or sum of `Jaccard similarity coefficients -`_, also called the Jaccard index, +`_, also called the Jaccard index, between pairs of label sets. The Jaccard similarity coefficient of the :math:`i`-th samples, @@ -537,12 +537,12 @@ Precision, recall and F-measures --------------------------------- Intuitively, `precision -`_ is the ability +`_ is the ability of the classifier not to label as positive a sample that is negative, and -`recall `_ is the +`recall `_ is the ability of the classifier to find all the positive samples. -The `F-measure `_ +The `F-measure `_ (:math:`F_\beta` and :math:`F_1` measures) can be interpreted as a weighted harmonic mean of the precision and recall. A :math:`F_\beta` measure reaches its best value at 1 and its worst score at 0. @@ -747,7 +747,7 @@ Hinge loss The :func:`hinge_loss` function computes the average distance between the model and the data using -`hinge loss `_, a one-sided metric +`hinge loss `_, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support vector machines.) @@ -868,7 +868,7 @@ Matthews correlation coefficient --------------------------------- The :func:`matthews_corrcoef` function computes the -`Matthew's correlation coefficient (MCC) `_ +`Matthew's correlation coefficient (MCC) `_ for binary classes. Quoting Wikipedia: @@ -904,7 +904,7 @@ Receiver operating characteristic (ROC) --------------------------------------- The function :func:`roc_curve` computes the -`receiver operating characteristic curve, or ROC curve `_. +`receiver operating characteristic curve, or ROC curve `_. Quoting Wikipedia : "A receiver operating characteristic (ROC), or simply ROC curve, is a @@ -1006,7 +1006,7 @@ then the 0-1 loss :math:`L_{0-1}` is defined as: L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \not= y_i) where :math:`1(x)` is the `indicator function -`_. +`_. >>> from sklearn.metrics import zero_one_loss @@ -1094,7 +1094,7 @@ score. This metric will yield better scores if you are able to give better rank to the labels associated with each sample. The obtained score is always strictly greater than 0, and the best value is 1. If there is exactly one relevant label per sample, label ranking average precision is equivalent to the `mean -reciprocal rank `_. +reciprocal rank `_. Formally, given a binary indicator matrix of the ground truth labels :math:`y \in \mathcal{R}^{n_\text{samples} \times n_\text{labels}}` and the @@ -1198,11 +1198,11 @@ Explained variance score ------------------------- The :func:`explained_variance_score` computes the `explained variance -regression score `_. +regression score `_. If :math:`\hat{y}` is the estimated target output, :math:`y` the corresponding (correct) target output, and :math:`Var` is `Variance -`_, the square of the standard deviation, +`_, the square of the standard deviation, then the explained variance is estimated as follow: .. math:: @@ -1234,7 +1234,7 @@ Mean absolute error ------------------- The :func:`mean_absolute_error` function computes `mean absolute -error `_, a risk +error `_, a risk metric corresponding to the expected value of the absolute error loss or :math:`l1`-norm loss. 
@@ -1269,7 +1269,7 @@ Mean squared error ------------------- The :func:`mean_squared_error` function computes `mean square -error `_, a risk +error `_, a risk metric corresponding to the expected value of the squared (quadratic) error loss or loss. @@ -1334,7 +1334,7 @@ R² score, the coefficient of determination ------------------------------------------- The :func:`r2_score` function computes R², the `coefficient of -determination `_. +determination `_. It provides a measure of how well future samples are likely to be predicted by the model. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index dfa0d4646638e..a87688bb4c01a 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -14,7 +14,7 @@ Persistence example ------------------- It is possible to save a model in the scikit by using Python's built-in -persistence model, namely `pickle `_:: +persistence model, namely `pickle `_:: >>> from sklearn import svm >>> from sklearn import datasets diff --git a/doc/modules/neural_networks_supervised.rst b/doc/modules/neural_networks_supervised.rst index f21037132f732..1aaf541bd0d2f 100644 --- a/doc/modules/neural_networks_supervised.rst +++ b/doc/modules/neural_networks_supervised.rst @@ -122,7 +122,7 @@ of probability estimates :math:`P(y|x)` per sample :math:`x`:: [ 0., 1.]]) :class:`MLPClassifier` supports multi-class classification by -applying `Softmax `_ +applying `Softmax `_ as the output function. Further, the algorithm supports :ref:`multi-label classification ` @@ -173,9 +173,9 @@ Algorithms ========== MLP trains using `Stochastic Gradient Descent -`_, +`_, `Adam `_, or -`L-BFGS `__. +`L-BFGS `__. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e. diff --git a/doc/modules/preprocessing.rst b/doc/modules/preprocessing.rst index 9f94b5d762a59..37789df450aac 100644 --- a/doc/modules/preprocessing.rst +++ b/doc/modules/preprocessing.rst @@ -259,7 +259,7 @@ such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the `Vector Space Model -`_ often used in text +`_ often used in text classification and clustering contexts. The function :func:`normalize` provides a quick and easy way to perform this @@ -322,7 +322,7 @@ Feature binarization features to get boolean values**. This can be useful for downstream probabilistic estimators that make assumption that the input data is distributed according to a multi-variate `Bernoulli distribution -`_. For instance, +`_. For instance, this is the case for the :class:`sklearn.neural_network.BernoulliRBM`. It is also common among the text processing community to use binary @@ -517,7 +517,7 @@ In some cases, only interaction terms among features are required, and it can be The features of X have been transformed from :math:`(X_1, X_2, X_3)` to :math:`(1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3)`. -Note that polynomial features are used implicitily in `kernel methods `_ (e.g., :class:`sklearn.svm.SVC`, :class:`sklearn.decomposition.KernelPCA`) when using polynomial :ref:`svm_kernels`. +Note that polynomial features are used implicitily in `kernel methods `_ (e.g., :class:`sklearn.svm.SVC`, :class:`sklearn.decomposition.KernelPCA`) when using polynomial :ref:`svm_kernels`. 
See :ref:`example_linear_model_plot_polynomial_interpolation.py` for Ridge regression using created polynomial features. diff --git a/doc/modules/random_projection.rst b/doc/modules/random_projection.rst index e6ef3cb63e02a..d0f733b532c54 100644 --- a/doc/modules/random_projection.rst +++ b/doc/modules/random_projection.rst @@ -22,7 +22,7 @@ technique for distance based method. .. topic:: References: * Sanjoy Dasgupta. 2000. - `Experiments with random projection. `_ + `Experiments with random projection. `_ In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (UAI'00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151. @@ -41,7 +41,7 @@ The Johnson-Lindenstrauss lemma The main theoretical result behind the efficiency of random projection is the `Johnson-Lindenstrauss lemma (quoting Wikipedia) -`_: +`_: In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst index 862fbe914537b..6893ad7c02880 100644 --- a/doc/modules/sgd.rst +++ b/doc/modules/sgd.rst @@ -9,8 +9,8 @@ Stochastic Gradient Descent **Stochastic Gradient Descent (SGD)** is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) `Support Vector Machines -`_ and `Logistic -Regression `_. +`_ and `Logistic +Regression `_. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning. diff --git a/doc/modules/svm.rst b/doc/modules/svm.rst index fb3bcc46466bd..c3d0a770ccfd6 100644 --- a/doc/modules/svm.rst +++ b/doc/modules/svm.rst @@ -619,7 +619,7 @@ term :math:`\rho` : * `"Support-vector networks" - `_ + `_ C. Cortes, V. Vapnik, Machine Leaming, 20, 273-297 (1995) diff --git a/doc/modules/tree.rst b/doc/modules/tree.rst index 591786ac86053..118d22de7d291 100644 --- a/doc/modules/tree.rst +++ b/doc/modules/tree.rst @@ -410,8 +410,8 @@ and threshold that yield the largest information gain at each node. scikit-learn uses an optimised version of the CART algorithm. -.. _ID3: http://en.wikipedia.org/wiki/ID3_algorithm -.. _CART: http://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees +.. _ID3: https://en.wikipedia.org/wiki/ID3_algorithm +.. _CART: https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees .. _tree_mathematical_formulation: @@ -500,9 +500,9 @@ criterion to minimise is the Mean Squared Error .. topic:: References: - * http://en.wikipedia.org/wiki/Decision_tree_learning + * https://en.wikipedia.org/wiki/Decision_tree_learning - * http://en.wikipedia.org/wiki/Predictive_analytics + * https://en.wikipedia.org/wiki/Predictive_analytics * L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984. diff --git a/doc/presentations.rst b/doc/presentations.rst index e1e34e72cf859..52977d3daa61e 100644 --- a/doc/presentations.rst +++ b/doc/presentations.rst @@ -9,7 +9,7 @@ New to Scientific Python? ========================== For those that are still new to the scientific Python ecosystem, we highly recommend the `Python Scientific Lecture Notes -`_. This will help you find your footing a +`_. This will help you find your footing a bit and will definitely improve your scikit-learn experience. 
A basic understanding of NumPy arrays is recommended to make the most of scikit-learn. @@ -58,21 +58,21 @@ Videos use NLTK and scikit-learn to solve real-world text classification tasks and compares against cloud-based solutions. -- `Introduction to Interactive Predictive Analytics in Python with scikit-learn `_ +- `Introduction to Interactive Predictive Analytics in Python with scikit-learn `_ by `Olivier Grisel`_ at PyCon 2012 3-hours long introduction to prediction tasks using scikit-learn. -- `scikit-learn - Machine Learning in Python `_ +- `scikit-learn - Machine Learning in Python `_ by `Jake Vanderplas`_ at the 2012 PyData workshop at Google Interactive demonstration of some scikit-learn features. 75 minutes. -- `scikit-learn tutorial `_ by `Jake Vanderplas`_ at PyData NYC 2012 +- `scikit-learn tutorial `_ by `Jake Vanderplas`_ at PyData NYC 2012 Presentation using the online tutorial, 45 minutes. .. _Gael Varoquaux: http://gael-varoquaux.info -.. _Jake Vanderplas: http://www.astro.washington.edu/users/vanderplas/ -.. _Olivier Grisel: http://twitter.com/ogrisel +.. _Jake Vanderplas: http://staff.washington.edu/jakevdp +.. _Olivier Grisel: https://twitter.com/ogrisel diff --git a/doc/related_projects.rst b/doc/related_projects.rst index ea021dc568e4b..fdd66e97ed95c 100644 --- a/doc/related_projects.rst +++ b/doc/related_projects.rst @@ -148,7 +148,7 @@ Domain specific packages - `AstroML `_ Machine learning for astronomy. -- `MSMBuilder `_ Machine learning for protein +- `MSMBuilder `_ Machine learning for protein conformational dynamics time series. Snippets and tidbits diff --git a/doc/testimonials/testimonials.rst b/doc/testimonials/testimonials.rst index d4c27b7f4594e..0f9ee07df7a9c 100644 --- a/doc/testimonials/testimonials.rst +++ b/doc/testimonials/testimonials.rst @@ -82,7 +82,7 @@ Gaël Varoquaux, research at Parietal -`Evernote `_ +`Evernote `_ ---------------------------------- .. raw:: html @@ -149,7 +149,7 @@ Alexandre Gramfort, Assistant Professor -`AWeber `_ +`AWeber `_ ------------------------------------------ .. raw:: html @@ -158,7 +158,7 @@ Alexandre Gramfort, Assistant Professor .. image:: images/aweber.png :width: 120pt - :target: http://aweber.com/ + :target: http://www.aweber.com .. raw:: html @@ -188,7 +188,7 @@ Michael Becker, Software Engineer, Data Analysis and Management Ninjas -`Yhat `_ +`Yhat `_ ------------------------------------------ .. raw:: html @@ -197,7 +197,7 @@ Michael Becker, Software Engineer, Data Analysis and Management Ninjas .. image:: images/yhat.png :width: 120pt - :target: http://yhathq.com/ + :target: https://www.yhat.com .. raw:: html @@ -322,7 +322,7 @@ Eustache Diemert, Lead Scientist Bestofmedia Group -`Change.org `_ +`Change.org `_ -------------------------------------------------- .. raw:: html @@ -331,7 +331,7 @@ Eustache Diemert, Lead Scientist Bestofmedia Group .. image:: images/change-logo.png :width: 120pt - :target: http://www.change.org + :target: https://www.change.org .. raw:: html @@ -423,7 +423,7 @@ Daniel Weitzenfeld, Senior Data Scientist at HowAboutWe -`PeerIndex `_ +`PeerIndex `_ ---------------------------------------- .. raw:: html @@ -519,7 +519,7 @@ David Koh - Senior Data Scientist at OkCupid -`Lovely `_ +`Lovely `_ ----------------------------------------- .. raw:: html @@ -528,7 +528,7 @@ David Koh - Senior Data Scientist at OkCupid .. image:: images/lovely.png :width: 120pt - :target: https://www.livelovely.com + :target: https://livelovely.com .. 
raw:: html diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst index 873f9f611a798..f7e49d4e704c1 100644 --- a/doc/tutorial/basic/tutorial.rst +++ b/doc/tutorial/basic/tutorial.rst @@ -6,7 +6,7 @@ An introduction to machine learning with scikit-learn .. topic:: Section contents In this section, we introduce the `machine learning - `_ + `_ vocabulary that we use throughout scikit-learn and give a simple learning example. @@ -15,22 +15,22 @@ Machine learning: the problem setting ------------------------------------- In general, a learning problem considers a set of n -`samples `_ of +`samples `_ of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry -(aka `multivariate `_ +(aka `multivariate `_ data), it is said to have several attributes or **features**. We can separate learning problems in a few large categories: - * `supervised learning `_, + * `supervised learning `_, in which the data comes with additional attributes that we want to predict (:ref:`Click here ` to go to the scikit-learn supervised learning page).This problem can be either: * `classification - `_: + `_: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of classification problem would @@ -41,19 +41,19 @@ We can separate learning problems in a few large categories: limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class. - * `regression `_: + * `regression `_: if the desired output consists of one or more continuous variables, then the task is called *regression*. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight. - * `unsupervised learning `_, + * `unsupervised learning `_, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where - it is called `clustering `_, + it is called `clustering `_, or to determine the distribution of data within the input space, known as - `density estimation `_, or + `density estimation `_, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of *visualization* (:ref:`Click here ` @@ -74,7 +74,7 @@ Loading an example dataset -------------------------- `scikit-learn` comes with a few standard datasets, for instance the -`iris `_ and `digits +`iris `_ and `digits `_ datasets for classification and the `boston house prices dataset `_ for regression. @@ -144,7 +144,7 @@ Learning and predicting In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we *fit* an -`estimator `_ to be able to *predict* +`estimator `_ to be able to *predict* the classes to which unseen samples belong. In scikit-learn, an estimator for classification is a Python object that @@ -152,7 +152,7 @@ implements the methods ``fit(X, y)`` and ``predict(T)``. An example of an estimator is the class ``sklearn.svm.SVC`` that implements `support vector classification -`_. The +`_. 
The constructor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider the estimator as a black box:: @@ -207,7 +207,7 @@ Model persistence ----------------- It is possible to save a model in the scikit by using Python's built-in -persistence model, namely `pickle `_:: +persistence model, namely `pickle `_:: >>> from sklearn import svm >>> from sklearn import datasets diff --git a/doc/tutorial/statistical_inference/finding_help.rst b/doc/tutorial/statistical_inference/finding_help.rst index 0587a19ad85ba..3dc1e3215eef6 100644 --- a/doc/tutorial/statistical_inference/finding_help.rst +++ b/doc/tutorial/statistical_inference/finding_help.rst @@ -26,7 +26,7 @@ Q&A communities with Machine Learning practitioners Quora has a topic for Machine Learning related questions that also features some interesting discussions: - http://quora.com/Machine-Learning + https://www.quora.com/topic/Machine-Learning Have a look at the best questions section, eg: `What are some good resources for learning about machine learning`_. diff --git a/doc/tutorial/statistical_inference/index.rst b/doc/tutorial/statistical_inference/index.rst index 19cfa01302325..a298e61d03b13 100644 --- a/doc/tutorial/statistical_inference/index.rst +++ b/doc/tutorial/statistical_inference/index.rst @@ -6,7 +6,7 @@ A tutorial on statistical-learning for scientific data processing .. topic:: Statistical learning - `Machine learning `_ is + `Machine learning `_ is a technique with a growing importance, as the size of the datasets experimental sciences are facing is rapidly growing. Problems it tackles range from building a prediction function @@ -15,14 +15,14 @@ A tutorial on statistical-learning for scientific data processing This tutorial will explore *statistical learning*, the use of machine learning techniques with the goal of `statistical inference - `_: + `_: drawing conclusions on the data at hand. Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (`NumPy `_, `SciPy `_, `matplotlib - `_). + `_). .. include:: ../../includes/big_toc_css.rst diff --git a/doc/tutorial/statistical_inference/supervised_learning.rst b/doc/tutorial/statistical_inference/supervised_learning.rst index aa157d6e1cf7f..a65601e173fcb 100644 --- a/doc/tutorial/statistical_inference/supervised_learning.rst +++ b/doc/tutorial/statistical_inference/supervised_learning.rst @@ -13,7 +13,7 @@ Supervised learning: predicting an output variable from high-dimensional observa are trying to predict, usually called "target" or "labels". Most often, ``y`` is a 1D array of length ``n_samples``. - All supervised `estimators `_ + All supervised `estimators `_ in scikit-learn implement a ``fit(X, y)`` method to fit the model and a ``predict(X)`` method that, given unlabeled observations ``X``, returns the predicted labels ``y``. @@ -59,7 +59,7 @@ k-Nearest neighbors classifier ------------------------------- The simplest possible classifier is the -`nearest neighbor `_: +`nearest neighbor `_: given a new observation ``X_test``, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector. (Please see the :ref:`Nearest Neighbors section` of the online @@ -128,7 +128,7 @@ require more training data than the current estimated size of the entire internet (±1000 Exabytes or so). 
This is called the -`curse of dimensionality `_ +`curse of dimensionality `_ and is a core problem that machine learning addresses. Linear model: from regression to sparsity @@ -265,9 +265,9 @@ diabetes dataset rather than our synthetic data:: Capturing in the fitted parameters noise that prevents the model to generalize to new data is called - `overfitting `_. The bias introduced + `overfitting `_. The bias introduced by the ridge regression is called a - `regularization `_. + `regularization `_. .. _sparsity: @@ -339,7 +339,7 @@ application of Occam's razor: *prefer simpler models*. Different algorithms can be used to solve the same mathematical problem. For instance the ``Lasso`` object in scikit-learn solves the lasso regression problem using a - `coordinate decent `_ method, + `coordinate decent `_ method, that is efficient on large datasets. However, scikit-learn also provides the :class:`LassoLars` object using the *LARS* algorthm, which is very efficient for problems in which the weight vector estimated @@ -356,7 +356,7 @@ Classification :align: right For classification, as in the labeling -`iris `_ task, linear +`iris `_ task, linear regression is not the right approach as it will give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function or **logistic** function: diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst b/doc/tutorial/statistical_inference/unsupervised_learning.rst index 1b66f2d2780ae..4cf54ce532068 100644 --- a/doc/tutorial/statistical_inference/unsupervised_learning.rst +++ b/doc/tutorial/statistical_inference/unsupervised_learning.rst @@ -106,7 +106,7 @@ algorithms. The simplest clustering algorithm is Clustering in general and KMeans, in particular, can be seen as a way of choosing a small number of exemplars to compress the information. The problem is sometimes known as - `vector quantization `_. + `vector quantization `_. For instance, this can be used to posterize an image:: >>> import scipy as sp diff --git a/doc/tutorial/text_analytics/data/languages/fetch_data.py b/doc/tutorial/text_analytics/data/languages/fetch_data.py index 2abef7425fb5f..6ece4eb1b7fb7 100644 --- a/doc/tutorial/text_analytics/data/languages/fetch_data.py +++ b/doc/tutorial/text_analytics/data/languages/fetch_data.py @@ -17,7 +17,7 @@ pages = { u'ar': u'http://ar.wikipedia.org/wiki/%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7', u'de': u'http://de.wikipedia.org/wiki/Wikipedia', - u'en': u'http://en.wikipedia.org/wiki/Wikipedia', + u'en': u'https://en.wikipedia.org/wiki/Wikipedia', u'es': u'http://es.wikipedia.org/wiki/Wikipedia', u'fr': u'http://fr.wikipedia.org/wiki/Wikip%C3%A9dia', u'it': u'http://it.wikipedia.org/wiki/Wikipedia', diff --git a/doc/tutorial/text_analytics/working_with_text_data.rst b/doc/tutorial/text_analytics/working_with_text_data.rst index 184b02a09dd11..47036a716c28f 100644 --- a/doc/tutorial/text_analytics/working_with_text_data.rst +++ b/doc/tutorial/text_analytics/working_with_text_data.rst @@ -251,7 +251,7 @@ corpus. This downscaling is called `tf–idf`_ for "Term Frequency times Inverse Document Frequency". -.. _`tf–idf`: http://en.wikipedia.org/wiki/Tf–idf +.. _`tf–idf`: https://en.wikipedia.org/wiki/Tf–idf Both **tf** and **tf–idf** can be computed as follows:: @@ -553,7 +553,7 @@ upon the completion of this tutorial: at the :ref:`Multiclass and multilabel section ` * Try using :ref:`Truncated SVD ` for - `latent semantic analysis `_. + `latent semantic analysis `_. 
* Have a look at using :ref:`Out-of-core Classification diff --git a/doc/whats_new.rst b/doc/whats_new.rst index 5d3a8b8772618..5194226f7d38b 100644 --- a/doc/whats_new.rst +++ b/doc/whats_new.rst @@ -3077,7 +3077,7 @@ as well as several new algorithms and documentation improvements. This release also includes the dictionary-learning work developed by `Vlad Niculae`_ as part of the `Google Summer of Code -`_ program. +`_ program. @@ -3770,7 +3770,7 @@ Earlier versions Earlier versions included contributions by Fred Mailhot, David Cooke, David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. -.. _Olivier Grisel: http://twitter.com/ogrisel +.. _Olivier Grisel: https://twitter.com/ogrisel .. _Gael Varoquaux: http://gael-varoquaux.info @@ -3790,7 +3790,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Edouard Duchesnay: https://sites.google.com/site/duchesnay/home -.. _Peter Prettenhofer: http://sites.google.com/site/peterprettenhofer/ +.. _Peter Prettenhofer: https://sites.google.com/site/peterprettenhofer/ .. _Alexandre Passos: http://atpassos.me @@ -3804,7 +3804,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Matthieu Perrot: http://brainvisa.info/biblio/lnao/en/Author/PERROT-M.html -.. _Jake Vanderplas: http://www.astro.washington.edu/users/vanderplas/ +.. _Jake Vanderplas: http://staff.washington.edu/jakevdp/ .. _Gilles Louppe: http://www.montefiore.ulg.ac.be/~glouppe/ @@ -3816,23 +3816,23 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _David Warde-Farley: http://www-etud.iro.umontreal.ca/~wardefar/ -.. _Brian Holt: http://info.ee.surrey.ac.uk/Personal/B.Holt/ +.. _Brian Holt: http://personal.ee.surrey.ac.uk/Personal/B.Holt .. _Satrajit Ghosh: http://www.mit.edu/~satra/ -.. _Robert Layton: http://www.twitter.com/robertlayton +.. _Robert Layton: https://twitter.com/robertlayton -.. _Scott White: http://twitter.com/scottblanc +.. _Scott White: https://twitter.com/scottblanc .. _Jaques Grobler: https://github.com/jaquesgrobler/scikit-learn/wiki/Jaques-Grobler .. _David Marek: http://www.davidmarek.cz/ -.. _@kernc: http://github.com/kernc +.. _@kernc: https://github.com/kernc -.. _Christian Osendorfer: http://osdf.github.com +.. _Christian Osendorfer: https://osdf.github.io -.. _Noel Dawe: http://noel.dawe.me +.. _Noel Dawe: https://github.com/ndawe .. _Arnaud Joly: http://www.ajoly.org @@ -3926,7 +3926,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Cathy Deng: https://github.com/cathydeng -.. _Will Dawson: http://dawsonresearch.com +.. _Will Dawson: http://www.dawsonresearch.com .. _Balazs Kegl: https://github.com/kegl @@ -3938,7 +3938,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Hanna Wallach: http://dirichlet.net/ -.. _Yan Yi: http://www.seowyanyi.org +.. _Yan Yi: http://seowyanyi.org .. _Kyle Beauchamp: https://github.com/kyleabeauchamp @@ -3948,7 +3948,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Dan Blanchard: https://github.com/dan-blanchard -.. _Eric Martin: http://ericmart.in +.. _Eric Martin: http://www.ericmart.in .. _Nicolas Goix: https://webperso.telecom-paristech.fr/front/frontoffice.php?SP_ID=241 @@ -3977,7 +3977,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson. .. _Daniel Galvez: https://github.com/galv .. _Jacob Schreiber: https://github.com/jmschrei .. _Ankur Ankan: https://github.com/ankurankan -.. 
_Valentin Stolbunov: http://vstolbunov.com +.. _Valentin Stolbunov: http://www.vstolbunov.com .. _Jean Kossaifi: https://github.com/JeanKossaifi .. _Andrew Lamb: https://github.com/andylamb .. _Graham Clenaghan: https://github.com/gclenaghan diff --git a/examples/applications/plot_species_distribution_modeling.py b/examples/applications/plot_species_distribution_modeling.py index 3bbc580b017c5..6dab5fa8c9063 100644 --- a/examples/applications/plot_species_distribution_modeling.py +++ b/examples/applications/plot_species_distribution_modeling.py @@ -19,11 +19,11 @@ The two species are: - `"Bradypus variegatus" - `_ , + `_ , the Brown-throated Sloth. - `"Microryzomys minutus" - `_ , + `_ , also known as the Forest Small Rice Rat, a rodent that lives in Peru, Colombia, Ecuador, Peru, and Venezuela. diff --git a/examples/applications/wikipedia_principal_eigenvector.py b/examples/applications/wikipedia_principal_eigenvector.py index ed394ebfb57e2..4a5493a81c76e 100644 --- a/examples/applications/wikipedia_principal_eigenvector.py +++ b/examples/applications/wikipedia_principal_eigenvector.py @@ -8,7 +8,7 @@ so as to assign to each vertex the values of the components of the first eigenvector as a centrality score: - http://en.wikipedia.org/wiki/Eigenvector_centrality + https://en.wikipedia.org/wiki/Eigenvector_centrality On the graph of webpages and links those values are called the PageRank scores by Google. @@ -20,7 +20,7 @@ The traditional way to compute the principal eigenvector is to use the power iteration method: - http://en.wikipedia.org/wiki/Power_iteration + https://en.wikipedia.org/wiki/Power_iteration Here the computation is achieved thanks to Martinsson's Randomized SVD algorithm implemented in the scikit. diff --git a/examples/calibration/plot_calibration.py b/examples/calibration/plot_calibration.py index 299f924e2a468..b38b25812bb7f 100644 --- a/examples/calibration/plot_calibration.py +++ b/examples/calibration/plot_calibration.py @@ -11,7 +11,7 @@ probabilities is often desirable as a postprocessing. This example illustrates two different methods for this calibration and evaluates the quality of the returned probabilities using Brier's score -(see http://en.wikipedia.org/wiki/Brier_score). +(see https://en.wikipedia.org/wiki/Brier_score). Compared are the estimated probability using a Gaussian naive Bayes classifier without calibration, with a sigmoid calibration, and with a non-parametric diff --git a/examples/datasets/plot_iris_dataset.py b/examples/datasets/plot_iris_dataset.py index 2436dac67253c..fc8790762d1de 100644 --- a/examples/datasets/plot_iris_dataset.py +++ b/examples/datasets/plot_iris_dataset.py @@ -13,7 +13,7 @@ Sepal Length, Sepal Width, Petal Length and Petal Width. The below plot uses the first two features. -See `here `_ for more +See `here `_ for more information on this dataset. """ print(__doc__) diff --git a/examples/decomposition/plot_pca_iris.py b/examples/decomposition/plot_pca_iris.py index 67a679e8e5677..f8451915b4412 100644 --- a/examples/decomposition/plot_pca_iris.py +++ b/examples/decomposition/plot_pca_iris.py @@ -8,7 +8,7 @@ Principal Component Analysis applied to the Iris dataset. -See `here `_ for more +See `here `_ for more information on this dataset. 
""" diff --git a/examples/linear_model/plot_iris_logistic.py b/examples/linear_model/plot_iris_logistic.py index 4cd705dc32df3..5186a775cd8aa 100644 --- a/examples/linear_model/plot_iris_logistic.py +++ b/examples/linear_model/plot_iris_logistic.py @@ -7,7 +7,7 @@ ========================================================= Show below is a logistic-regression classifiers decision boundaries on the -`iris `_ dataset. The +`iris `_ dataset. The datapoints are colored according to their labels. """ diff --git a/examples/manifold/plot_manifold_sphere.py b/examples/manifold/plot_manifold_sphere.py index 77b37e7877785..744eb2f37675b 100644 --- a/examples/manifold/plot_manifold_sphere.py +++ b/examples/manifold/plot_manifold_sphere.py @@ -24,7 +24,7 @@ it does not seeks an isotropic representation of the data in the low-dimensional space. Here the manifold problem matches fairly that of representing a flat map of the Earth, as with -`map projection `_ +`map projection `_ """ # Author: Jaques Grobler diff --git a/examples/plot_johnson_lindenstrauss_bound.py b/examples/plot_johnson_lindenstrauss_bound.py index b5b1e6062218b..de734e4d2f000 100644 --- a/examples/plot_johnson_lindenstrauss_bound.py +++ b/examples/plot_johnson_lindenstrauss_bound.py @@ -8,7 +8,7 @@ dataset can be randomly projected into a lower dimensional Euclidean space while controlling the distortion in the pairwise distances. -.. _`Johnson-Lindenstrauss lemma`: http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma +.. _`Johnson-Lindenstrauss lemma`: https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma Theoretical bounds diff --git a/sklearn/datasets/samples_generator.py b/sklearn/datasets/samples_generator.py index 6a1533b9b6aba..5f564d7b2d04b 100644 --- a/sklearn/datasets/samples_generator.py +++ b/sklearn/datasets/samples_generator.py @@ -1293,7 +1293,7 @@ def make_swiss_roll(n_samples=100, noise=0.0, random_state=None): ---------- .. [1] S. Marsland, "Machine Learning: An Algorithmic Perspective", Chapter 10, 2009. - http://www-ist.massey.ac.nz/smarsland/Code/10/lle.py + http://seat.massey.ac.nz/personal/s.r.marsland/Code/10/lle.py """ generator = check_random_state(random_state) diff --git a/sklearn/feature_selection/univariate_selection.py b/sklearn/feature_selection/univariate_selection.py index 2b0617c81636e..538b407ef3959 100644 --- a/sklearn/feature_selection/univariate_selection.py +++ b/sklearn/feature_selection/univariate_selection.py @@ -551,7 +551,7 @@ class SelectFdr(_BaseFilter): References ---------- - http://en.wikipedia.org/wiki/False_discovery_rate + https://en.wikipedia.org/wiki/False_discovery_rate See also -------- diff --git a/sklearn/isotonic.py b/sklearn/isotonic.py index 1a805d1625b63..d90b8a33c03ac 100644 --- a/sklearn/isotonic.py +++ b/sklearn/isotonic.py @@ -48,7 +48,7 @@ def check_increasing(x, y): References ---------- Fisher transformation. Wikipedia. - http://en.wikipedia.org/w/index.php?title=Fisher_transformation + https://en.wikipedia.org/wiki/Fisher_transformation """ # Calculate Spearman rho estimate and set return accordingly. @@ -61,7 +61,7 @@ def check_increasing(x, y): F_se = 1 / math.sqrt(len(x) - 3) # Use a 95% CI, i.e., +/-1.96 S.E. 
- # http://en.wikipedia.org/wiki/Fisher_transformation + # https://en.wikipedia.org/wiki/Fisher_transformation rho_0 = math.tanh(F - 1.96 * F_se) rho_1 = math.tanh(F + 1.96 * F_se) diff --git a/sklearn/linear_model/least_angle.py b/sklearn/linear_model/least_angle.py index 8f6531b7d7f59..bbbad80575d08 100644 --- a/sklearn/linear_model/least_angle.py +++ b/sklearn/linear_model/least_angle.py @@ -135,13 +135,13 @@ def lars_path(X, y, Xy=None, Gram=None, max_iter=500, References ---------- .. [1] "Least Angle Regression", Effron et al. - http://www-stat.stanford.edu/~tibs/ftp/lars.pdf + http://statweb.stanford.edu/~tibs/ftp/lars.pdf .. [2] `Wikipedia entry on the Least-angle regression - `_ + `_ .. [3] `Wikipedia entry on the Lasso - `_ + `_ """ @@ -1360,8 +1360,8 @@ class LassoLarsIC(LassoLars): Hui Zou, Trevor Hastie, and Robert Tibshirani Ann. Statist. Volume 35, Number 5 (2007), 2173-2192. - http://en.wikipedia.org/wiki/Akaike_information_criterion - http://en.wikipedia.org/wiki/Bayesian_information_criterion + https://en.wikipedia.org/wiki/Akaike_information_criterion + https://en.wikipedia.org/wiki/Bayesian_information_criterion See also -------- diff --git a/sklearn/linear_model/perceptron.py b/sklearn/linear_model/perceptron.py index 0eb2ac2d3af0b..76f8c648c7201 100644 --- a/sklearn/linear_model/perceptron.py +++ b/sklearn/linear_model/perceptron.py @@ -84,7 +84,7 @@ class Perceptron(BaseSGDClassifier, _LearntSelectorMixin): References ---------- - http://en.wikipedia.org/wiki/Perceptron and references therein. + https://en.wikipedia.org/wiki/Perceptron and references therein. """ def __init__(self, penalty=None, alpha=0.0001, fit_intercept=True, n_iter=5, shuffle=True, verbose=0, eta0=1.0, n_jobs=1, diff --git a/sklearn/linear_model/ransac.py b/sklearn/linear_model/ransac.py index 12c45b26aa567..daf4ccd66fd71 100644 --- a/sklearn/linear_model/ransac.py +++ b/sklearn/linear_model/ransac.py @@ -153,7 +153,7 @@ class RANSACRegressor(BaseEstimator, MetaEstimatorMixin, RegressorMixin): References ---------- - .. [1] http://en.wikipedia.org/wiki/RANSAC + .. [1] https://en.wikipedia.org/wiki/RANSAC .. [2] http://www.cs.columbia.edu/~belhumeur/courses/compPhoto/ransac.pdf .. [3] http://www.bmva.org/bmvc/2009/Papers/Paper355/Paper355.pdf """ diff --git a/sklearn/linear_model/sgd_fast.pyx b/sklearn/linear_model/sgd_fast.pyx index 56c087dea0d08..df9fb8be09ed1 100644 --- a/sklearn/linear_model/sgd_fast.pyx +++ b/sklearn/linear_model/sgd_fast.pyx @@ -242,7 +242,7 @@ cdef class Huber(Regression): Variant of the SquaredLoss that is robust to outliers (quadratic near zero, linear in for large errors). 
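    As a sketch, the piecewise form usually given in the reference below is

        loss(r) = 0.5 * r ** 2              if abs(r) <= c
                  c * abs(r) - 0.5 * c ** 2 otherwise

    where ``c`` is the threshold attribute of this class; the constants used
    internally by the implementation are not restated here.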
- http://en.wikipedia.org/wiki/Huber_Loss_Function + https://en.wikipedia.org/wiki/Huber_Loss_Function """ cdef double c diff --git a/sklearn/manifold/spectral_embedding_.py b/sklearn/manifold/spectral_embedding_.py index 0d011d4c54592..e6fed53ae1bba 100644 --- a/sklearn/manifold/spectral_embedding_.py +++ b/sklearn/manifold/spectral_embedding_.py @@ -193,7 +193,7 @@ def spectral_embedding(adjacency, n_components=8, eigen_solver=None, References ---------- - * http://en.wikipedia.org/wiki/LOBPCG + * https://en.wikipedia.org/wiki/LOBPCG * Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method diff --git a/sklearn/metrics/classification.py b/sklearn/metrics/classification.py index 51405e1a21296..e78d694ff59ed 100644 --- a/sklearn/metrics/classification.py +++ b/sklearn/metrics/classification.py @@ -210,7 +210,7 @@ def confusion_matrix(y_true, y_pred, labels=None): References ---------- .. [1] `Wikipedia entry for the Confusion matrix - `_ + `_ Examples -------- @@ -358,7 +358,7 @@ def jaccard_similarity_score(y_true, y_pred, normalize=True, References ---------- .. [1] `Wikipedia entry for the Jaccard index - `_ + `_ Examples @@ -437,7 +437,7 @@ def matthews_corrcoef(y_true, y_pred): `_ .. [2] `Wikipedia entry for the Matthews Correlation Coefficient - `_ + `_ Examples -------- @@ -616,7 +616,7 @@ def f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', References ---------- - .. [1] `Wikipedia entry for the F1-score `_ + .. [1] `Wikipedia entry for the F1-score `_ Examples -------- @@ -726,7 +726,7 @@ def fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1, Modern Information Retrieval. Addison Wesley, pp. 327-328. .. [2] `Wikipedia entry for the F1-score - `_ + `_ Examples -------- @@ -910,10 +910,10 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, References ---------- .. [1] `Wikipedia entry for the Precision and recall - `_ + `_ .. [2] `Wikipedia entry for the F1-score - `_ + `_ .. [3] `Discriminative Methods for Multi-labeled Classification Advances in Knowledge Discovery and Data Mining (2004), pp. 22-30 by Shantanu @@ -1456,7 +1456,7 @@ def hamming_loss(y_true, y_pred, classes=None, sample_weight=None): 3(3), 1-13, July-September 2007. .. [2] `Wikipedia entry on the Hamming distance - `_ + `_ Examples -------- @@ -1618,7 +1618,7 @@ def hinge_loss(y_true, pred_decision, labels=None, sample_weight=None): References ---------- .. [1] `Wikipedia entry on the Hinge loss - `_ + `_ .. [2] Koby Crammer, Yoram Singer. On the Algorithmic Implementation of Multiclass Kernel-based Vector @@ -1787,7 +1787,7 @@ def brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None): References ---------- - http://en.wikipedia.org/wiki/Brier_score + https://en.wikipedia.org/wiki/Brier_score """ y_true = column_or_1d(y_true) y_prob = column_or_1d(y_prob) diff --git a/sklearn/metrics/cluster/supervised.py b/sklearn/metrics/cluster/supervised.py index b61a528fb3819..f8ed5f98bfd87 100644 --- a/sklearn/metrics/cluster/supervised.py +++ b/sklearn/metrics/cluster/supervised.py @@ -177,9 +177,9 @@ def adjusted_rand_score(labels_true, labels_pred, max_n_classes=5000): .. [Hubert1985] `L. Hubert and P. Arabie, Comparing Partitions, Journal of Classification 1985` - http://www.springerlink.com/content/x64124718341j1j0/ + http://link.springer.com/article/10.1007%2FBF01908075 - .. [wk] http://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index + .. 
[wk] https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index See also -------- @@ -702,7 +702,7 @@ def adjusted_mutual_info_score(labels_true, labels_pred, max_n_classes=5000): `_ .. [2] `Wikipedia entry for the Adjusted Mutual Information - `_ + `_ """ labels_true, labels_pred = check_clusterings(labels_true, labels_pred) diff --git a/sklearn/metrics/cluster/unsupervised.py b/sklearn/metrics/cluster/unsupervised.py index 42e7610af0f4e..96fc3e227afa1 100644 --- a/sklearn/metrics/cluster/unsupervised.py +++ b/sklearn/metrics/cluster/unsupervised.py @@ -76,7 +76,7 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None, `_ .. [2] `Wikipedia entry on the Silhouette Coefficient - `_ + `_ """ n_labels = len(np.unique(labels)) @@ -152,7 +152,7 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds): `_ .. [2] `Wikipedia entry on the Silhouette Coefficient - `_ + `_ """ distances = pairwise_distances(X, metric=metric, **kwds) diff --git a/sklearn/metrics/ranking.py b/sklearn/metrics/ranking.py index 3e6bd0dc31861..b6ae626a7e41f 100644 --- a/sklearn/metrics/ranking.py +++ b/sklearn/metrics/ranking.py @@ -150,7 +150,7 @@ def average_precision_score(y_true, y_score, average="macro", References ---------- .. [1] `Wikipedia entry for the Average precision - `_ + `_ See also -------- @@ -221,7 +221,7 @@ def roc_auc_score(y_true, y_score, average="macro", sample_weight=None): References ---------- .. [1] `Wikipedia entry for the Receiver operating characteristic - `_ + `_ See also -------- @@ -475,7 +475,7 @@ class or confidence values. References ---------- .. [1] `Wikipedia entry for the Receiver operating characteristic - `_ + `_ Examples diff --git a/sklearn/metrics/regression.py b/sklearn/metrics/regression.py index d6eb96057e0f3..46871d929f759 100644 --- a/sklearn/metrics/regression.py +++ b/sklearn/metrics/regression.py @@ -425,7 +425,7 @@ def r2_score(y_true, y_pred, References ---------- .. [1] `Wikipedia entry on the Coefficient of determination - `_ + `_ Examples -------- diff --git a/sklearn/neighbors/classification.py b/sklearn/neighbors/classification.py index 4288fa1e90fd2..1e765e8e4a8e2 100644 --- a/sklearn/neighbors/classification.py +++ b/sklearn/neighbors/classification.py @@ -114,7 +114,7 @@ class KNeighborsClassifier(NeighborsBase, KNeighborsMixin, but different labels, the results will depend on the ordering of the training data. - http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm + https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm """ def __init__(self, n_neighbors=5, @@ -312,7 +312,7 @@ class RadiusNeighborsClassifier(NeighborsBase, RadiusNeighborsMixin, See :ref:`Nearest Neighbors ` in the online documentation for a discussion of the choice of ``algorithm`` and ``leaf_size``. - http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm + https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm """ def __init__(self, radius=1.0, weights='uniform', diff --git a/sklearn/neighbors/regression.py b/sklearn/neighbors/regression.py index 06b956a08ce55..386ed1348ca92 100644 --- a/sklearn/neighbors/regression.py +++ b/sklearn/neighbors/regression.py @@ -112,7 +112,7 @@ class KNeighborsRegressor(NeighborsBase, KNeighborsMixin, but different labels, the results will depend on the ordering of the training data. 
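    A small usage sketch (the data below are made-up values chosen only to
    show the call pattern):

        >>> from sklearn.neighbors import KNeighborsRegressor
        >>> X = [[0], [1], [2], [3]]
        >>> y = [0., 0., 1., 1.]
        >>> neigh = KNeighborsRegressor(n_neighbors=2).fit(X, y)
        >>> pred = neigh.predict([[1.5]])  # the two nearest targets average to 0.5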
- http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm + https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm """ def __init__(self, n_neighbors=5, weights='uniform', @@ -250,7 +250,7 @@ class RadiusNeighborsRegressor(NeighborsBase, RadiusNeighborsMixin, See :ref:`Nearest Neighbors ` in the online documentation for a discussion of the choice of ``algorithm`` and ``leaf_size``. - http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm + https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm """ def __init__(self, radius=1.0, weights='uniform', diff --git a/sklearn/neighbors/unsupervised.py b/sklearn/neighbors/unsupervised.py index 590069b9ed55e..7231c820976a4 100644 --- a/sklearn/neighbors/unsupervised.py +++ b/sklearn/neighbors/unsupervised.py @@ -110,7 +110,7 @@ class NearestNeighbors(NeighborsBase, KNeighborsMixin, See :ref:`Nearest Neighbors ` in the online documentation for a discussion of the choice of ``algorithm`` and ``leaf_size``. - http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm + https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm """ def __init__(self, n_neighbors=5, radius=1.0, diff --git a/sklearn/preprocessing/data.py b/sklearn/preprocessing/data.py index 0eee0cabcb608..3f17839ca7adb 100644 --- a/sklearn/preprocessing/data.py +++ b/sklearn/preprocessing/data.py @@ -940,8 +940,8 @@ class RobustScaler(BaseEstimator, TransformerMixin): ----- See examples/preprocessing/plot_robust_scaling.py for an example. - http://en.wikipedia.org/wiki/Median_(statistics) - http://en.wikipedia.org/wiki/Interquartile_range + https://en.wikipedia.org/wiki/Median_(statistics) + https://en.wikipedia.org/wiki/Interquartile_range """ def __init__(self, with_centering=True, with_scaling=True, copy=True): diff --git a/sklearn/random_projection.py b/sklearn/random_projection.py index 1a1414e3fc823..19235732e71e2 100644 --- a/sklearn/random_projection.py +++ b/sklearn/random_projection.py @@ -12,7 +12,7 @@ The main theoretical result behind the efficiency of random projection is the `Johnson-Lindenstrauss lemma (quoting Wikipedia) -`_: +`_: In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional @@ -110,7 +110,7 @@ def johnson_lindenstrauss_min_dim(n_samples, eps=0.1): References ---------- - .. [1] http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma + .. [1] https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma .. [2] Sanjoy Dasgupta and Anupam Gupta, 1999, "An elementary proof of the Johnson-Lindenstrauss Lemma." @@ -584,7 +584,7 @@ class SparseRandomProjection(BaseRandomProjection): http://www.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf .. [2] D. Achlioptas, 2001, "Database-friendly random projections", - http://www.cs.ucsc.edu/~optas/papers/jl.pdf + https://users.soe.ucsc.edu/~optas/papers/jl.pdf """ def __init__(self, n_components='auto', density='auto', eps=0.1, diff --git a/sklearn/tree/tree.py b/sklearn/tree/tree.py index d33f2fbadcb80..c2ba3f9e91f2b 100644 --- a/sklearn/tree/tree.py +++ b/sklearn/tree/tree.py @@ -649,7 +649,7 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin): References ---------- - .. [1] http://en.wikipedia.org/wiki/Decision_tree_learning + .. [1] https://en.wikipedia.org/wiki/Decision_tree_learning .. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984. 
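A minimal fitting sketch for the classifier documented above (this uses the bundled iris data and is meant only as an illustration)::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> iris = load_iris()
    >>> clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
    >>> clf.predict(iris.data[:1])
    array([0])

The same fit/predict pattern applies to ``DecisionTreeRegressor`` below.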
@@ -880,7 +880,7 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin): References ---------- - .. [1] http://en.wikipedia.org/wiki/Decision_tree_learning + .. [1] https://en.wikipedia.org/wiki/Decision_tree_learning .. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984. diff --git a/sklearn/utils/linear_assignment_.py b/sklearn/utils/linear_assignment_.py index edcedc4dba23f..5282c84e21130 100644 --- a/sklearn/utils/linear_assignment_.py +++ b/sklearn/utils/linear_assignment_.py @@ -50,7 +50,7 @@ def linear_assignment(X): *Journal of the Society of Industrial and Applied Mathematics*, 5(1):32-38, March, 1957. - 5. http://en.wikipedia.org/wiki/Hungarian_algorithm + 5. https://en.wikipedia.org/wiki/Hungarian_algorithm """ indices = _hungarian(X).tolist() indices.sort() From 8bea87efc7f6db484f583c15d36a775d82381ef3 Mon Sep 17 00:00:00 2001 From: Nelson Liu Date: Sun, 21 Feb 2016 14:32:13 -0800 Subject: [PATCH 3/4] chore:remove .orig files --- AUTHORS.rst.orig | 131 -- doc/modules/feature_extraction.rst.orig | 922 --------- doc/modules/linear_model.rst.orig | 1168 ----------- doc/modules/neighbors.rst.orig | 694 ------- .../unsupervised_learning.rst.orig | 327 --- sklearn/metrics/classification.py.orig | 1827 ----------------- 6 files changed, 5069 deletions(-) delete mode 100644 AUTHORS.rst.orig delete mode 100644 doc/modules/feature_extraction.rst.orig delete mode 100644 doc/modules/linear_model.rst.orig delete mode 100644 doc/modules/neighbors.rst.orig delete mode 100644 doc/tutorial/statistical_inference/unsupervised_learning.rst.orig delete mode 100644 sklearn/metrics/classification.py.orig diff --git a/AUTHORS.rst.orig b/AUTHORS.rst.orig deleted file mode 100644 index e2cac7766f207..0000000000000 --- a/AUTHORS.rst.orig +++ /dev/null @@ -1,131 +0,0 @@ -.. -*- mode: rst -*- - - -This is a community effort, and as such many people have contributed -to it over the years. - -History -------- - -This project was started in 2007 as a Google Summer of Code project by -David Cournapeau. Later that year, Matthieu Brucher started work on -this project as part of his thesis. - -In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent -Michel of INRIA took leadership of the project and made the first public -release, February the 1st 2010. Since then, several releases have appeared -following a ~3 month cycle, and a thriving international community has -been leading the development. - -People ------- - -The following people have been core contributors to scikit-learn's development and maintenance: - -.. 
hlist:: - - * `Mathieu Blondel `_ - * `Matthieu Brucher `_ -<<<<<<< HEAD - - * `Fabian Pedregosa `_ - - * `Gael Varoquaux `_ - - * `Jake VanderPlas `_ - - * `Alexandre Gramfort `_ - - * `Olivier Grisel `_ - - * Bertrand Thirion - - * Vincent Michel - - * Chris Filo Gorgolewski - - * `Angel Soler Gollonet `_ - - * `Yaroslav Halchenko `_ - - * Ron Weiss - - * `Virgile Fritsch - `_ - - * `Mathieu Blondel `_ - - * `Peter Prettenhofer - `_ - - * Vincent Dubourg - - * `Alexandre Passos `_ - - * `Vlad Niculae `_ - - * Edouard Duchesnay - - * Thouis (Ray) Jones - - * Lars Buitinck - - * Paolo Losi - - * Nelle Varoquaux - - * `Brian Holt `_ - -======= - * Lars Buitinck - * David Cournapeau - * `Noel Dawe `_ - * Vincent Dubourg - * Edouard Duchesnay - * `Tom Dupré la Tour `_ - * Alexander Fabisch - * `Virgile Fritsch `_ - * `Satra Ghosh `_ - * `Angel Soler Gollonet `_ - * Chris Filo Gorgolewski - * `Alexandre Gramfort `_ - * `Olivier Grisel `_ - * `Jaques Grobler `_ - * `Yaroslav Halchenko `_ - * `Brian Holt `_ - * `Arnaud Joly `_ - * Thouis (Ray) Jones - * `Kyle Kastner `_ - * `Manoj Kumar `_ ->>>>>>> origin/master - * Robert Layton - * `Wei Li `_ - * Paolo Losi - * `Gilles Louppe `_ - * `Jan Hendrik Metzen `_ - * Vincent Michel - * Jarrod Millman - * `Andreas Müller `_ (release manager) -<<<<<<< HEAD - - * `Satra Ghosh `_ - - * `Wei Li `_ - - * `Arnaud Joly `_ - - * `Kemal Eren `_ - - * `Michael Becker `_ -======= - * `Vlad Niculae `_ - * `Joel Nothman `_ - * `Alexandre Passos `_ - * `Fabian Pedregosa `_ - * `Peter Prettenhofer `_ - * Bertrand Thirion - * `Jake VanderPlas `_ - * Nelle Varoquaux - * `Gael Varoquaux `_ - * Ron Weiss ->>>>>>> origin/master diff --git a/doc/modules/feature_extraction.rst.orig b/doc/modules/feature_extraction.rst.orig deleted file mode 100644 index 244f1c21c0f05..0000000000000 --- a/doc/modules/feature_extraction.rst.orig +++ /dev/null @@ -1,922 +0,0 @@ -.. _feature_extraction: - -================== -Feature extraction -================== - -.. currentmodule:: sklearn.feature_extraction - -The :mod:`sklearn.feature_extraction` module can be used to extract -features in a format supported by machine learning algorithms from datasets -consisting of formats such as text and image. - -.. note:: - - Feature extraction is very different from :ref:`feature_selection`: - the former consists in transforming arbitrary data, such as text or - images, into numerical features usable for machine learning. The latter - is a machine learning technique applied on these features. - -.. _dict_feature_extraction: - -Loading features from dicts -=========================== - -The class :class:`DictVectorizer` can be used to convert feature -arrays represented as lists of standard Python ``dict`` objects to the -NumPy/SciPy representation used by scikit-learn estimators. - -While not particularly fast to process, Python's ``dict`` has the -advantages of being convenient to use, being sparse (absent features -need not be stored) and storing feature names in addition to values. - -:class:`DictVectorizer` implements what is called one-of-K or "one-hot" -coding for categorical (aka nominal, discrete) features. Categorical -features are "attribute-value" pairs where the value is restricted -to a list of discrete of possibilities without ordering (e.g. topic -identifiers, types of objects, tags, names...). - -In the following, "city" is a categorical attribute while "temperature" -is a traditional numerical feature:: - - >>> measurements = [ - ... {'city': 'Dubai', 'temperature': 33.}, - ... 
{'city': 'London', 'temperature': 12.}, - ... {'city': 'San Fransisco', 'temperature': 18.}, - ... ] - - >>> from sklearn.feature_extraction import DictVectorizer - >>> vec = DictVectorizer() - - >>> vec.fit_transform(measurements).toarray() - array([[ 1., 0., 0., 33.], - [ 0., 1., 0., 12.], - [ 0., 0., 1., 18.]]) - - >>> vec.get_feature_names() - ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature'] - -:class:`DictVectorizer` is also a useful representation transformation -for training sequence classifiers in Natural Language Processing models -that typically work by extracting feature windows around a particular -word of interest. - -For example, suppose that we have a first algorithm that extracts Part of -Speech (PoS) tags that we want to use as complementary tags for training -a sequence classifier (e.g. a chunker). The following dict could be -such a window of features extracted around the word 'sat' in the sentence -'The cat sat on the mat.':: - - >>> pos_window = [ - ... { - ... 'word-2': 'the', - ... 'pos-2': 'DT', - ... 'word-1': 'cat', - ... 'pos-1': 'NN', - ... 'word+1': 'on', - ... 'pos+1': 'PP', - ... }, - ... # in a real application one would extract many such dictionaries - ... ] - -This description can be vectorized into a sparse two-dimensional matrix -suitable for feeding into a classifier (maybe after being piped into a -:class:`text.TfidfTransformer` for normalization):: - - >>> vec = DictVectorizer() - >>> pos_vectorized = vec.fit_transform(pos_window) - >>> pos_vectorized # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <1x6 sparse matrix of type '<... 'numpy.float64'>' - with 6 stored elements in Compressed Sparse ... format> - >>> pos_vectorized.toarray() - array([[ 1., 1., 1., 1., 1., 1.]]) - >>> vec.get_feature_names() - ['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the'] - -As you can imagine, if one extracts such a context around each individual -word of a corpus of documents the resulting matrix will be very wide -(many one-hot-features) with most of them being valued to zero most -of the time. So as to make the resulting data structure able to fit in -memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by -default instead of a ``numpy.ndarray``. - - -.. _feature_hashing: - -Feature hashing -=============== - -.. currentmodule:: sklearn.feature_extraction - -The class :class:`FeatureHasher` is a high-speed, low-memory vectorizer that -uses a technique known as -`feature hashing `_, -or the "hashing trick". -Instead of building a hash table of the features encountered in training, -as the vectorizers do, instances of :class:`FeatureHasher` -apply a hash function to the features -to determine their column index in sample matrices directly. -The result is increased speed and reduced memory usage, -at the expense of inspectability; -the hasher does not remember what the input features looked like -and has no ``inverse_transform`` method. - -Since the hash function might cause collisions between (unrelated) features, -a signed hash function is used and the sign of the hash value -determines the sign of the value stored in the output matrix for a feature. -This way, collisions are likely to cancel out rather than accumulate error, -and the expected mean of any output feature's value is zero. - -If ``non_negative=True`` is passed to the constructor, the absolute -value is taken. 
This undoes some of the collision handling, but allows -the output to be passed to estimators like -:class:`sklearn.naive_bayes.MultinomialNB` or -:class:`sklearn.feature_selection.chi2` -feature selectors that expect non-negative inputs. - -:class:`FeatureHasher` accepts either mappings -(like Python's ``dict`` and its variants in the ``collections`` module), -``(feature, value)`` pairs, or strings, -depending on the constructor parameter ``input_type``. -Mapping are treated as lists of ``(feature, value)`` pairs, -while single strings have an implicit value of 1, -so ``['feat1', 'feat2', 'feat3']`` is interpreted as -``[('feat1', 1), ('feat2', 1), ('feat3', 1)]``. -If a single feature occurs multiple times in a sample, -the associated values will be summed -(so ``('feat', 2)`` and ``('feat', 3.5)`` become ``('feat', 5.5)``). -The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix -in the CSR format. - -Feature hashing can be employed in document classification, -but unlike :class:`text.CountVectorizer`, -:class:`FeatureHasher` does not do word -splitting or any other preprocessing except Unicode-to-UTF-8 encoding; -see :ref:`hashing_vectorizer`, below, for a combined tokenizer/hasher. - -As an example, consider a word-level natural language processing task -that needs features extracted from ``(token, part_of_speech)`` pairs. -One could use a Python generator function to extract features:: - - def token_features(token, part_of_speech): - if token.isdigit(): - yield "numeric" - else: - yield "token={}".format(token.lower()) - yield "token,pos={},{}".format(token, part_of_speech) - if token[0].isupper(): - yield "uppercase_initial" - if token.isupper(): - yield "all_uppercase" - yield "pos={}".format(part_of_speech) - -Then, the ``raw_X`` to be fed to ``FeatureHasher.transform`` -can be constructed using:: - - raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus) - -and fed to a hasher with:: - - hasher = FeatureHasher(input_type='string') - X = hasher.transform(raw_X) - -to get a ``scipy.sparse`` matrix ``X``. - -Note the use of a generator comprehension, -which introduces laziness into the feature extraction: -tokens are only processed on demand from the hasher. - -Implementation details ----------------------- - -:class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3. -As a result (and because of limitations in ``scipy.sparse``), -the maximum number of features supported is currently :math:`2^{31} - 1`. - -The original formulation of the hashing trick by Weinberger et al. -used two separate hash functions :math:`h` and :math:`\xi` -to determine the column index and sign of a feature, respectively. -The present implementation works under the assumption -that the sign bit of MurmurHash3 is independent of its other bits. - -Since a simple modulo is used to transform the hash function to a column index, -it is advisable to use a power of two as the ``n_features`` parameter; -otherwise the features will not be mapped evenly to the columns. - - -.. topic:: References: - - * Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and - Josh Attenberg (2009). `Feature hashing for large scale multitask learning - `_. Proc. ICML. - - * `MurmurHash3 `_. - - -.. _text_feature_extraction: - -Text feature extraction -======================= - -.. currentmodule:: sklearn.feature_extraction.text - - -The Bag of Words representation -------------------------------- - -Text Analysis is a major application field for machine learning -algorithms. 
However the raw data, a sequence of symbols cannot be fed -directly to the algorithms themselves as most of them expect numerical -feature vectors with a fixed size rather than the raw text documents -with variable length. - -In order to address this, scikit-learn provides utilities for the most -common ways to extract numerical features from text content, namely: - -- **tokenizing** strings and giving an integer id for each possible token, - for instance by using white-spaces and punctuation as token separators. - -- **counting** the occurrences of tokens in each document. - -- **normalizing** and weighting with diminishing importance tokens that - occur in the majority of samples / documents. - -In this scheme, features and samples are defined as follows: - -- each **individual token occurrence frequency** (normalized or not) - is treated as a **feature**. - -- the vector of all the token frequencies for a given **document** is - considered a multivariate **sample**. - -A corpus of documents can thus be represented by a matrix with one row -per document and one column per token (e.g. word) occurring in the corpus. - -We call **vectorization** the general process of turning a collection -of text documents into numerical feature vectors. This specific strategy -(tokenization, counting and normalization) is called the **Bag of Words** -or "Bag of n-grams" representation. Documents are described by word -occurrences while completely ignoring the relative position information -of the words in the document. - - -Sparsity --------- - -As most documents will typically use a very small subset of the words used in -the corpus, the resulting matrix will have many feature values that are -zeros (typically more than 99% of them). - -For instance a collection of 10,000 short text documents (such as emails) -will use a vocabulary with a size in the order of 100,000 unique words in -total while each document will use 100 to 1000 unique words individually. - -In order to be able to store such a matrix in memory but also to speed -up algebraic operations matrix / vector, implementations will typically -use a sparse representation such as the implementations available in the -``scipy.sparse`` package. - - -Common Vectorizer usage ------------------------ - -:class:`CountVectorizer` implements both tokenization and occurrence -counting in a single class:: - - >>> from sklearn.feature_extraction.text import CountVectorizer - -This model has many parameters, however the default values are quite -reasonable (please see the :ref:`reference documentation -` for the details):: - - >>> vectorizer = CountVectorizer(min_df=1) - >>> vectorizer # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict', - dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content', - lowercase=True, max_df=1.0, max_features=None, min_df=1, - ngram_range=(1, 1), preprocessor=None, stop_words=None, - strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b', - tokenizer=None, vocabulary=None) - -Let's use it to tokenize and count the word occurrences of a minimalistic -corpus of text documents:: - - >>> corpus = [ - ... 'This is the first document.', - ... 'This is the second second document.', - ... 'And the third one.', - ... 'Is this the first document?', - ... ] - >>> X = vectorizer.fit_transform(corpus) - >>> X # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <4x9 sparse matrix of type '<... 'numpy.int64'>' - with 19 stored elements in Compressed Sparse ... 
format> - -The default configuration tokenizes the string by extracting words of -at least 2 letters. The specific function that does this step can be -requested explicitly:: - - >>> analyze = vectorizer.build_analyzer() - >>> analyze("This is a text document to analyze.") == ( - ... ['this', 'is', 'text', 'document', 'to', 'analyze']) - True - -Each term found by the analyzer during the fit is assigned a unique -integer index corresponding to a column in the resulting matrix. This -interpretation of the columns can be retrieved as follows:: - - >>> vectorizer.get_feature_names() == ( - ... ['and', 'document', 'first', 'is', 'one', - ... 'second', 'the', 'third', 'this']) - True - - >>> X.toarray() # doctest: +ELLIPSIS - array([[0, 1, 1, 1, 0, 0, 1, 0, 1], - [0, 1, 0, 1, 0, 2, 1, 0, 1], - [1, 0, 0, 0, 1, 0, 1, 1, 0], - [0, 1, 1, 1, 0, 0, 1, 0, 1]]...) - -The converse mapping from feature name to column index is stored in the -``vocabulary_`` attribute of the vectorizer:: - - >>> vectorizer.vocabulary_.get('document') - 1 - -Hence words that were not seen in the training corpus will be completely -ignored in future calls to the transform method:: - - >>> vectorizer.transform(['Something completely new.']).toarray() - ... # doctest: +ELLIPSIS - array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...) - -Note that in the previous corpus, the first and the last documents have -exactly the same words hence are encoded in equal vectors. In particular -we lose the information that the last document is an interrogative form. To -preserve some of the local ordering information we can extract 2-grams -of words in addition to the 1-grams (individual words):: - - >>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), - ... token_pattern=r'\b\w+\b', min_df=1) - >>> analyze = bigram_vectorizer.build_analyzer() - >>> analyze('Bi-grams are cool!') == ( - ... ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']) - True - -The vocabulary extracted by this vectorizer is hence much bigger and -can now resolve ambiguities encoded in local positioning patterns:: - - >>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray() - >>> X_2 - ... # doctest: +ELLIPSIS - array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0], - [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0], - [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0], - [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...) - - -In particular the interrogative form "Is this" is only present in the -last document:: - - >>> feature_index = bigram_vectorizer.vocabulary_.get('is this') - >>> X_2[:, feature_index] # doctest: +ELLIPSIS - array([0, 0, 0, 1]...) - - -.. _tfidf: - -Tf–idf term weighting ---------------------- - -In a large text corpus, some words will be very present (e.g. "the", "a", -"is" in English) hence carrying very little meaningful information about -the actual contents of the document. If we were to feed the direct count -data directly to a classifier those very frequent terms would shadow -the frequencies of rarer yet more interesting terms. - -In order to re-weight the count features into floating point values -suitable for usage by a classifier it is very common to use the tf–idf -transform. - -Tf means **term-frequency** while tf–idf means term-frequency times -**inverse document-frequency**. 
This was originally a term weighting -scheme developed for information retrieval (as a ranking function -for search engines results), that has also found good use in document -classification and clustering. - -This normalization is implemented by the :class:`TfidfTransformer` -class:: - - >>> from sklearn.feature_extraction.text import TfidfTransformer - >>> transformer = TfidfTransformer() - >>> transformer # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - TfidfTransformer(norm=...'l2', smooth_idf=True, sublinear_tf=False, - use_idf=True) - -Again please see the :ref:`reference documentation -` for the details on all the parameters. - -Let's take an example with the following counts. The first term is present -100% of the time hence not very interesting. The two other features only -in less than 50% of the time hence probably more representative of the -content of the documents:: - - >>> counts = [[3, 0, 1], - ... [2, 0, 0], - ... [3, 0, 0], - ... [4, 0, 0], - ... [3, 2, 0], - ... [3, 0, 2]] - ... - >>> tfidf = transformer.fit_transform(counts) - >>> tfidf # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <6x3 sparse matrix of type '<... 'numpy.float64'>' - with 9 stored elements in Compressed Sparse ... format> - - >>> tfidf.toarray() # doctest: +ELLIPSIS - array([[ 0.85..., 0. ..., 0.52...], - [ 1. ..., 0. ..., 0. ...], - [ 1. ..., 0. ..., 0. ...], - [ 1. ..., 0. ..., 0. ...], - [ 0.55..., 0.83..., 0. ...], - [ 0.63..., 0. ..., 0.77...]]) - -Each row is normalized to have unit euclidean norm. The weights of each -feature computed by the ``fit`` method call are stored in a model -attribute:: - - >>> transformer.idf_ # doctest: +ELLIPSIS - array([ 1. ..., 2.25..., 1.84...]) - - -As tf–idf is very often used for text features, there is also another -class called :class:`TfidfVectorizer` that combines all the options of -:class:`CountVectorizer` and :class:`TfidfTransformer` in a single model:: - - >>> from sklearn.feature_extraction.text import TfidfVectorizer - >>> vectorizer = TfidfVectorizer(min_df=1) - >>> vectorizer.fit_transform(corpus) - ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <4x9 sparse matrix of type '<... 'numpy.float64'>' - with 19 stored elements in Compressed Sparse ... format> - -While the tf–idf normalization is often very useful, there might -be cases where the binary occurrence markers might offer better -features. This can be achieved by using the ``binary`` parameter -of :class:`CountVectorizer`. In particular, some estimators such as -:ref:`bernoulli_naive_bayes` explicitly model discrete boolean random -variables. Also, very short texts are likely to have noisy tf–idf values -while the binary occurrence info is more stable. - -As usual the best way to adjust the feature extraction parameters -is to use a cross-validated grid search, for instance by pipelining the -feature extractor with a classifier: - - * :ref:`example_model_selection_grid_search_text_feature_extraction.py` - - -Decoding text files -------------------- -Text is made of characters, but files are made of bytes. These bytes represent -characters according to some *encoding*. To work with text files in Python, -their bytes must be *decoded* to a character set called Unicode. -Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) -and the universal encodings UTF-8 and UTF-16. Many others exist. - -.. note:: - An encoding can also be called a 'character set', - but this term is less accurate: several encodings can exist - for a single character set. 
- -The text feature extractors in scikit-learn know how to decode text files, -but only if you tell them what encoding the files are in. -The :class:`CountVectorizer` takes an ``encoding`` parameter for this purpose. -For modern text files, the correct encoding is probably UTF-8, -which is therefore the default (``encoding="utf-8"``). - -If the text you are loading is not actually encoded with UTF-8, however, -you will get a ``UnicodeDecodeError``. -The vectorizers can be told to be silent about decoding errors -by setting the ``decode_error`` parameter to either ``"ignore"`` -or ``"replace"``. See the documentation for the Python function -``bytes.decode`` for more details -(type ``help(bytes.decode)`` at the Python prompt). - -If you are having trouble decoding text, here are some things to try: - -- Find out what the actual encoding of the text is. The file might come - with a header or README that tells you the encoding, or there might be some - standard encoding you can assume based on where the text comes from. - -- You may be able to find out what kind of encoding it is in general - using the UNIX command ``file``. The Python ``chardet`` module comes with - a script called ``chardetect.py`` that will guess the specific encoding, - though you cannot rely on its guess being correct. - -- You could try UTF-8 and disregard the errors. You can decode byte - strings with ``bytes.decode(errors='replace')`` to replace all - decoding errors with a meaningless character, or set - ``decode_error='replace'`` in the vectorizer. This may damage the - usefulness of your features. - -- Real text may come from a variety of sources that may have used different - encodings, or even be sloppily decoded in a different encoding than the - one it was encoded with. This is common in text retrieved from the Web. - The Python package `ftfy`_ can automatically sort out some classes of - decoding errors, so you could try decoding the unknown text as ``latin-1`` - and then using ``ftfy`` to fix errors. - -- If the text is in a mish-mash of encodings that is simply too hard to sort - out (which is the case for the 20 Newsgroups dataset), you can fall back on - a simple single-byte encoding such as ``latin-1``. Some text may display - incorrectly, but at least the same sequence of bytes will always represent - the same feature. - -For example, the following snippet uses ``chardet`` -(not shipped with scikit-learn, must be installed separately) -to figure out the encoding of three texts. -It then vectorizes the texts and prints the learned vocabulary. -The output is not shown here. - - >>> import chardet # doctest: +SKIP - >>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut" - >>> text2 = b"holdselig sind deine Ger\xfcche" - >>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00" - >>> decoded = [x.decode(chardet.detect(x)['encoding']) - ... for x in (text1, text2, text3)] # doctest: +SKIP - >>> v = CountVectorizer().fit(decoded).vocabulary_ # doctest: +SKIP - >>> for term in v: print(v) # doctest: +SKIP - -(Depending on the version of ``chardet``, it might get the first one wrong.) - -For an introduction to Unicode and character encodings in general, -see Joel Spolsky's `Absolute Minimum Every Software Developer Must Know -About Unicode `_. - -.. 
_`ftfy`: https://github.com/LuminosoInsight/python-ftfy - - -Applications and examples -------------------------- - -The bag of words representation is quite simplistic but surprisingly -useful in practice. - -In particular in a **supervised setting** it can be successfully combined -with fast and scalable linear models to train **document classifiers**, -for instance: - - * :ref:`example_text_document_classification_20newsgroups.py` - -In an **unsupervised setting** it can be used to group similar documents -together by applying clustering algorithms such as :ref:`k_means`: - - * :ref:`example_text_document_clustering.py` - -Finally it is possible to discover the main topics of a corpus by -relaxing the hard assignment constraint of clustering, for instance by -using :ref:`NMF`: - - * :ref:`example_applications_topics_extraction_with_nmf_lda.py` - - -Limitations of the Bag of Words representation ----------------------------------------------- - -A collection of unigrams (what bag of words is) cannot capture phrases -and multi-word expressions, effectively disregarding any word order -dependence. Additionally, the bag of words model doesn't account for potential -misspellings or word derivations. - -N-grams to the rescue! Instead of building a simple collection of -unigrams (n=1), one might prefer a collection of bigrams (n=2), where -occurrences of pairs of consecutive words are counted. - -One might alternatively consider a collection of character n-grams, a -representation resilient against misspellings and derivations. - -For example, let's say we're dealing with a corpus of two documents: -``['words', 'wprds']``. The second document contains a misspelling -of the word 'words'. -A simple bag of words representation would consider these two as -very distinct documents, differing in both of the two possible features. -A character 2-gram representation, however, would find the documents -matching in 4 out of 8 features, which may help the preferred classifier -decide better:: - - >>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1) - >>> counts = ngram_vectorizer.fit_transform(['words', 'wprds']) - >>> ngram_vectorizer.get_feature_names() == ( - ... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']) - True - >>> counts.toarray().astype(int) - array([[1, 1, 1, 0, 1, 1, 1, 0], - [1, 1, 0, 1, 1, 1, 0, 1]]) - -In the above example, ``'char_wb`` analyzer is used, which creates n-grams -only from characters inside word boundaries (padded with space on each -side). The ``'char'`` analyzer, alternatively, creates n-grams that -span across words:: - - >>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1) - >>> ngram_vectorizer.fit_transform(['jumpy fox']) - ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <1x4 sparse matrix of type '<... 'numpy.int64'>' - with 4 stored elements in Compressed Sparse ... format> - >>> ngram_vectorizer.get_feature_names() == ( - ... [' fox ', ' jump', 'jumpy', 'umpy ']) - True - - >>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1) - >>> ngram_vectorizer.fit_transform(['jumpy fox']) - ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <1x5 sparse matrix of type '<... 'numpy.int64'>' - with 5 stored elements in Compressed Sparse ... format> - >>> ngram_vectorizer.get_feature_names() == ( - ... 
['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']) - True - -The word boundaries-aware variant ``char_wb`` is especially interesting -for languages that use white-spaces for word separation as it generates -significantly less noisy features than the raw ``char`` variant in -that case. For such languages it can increase both the predictive -accuracy and convergence speed of classifiers trained using such -features while retaining the robustness with regards to misspellings and -word derivations. - -While some local positioning information can be preserved by extracting -n-grams instead of individual words, bag of words and bag of n-grams -destroy most of the inner structure of the document and hence most of -the meaning carried by that internal structure. - -In order to address the wider task of Natural Language Understanding, -the local structure of sentences and paragraphs should thus be taken -into account. Many such models will thus be casted as "Structured output" -problems which are currently outside of the scope of scikit-learn. - - -.. _hashing_vectorizer: - -Vectorizing a large text corpus with the hashing trick ------------------------------------------------------- - -The above vectorization scheme is simple but the fact that it holds an **in- -memory mapping from the string tokens to the integer feature indices** (the -``vocabulary_`` attribute) causes several **problems when dealing with large -datasets**: - -- the larger the corpus, the larger the vocabulary will grow and hence the - memory use too, - -- fitting requires the allocation of intermediate data structures - of size proportional to that of the original dataset. - -- building the word-mapping requires a full pass over the dataset hence it is - not possible to fit text classifiers in a strictly online manner. - -- pickling and un-pickling vectorizers with a large ``vocabulary_`` can be very - slow (typically much slower than pickling / un-pickling flat data structures - such as a NumPy array of the same size), - -- it is not easily possible to split the vectorization work into concurrent sub - tasks as the ``vocabulary_`` attribute would have to be a shared state with a - fine grained synchronization barrier: the mapping from token string to - feature index is dependent on ordering of the first occurrence of each token - hence would have to be shared, potentially harming the concurrent workers' - performance to the point of making them slower than the sequential variant. - -It is possible to overcome those limitations by combining the "hashing trick" -(:ref:`Feature_hashing`) implemented by the -:class:`sklearn.feature_extraction.FeatureHasher` class and the text -preprocessing and tokenization features of the :class:`CountVectorizer`. - -This combination is implementing in :class:`HashingVectorizer`, -a transformer class that is mostly API compatible with :class:`CountVectorizer`. -:class:`HashingVectorizer` is stateless, -meaning that you don't have to call ``fit`` on it:: - - >>> from sklearn.feature_extraction.text import HashingVectorizer - >>> hv = HashingVectorizer(n_features=10) - >>> hv.transform(corpus) - ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <4x10 sparse matrix of type '<... 'numpy.float64'>' - with 16 stored elements in Compressed Sparse ... format> - -You can see that 16 non-zero feature tokens were extracted in the vector -output: this is less than the 19 non-zeros extracted previously by the -:class:`CountVectorizer` on the same toy corpus. 
The discrepancy comes from -hash function collisions because of the low value of the ``n_features`` parameter. - -In a real world setting, the ``n_features`` parameter can be left to its -default value of ``2 ** 20`` (roughly one million possible features). If memory -or downstream models size is an issue selecting a lower value such as ``2 ** -18`` might help without introducing too many additional collisions on typical -text classification tasks. - -Note that the dimensionality does not affect the CPU training time of -algorithms which operate on CSR matrices (``LinearSVC(dual=True)``, -``Perceptron``, ``SGDClassifier``, ``PassiveAggressive``) but it does for -algorithms that work with CSC matrices (``LinearSVC(dual=False)``, ``Lasso()``, -etc). - -Let's try again with the default setting:: - - >>> hv = HashingVectorizer() - >>> hv.transform(corpus) - ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS - <4x1048576 sparse matrix of type '<... 'numpy.float64'>' - with 19 stored elements in Compressed Sparse ... format> - -We no longer get the collisions, but this comes at the expense of a much larger -dimensionality of the output space. -Of course, other terms than the 19 used here -might still collide with each other. - -The :class:`HashingVectorizer` also comes with the following limitations: - -- it is not possible to invert the model (no ``inverse_transform`` method), - nor to access the original string representation of the features, - because of the one-way nature of the hash function that performs the mapping. - -- it does not provide IDF weighting as that would introduce statefulness in the - model. A :class:`TfidfTransformer` can be appended to it in a pipeline if - required. - -Performing out-of-core scaling with HashingVectorizer ------------------------------------------------------- - -An interesting development of using a :class:`HashingVectorizer` is the ability -to perform `out-of-core`_ scaling. This means that we can learn from data that -does not fit into the computer's main memory. - -<<<<<<< HEAD -.. _out-of-core: https://en.wikipedia.org/wiki/Out-of-core_algorithm -======= -.. _out-of-core: http://en.wikipedia.org/wiki/Out-of-core_algorithm ->>>>>>> origin/master - -A strategy to implement out-of-core scaling is to stream data to the estimator -in mini-batches. Each mini-batch is vectorized using :class:`HashingVectorizer` -so as to guarantee that the input space of the estimator has always the same -dimensionality. The amount of memory used at any time is thus bounded by the -size of a mini-batch. Although there is no limit to the amount of data that can -be ingested using such an approach, from a practical point of view the learning -time is often limited by the CPU time one wants to spend on the task. - -For a full-fledged example of out-of-core scaling in a text classification -task see :ref:`example_applications_plot_out_of_core_classification.py`. - -Customizing the vectorizer classes ----------------------------------- - -It is possible to customize the behavior by passing a callable -to the vectorizer constructor:: - - >>> def my_tokenizer(s): - ... return s.split() - ... - >>> vectorizer = CountVectorizer(tokenizer=my_tokenizer) - >>> vectorizer.build_analyzer()(u"Some... punctuation!") == ( - ... ['some...', 'punctuation!']) - True - -In particular we name: - - * ``preprocessor``: a callable that takes an entire document as input (as a - single string), and returns a possibly transformed version of the document, - still as an entire string. 
This can be used to remove HTML tags, lowercase - the entire document, etc. - - * ``tokenizer``: a callable that takes the output from the preprocessor - and splits it into tokens, then returns a list of these. - - * ``analyzer``: a callable that replaces the preprocessor and tokenizer. - The default analyzers all call the preprocessor and tokenizer, but custom - analyzers will skip this. N-gram extraction and stop word filtering take - place at the analyzer level, so a custom analyzer may have to reproduce - these steps. - -(Lucene users might recognize these names, but be aware that scikit-learn -concepts may not map one-to-one onto Lucene concepts.) - -To make the preprocessor, tokenizer and analyzers aware of the model -parameters it is possible to derive from the class and override the -``build_preprocessor``, ``build_tokenizer``` and ``build_analyzer`` -factory methods instead of passing custom functions. - -Some tips and tricks: - - * If documents are pre-tokenized by an external package, then store them in - files (or strings) with the tokens separated by whitespace and pass - ``analyzer=str.split`` - * Fancy token-level analysis such as stemming, lemmatizing, compound - splitting, filtering based on part-of-speech, etc. are not included in the - scikit-learn codebase, but can be added by customizing either the - tokenizer or the analyzer. - Here's a ``CountVectorizer`` with a tokenizer and lemmatizer using - `NLTK `_:: - - >>> from nltk import word_tokenize # doctest: +SKIP - >>> from nltk.stem import WordNetLemmatizer # doctest: +SKIP - >>> class LemmaTokenizer(object): - ... def __init__(self): - ... self.wnl = WordNetLemmatizer() - ... def __call__(self, doc): - ... return [self.wnl.lemmatize(t) for t in word_tokenize(doc)] - ... - >>> vect = CountVectorizer(tokenizer=LemmaTokenizer()) # doctest: +SKIP - - (Note that this will not filter out punctuation.) - -Customizing the vectorizer can also be useful when handling Asian languages -that do not use an explicit word separator such as whitespace. - -.. _image_feature_extraction: - -Image feature extraction -======================== - -.. currentmodule:: sklearn.feature_extraction.image - -Patch extraction ----------------- - -The :func:`extract_patches_2d` function extracts patches from an image stored -as a two-dimensional array, or three-dimensional with color information along -the third axis. For rebuilding an image from all its patches, use -:func:`reconstruct_from_patches_2d`. For example let use generate a 4x4 pixel -picture with 3 color channels (e.g. in RGB format):: - - >>> import numpy as np - >>> from sklearn.feature_extraction import image - - >>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3)) - >>> one_image[:, :, 0] # R channel of a fake RGB picture - array([[ 0, 3, 6, 9], - [12, 15, 18, 21], - [24, 27, 30, 33], - [36, 39, 42, 45]]) - - >>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2, - ... 
random_state=0) - >>> patches.shape - (2, 2, 2, 3) - >>> patches[:, :, :, 0] - array([[[ 0, 3], - [12, 15]], - - [[15, 18], - [27, 30]]]) - >>> patches = image.extract_patches_2d(one_image, (2, 2)) - >>> patches.shape - (9, 2, 2, 3) - >>> patches[4, :, :, 0] - array([[15, 18], - [27, 30]]) - -Let us now try to reconstruct the original image from the patches by averaging -on overlapping areas:: - - >>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3)) - >>> np.testing.assert_array_equal(one_image, reconstructed) - -The :class:`PatchExtractor` class works in the same way as -:func:`extract_patches_2d`, only it supports multiple images as input. It is -implemented as an estimator, so it can be used in pipelines. See:: - - >>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3) - >>> patches = image.PatchExtractor((2, 2)).transform(five_images) - >>> patches.shape - (45, 2, 2, 3) - -Connectivity graph of an image -------------------------------- - -Several estimators in the scikit-learn can use connectivity information between -features or samples. For instance Ward clustering -(:ref:`hierarchical_clustering`) can cluster together only neighboring pixels -of an image, thus forming contiguous patches: - -.. figure:: ../auto_examples/cluster/images/plot_face_ward_segmentation_001.png - :target: ../auto_examples/cluster/plot_face_ward_segmentation.html - :align: center - :scale: 40 - -For this purpose, the estimators use a 'connectivity' matrix, giving -which samples are connected. - -The function :func:`img_to_graph` returns such a matrix from a 2D or 3D -image. Similarly, :func:`grid_to_graph` build a connectivity matrix for -images given the shape of these image. - -These matrices can be used to impose connectivity in estimators that use -connectivity information, such as Ward clustering -(:ref:`hierarchical_clustering`), but also to build precomputed kernels, -or similarity matrices. - -.. note:: **Examples** - - * :ref:`example_cluster_plot_face_ward_segmentation.py` - - * :ref:`example_cluster_plot_segmentation_toy.py` - - * :ref:`example_cluster_plot_feature_agglomeration_vs_univariate_selection.py` diff --git a/doc/modules/linear_model.rst.orig b/doc/modules/linear_model.rst.orig deleted file mode 100644 index 82e1ea5e27b1d..0000000000000 --- a/doc/modules/linear_model.rst.orig +++ /dev/null @@ -1,1168 +0,0 @@ -.. _linear_model: - -========================= -Generalized Linear Models -========================= - -.. currentmodule:: sklearn.linear_model - -The following are a set of methods intended for regression in which -the target value is expected to be a linear combination of the input -variables. In mathematical notion, if :math:`\hat{y}` is the predicted -value. - -.. math:: \hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p - -Across the module, we designate the vector :math:`w = (w_1, -..., w_p)` as ``coef_`` and :math:`w_0` as ``intercept_``. - -To perform classification with generalized linear models, see -:ref:`Logistic_regression`. - - -.. _ordinary_least_squares: - -Ordinary Least Squares -======================= - -:class:`LinearRegression` fits a linear model with coefficients -:math:`w = (w_1, ..., w_p)` to minimize the residual sum -of squares between the observed responses in the dataset, and the -responses predicted by the linear approximation. Mathematically it -solves a problem of the form: - -.. math:: \underset{w}{min\,} {|| X w - y||_2}^2 - -.. 
figure:: ../auto_examples/linear_model/images/plot_ols_001.png - :target: ../auto_examples/linear_model/plot_ols.html - :align: center - :scale: 50% - -:class:`LinearRegression` will take in its ``fit`` method arrays X, y -and will store the coefficients :math:`w` of the linear model in its -``coef_`` member:: - - >>> from sklearn import linear_model - >>> clf = linear_model.LinearRegression() - >>> clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2]) - LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) - >>> clf.coef_ - array([ 0.5, 0.5]) - -However, coefficient estimates for Ordinary Least Squares rely on the -independence of the model terms. When terms are correlated and the -columns of the design matrix :math:`X` have an approximate linear -dependence, the design matrix becomes close to singular -and as a result, the least-squares estimate becomes highly sensitive -to random errors in the observed response, producing a large -variance. This situation of *multicollinearity* can arise, for -example, when data are collected without an experimental design. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_ols.py` - - -Ordinary Least Squares Complexity ---------------------------------- - -This method computes the least squares solution using a singular value -decomposition of X. If X is a matrix of size (n, p) this method has a -cost of :math:`O(n p^2)`, assuming that :math:`n \geq p`. - -.. _ridge_regression: - -Ridge Regression -================ - -:class:`Ridge` regression addresses some of the problems of -:ref:`ordinary_least_squares` by imposing a penalty on the size of -coefficients. The ridge coefficients minimize a penalized residual sum -of squares, - - -.. math:: - - \underset{w}{min\,} {{|| X w - y||_2}^2 + \alpha {||w||_2}^2} - - -Here, :math:`\alpha \geq 0` is a complexity parameter that controls the amount -of shrinkage: the larger the value of :math:`\alpha`, the greater the amount -of shrinkage and thus the coefficients become more robust to collinearity. - -.. figure:: ../auto_examples/linear_model/images/plot_ridge_path_001.png - :target: ../auto_examples/linear_model/plot_ridge_path.html - :align: center - :scale: 50% - - -As with other linear models, :class:`Ridge` will take in its ``fit`` method -arrays X, y and will store the coefficients :math:`w` of the linear model in -its ``coef_`` member:: - - >>> from sklearn import linear_model - >>> clf = linear_model.Ridge (alpha = .5) - >>> clf.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) # doctest: +NORMALIZE_WHITESPACE - Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None, - normalize=False, random_state=None, solver='auto', tol=0.001) - >>> clf.coef_ - array([ 0.34545455, 0.34545455]) - >>> clf.intercept_ #doctest: +ELLIPSIS - 0.13636... - - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_ridge_path.py` - * :ref:`example_text_document_classification_20newsgroups.py` - - -Ridge Complexity ----------------- - -This method has the same order of complexity than an -:ref:`ordinary_least_squares`. - -.. FIXME: -.. Not completely true: OLS is solved by an SVD, while Ridge is solved by -.. the method of normal equations (Cholesky), there is a big flop difference -.. between these - - -Setting the regularization parameter: generalized Cross-Validation ------------------------------------------------------------------- - -:class:`RidgeCV` implements ridge regression with built-in -cross-validation of the alpha parameter. 
The object works in the same way -as GridSearchCV except that it defaults to Generalized Cross-Validation -(GCV), an efficient form of leave-one-out cross-validation:: - - >>> from sklearn import linear_model - >>> clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0]) - >>> clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) # doctest: +SKIP - RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, scoring=None, - normalize=False) - >>> clf.alpha_ # doctest: +SKIP - 0.1 - -.. topic:: References - - * "Notes on Regularized Least Squares", Rifkin & Lippert (`technical report - `_, - `course slides - `_). - - -.. _lasso: - -Lasso -===== - -The :class:`Lasso` is a linear model that estimates sparse coefficients. -It is useful in some contexts due to its tendency to prefer solutions -with fewer parameter values, effectively reducing the number of variables -upon which the given solution is dependent. For this reason, the Lasso -and its variants are fundamental to the field of compressed sensing. -Under certain conditions, it can recover the exact set of non-zero -weights (see -:ref:`example_applications_plot_tomography_l1_reconstruction.py`). - -Mathematically, it consists of a linear model trained with :math:`\ell_1` prior -as regularizer. The objective function to minimize is: - -.. math:: \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha ||w||_1} - -The lasso estimate thus solves the minimization of the -least-squares penalty with :math:`\alpha ||w||_1` added, where -:math:`\alpha` is a constant and :math:`||w||_1` is the :math:`\ell_1`-norm of -the parameter vector. - -The implementation in the class :class:`Lasso` uses coordinate descent as -the algorithm to fit the coefficients. See :ref:`least_angle_regression` -for another implementation:: - - >>> from sklearn import linear_model - >>> clf = linear_model.Lasso(alpha = 0.1) - >>> clf.fit([[0, 0], [1, 1]], [0, 1]) - Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000, - normalize=False, positive=False, precompute=False, random_state=None, - selection='cyclic', tol=0.0001, warm_start=False) - >>> clf.predict([[1, 1]]) - array([ 0.8]) - -Also useful for lower-level tasks is the function :func:`lasso_path` that -computes the coefficients along the full path of possible values. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_lasso_and_elasticnet.py` - * :ref:`example_applications_plot_tomography_l1_reconstruction.py` - - -.. note:: **Feature selection with Lasso** - - As the Lasso regression yields sparse models, it can - thus be used to perform feature selection, as detailed in - :ref:`l1_feature_selection`. - -.. note:: **Randomized sparsity** - - For feature selection or sparse recovery, it may be interesting to - use :ref:`randomized_l1`. - - -Setting regularization parameter --------------------------------- - -The ``alpha`` parameter controls the degree of sparsity of the coefficients -estimated. - -Using cross-validation -^^^^^^^^^^^^^^^^^^^^^^^ - -scikit-learn exposes objects that set the Lasso ``alpha`` parameter by -cross-validation: :class:`LassoCV` and :class:`LassoLarsCV`. -:class:`LassoLarsCV` is based on the :ref:`least_angle_regression` algorithm -explained below. - -For high-dimensional datasets with many collinear regressors, -:class:`LassoCV` is most often preferable. 
However, :class:`LassoLarsCV` has -the advantage of exploring more relevant values of `alpha` parameter, and -if the number of samples is very small compared to the number of -observations, it is often faster than :class:`LassoCV`. - -.. |lasso_cv_1| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_002.png - :target: ../auto_examples/linear_model/plot_lasso_model_selection.html - :scale: 48% - -.. |lasso_cv_2| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_003.png - :target: ../auto_examples/linear_model/plot_lasso_model_selection.html - :scale: 48% - -.. centered:: |lasso_cv_1| |lasso_cv_2| - - -Information-criteria based model selection -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Alternatively, the estimator :class:`LassoLarsIC` proposes to use the -Akaike information criterion (AIC) and the Bayes Information criterion (BIC). -It is a computationally cheaper alternative to find the optimal value of alpha -as the regularization path is computed only once instead of k+1 times -when using k-fold cross-validation. However, such criteria needs a -proper estimation of the degrees of freedom of the solution, are -derived for large samples (asymptotic results) and assume the model -is correct, i.e. that the data are actually generated by this model. -They also tend to break when the problem is badly conditioned -(more features than samples). - -.. figure:: ../auto_examples/linear_model/images/plot_lasso_model_selection_001.png - :target: ../auto_examples/linear_model/plot_lasso_model_selection.html - :align: center - :scale: 50% - - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_lasso_model_selection.py` - - -.. _multi_task_lasso: - -Multi-task Lasso -================ - -The :class:`MultiTaskLasso` is a linear model that estimates sparse -coefficients for multiple regression problems jointly: ``y`` is a 2D array, -of shape ``(n_samples, n_tasks)``. The constraint is that the selected -features are the same for all the regression problems, also called tasks. - -The following figure compares the location of the non-zeros in W obtained -with a simple Lasso or a MultiTaskLasso. The Lasso estimates yields -scattered non-zeros while the non-zeros of the MultiTaskLasso are full -columns. - -.. |multi_task_lasso_1| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_001.png - :target: ../auto_examples/linear_model/plot_multi_task_lasso_support.html - :scale: 48% - -.. |multi_task_lasso_2| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_002.png - :target: ../auto_examples/linear_model/plot_multi_task_lasso_support.html - :scale: 48% - -.. centered:: |multi_task_lasso_1| |multi_task_lasso_2| - -.. centered:: Fitting a time-series model, imposing that any active feature be active at all times. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_multi_task_lasso_support.py` - - -Mathematically, it consists of a linear model trained with a mixed -:math:`\ell_1` :math:`\ell_2` prior as regularizer. -The objective function to minimize is: - -.. math:: \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X W - Y||_{Fro} ^ 2 + \alpha ||W||_{21}} - -where :math:`Fro` indicates the Frobenius norm: - -.. math:: ||A||_{Fro} = \sqrt{\sum_{ij} a_{ij}^2} - -and :math:`\ell_1` :math:`\ell_2` reads: - -.. math:: ||A||_{2 1} = \sum_i \sqrt{\sum_j a_{ij}^2} - - -The implementation in the class :class:`MultiTaskLasso` uses coordinate descent as -the algorithm to fit the coefficients. - - -.. 
_elastic_net: - -Elastic Net -=========== -:class:`ElasticNet` is a linear regression model trained with L1 and L2 prior -as regularizer. This combination allows for learning a sparse model where -few of the weights are non-zero like :class:`Lasso`, while still maintaining -the regularization properties of :class:`Ridge`. We control the convex -combination of L1 and L2 using the ``l1_ratio`` parameter. - -Elastic-net is useful when there are multiple features which are -correlated with one another. Lasso is likely to pick one of these -at random, while elastic-net is likely to pick both. - -A practical advantage of trading-off between Lasso and Ridge is it allows -Elastic-Net to inherit some of Ridge's stability under rotation. - -The objective function to minimize is in this case - -.. math:: - - \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 + - \frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2} - - -.. figure:: ../auto_examples/linear_model/images/plot_lasso_coordinate_descent_path_001.png - :target: ../auto_examples/linear_model/plot_lasso_coordinate_descent_path.html - :align: center - :scale: 50% - -The class :class:`ElasticNetCV` can be used to set the parameters -``alpha`` (:math:`\alpha`) and ``l1_ratio`` (:math:`\rho`) by cross-validation. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_lasso_and_elasticnet.py` - * :ref:`example_linear_model_plot_lasso_coordinate_descent_path.py` - - - -.. _multi_task_elastic_net: - -Multi-task Elastic Net -====================== - -The :class:`MultiTaskElasticNet` is an elastic-net model that estimates sparse -coefficients for multiple regression problems jointly: ``Y`` is a 2D array, -of shape ``(n_samples, n_tasks)``. The constraint is that the selected -features are the same for all the regression problems, also called tasks. - -Mathematically, it consists of a linear model trained with a mixed -:math:`\ell_1` :math:`\ell_2` prior and :math:`\ell_2` prior as regularizer. -The objective function to minimize is: - -.. math:: - - \underset{W}{min\,} { \frac{1}{2n_{samples}} ||X W - Y||_{Fro}^2 + \alpha \rho ||W||_{2 1} + - \frac{\alpha(1-\rho)}{2} ||W||_{Fro}^2} - -The implementation in the class :class:`MultiTaskElasticNet` uses coordinate descent as -the algorithm to fit the coefficients. - -The class :class:`MultiTaskElasticNetCV` can be used to set the parameters -``alpha`` (:math:`\alpha`) and ``l1_ratio`` (:math:`\rho`) by cross-validation. - - -.. _least_angle_regression: - -Least Angle Regression -====================== - -Least-angle regression (LARS) is a regression algorithm for -high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain -Johnstone and Robert Tibshirani. - -The advantages of LARS are: - - - It is numerically efficient in contexts where p >> n (i.e., when the - number of dimensions is significantly greater than the number of - points) - - - It is computationally just as fast as forward selection and has - the same order of complexity as an ordinary least squares. - - - It produces a full piecewise linear solution path, which is - useful in cross-validation or similar attempts to tune the model. - - - If two variables are almost equally correlated with the response, - then their coefficients should increase at approximately the same - rate. The algorithm thus behaves as intuition would expect, and - also is more stable. - - - It is easily modified to produce solutions for other estimators, - like the Lasso. 
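To make the last two advantages above more concrete, here is a minimal sketch
(synthetic toy data; numbers and shapes are illustrative only) of retrieving
the full piecewise-linear Lasso path with :func:`lars_path`::

    >>> import numpy as np
    >>> from sklearn.linear_model import lars_path
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(20, 5)
    >>> y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(20)
    >>> # coefs holds one column of coefficients per knot of the path
    >>> alphas, active, coefs = lars_path(X, y, method='lasso')

Plotting ``coefs.T`` against ``alphas`` therefore displays the whole
regularization path in one go, without refitting the model for each candidate
value of the regularization parameter.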
- -The disadvantages of the LARS method include: - - - Because LARS is based upon an iterative refitting of the - residuals, it would appear to be especially sensitive to the - effects of noise. This problem is discussed in detail by Weisberg - in the discussion section of the Efron et al. (2004) Annals of - Statistics article. - -The LARS model can be used using estimator :class:`Lars`, or its -low-level implementation :func:`lars_path`. - - -LARS Lasso -========== - -:class:`LassoLars` is a lasso model implemented using the LARS -algorithm, and unlike the implementation based on coordinate_descent, -this yields the exact solution, which is piecewise linear as a -function of the norm of its coefficients. - -.. figure:: ../auto_examples/linear_model/images/plot_lasso_lars_001.png - :target: ../auto_examples/linear_model/plot_lasso_lars.html - :align: center - :scale: 50% - -:: - - >>> from sklearn import linear_model - >>> clf = linear_model.LassoLars(alpha=.1) - >>> clf.fit([[0, 0], [1, 1]], [0, 1]) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE - LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True, - fit_path=True, max_iter=500, normalize=True, positive=False, - precompute='auto', verbose=False) - >>> clf.coef_ # doctest: +ELLIPSIS - array([ 0.717157..., 0. ]) - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_lasso_lars.py` - -The Lars algorithm provides the full path of the coefficients along -the regularization parameter almost for free, thus a common operation -consist of retrieving the path with function :func:`lars_path` - -Mathematical formulation ------------------------- - -The algorithm is similar to forward stepwise regression, but instead -of including variables at each step, the estimated parameters are -increased in a direction equiangular to each one's correlations with -the residual. - -Instead of giving a vector result, the LARS solution consists of a -curve denoting the solution for each value of the L1 norm of the -parameter vector. The full coefficients path is stored in the array -``coef_path_``, which has size (n_features, max_features+1). The first -column is always zero. - -.. topic:: References: - - * Original Algorithm is detailed in the paper `Least Angle Regression - `_ - by Hastie et al. - - -.. _omp: - -Orthogonal Matching Pursuit (OMP) -================================= -:class:`OrthogonalMatchingPursuit` and :func:`orthogonal_mp` implements the OMP -algorithm for approximating the fit of a linear model with constraints imposed -on the number of non-zero coefficients (ie. the L :sub:`0` pseudo-norm). - -Being a forward feature selection method like :ref:`least_angle_regression`, -orthogonal matching pursuit can approximate the optimum solution vector with a -fixed number of non-zero elements: - -.. math:: \text{arg\,min\,} ||y - X\gamma||_2^2 \text{ subject to } \ - ||\gamma||_0 \leq n_{nonzero\_coefs} - -Alternatively, orthogonal matching pursuit can target a specific error instead -of a specific number of non-zero coefficients. This can be expressed as: - -.. math:: \text{arg\,min\,} ||\gamma||_0 \text{ subject to } ||y-X\gamma||_2^2 \ - \leq \text{tol} - - -OMP is based on a greedy algorithm that includes at each step the atom most -highly correlated with the current residual. It is similar to the simpler -matching pursuit (MP) method, but better in that at each iteration, the -residual is recomputed using an orthogonal projection on the space of the -previously chosen dictionary elements. - - -.. 
topic:: Examples: - - * :ref:`example_linear_model_plot_omp.py` - -.. topic:: References: - - * http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf - - * `Matching pursuits with time-frequency dictionaries - `_, - S. G. Mallat, Z. Zhang, - - -.. _bayesian_regression: - -Bayesian Regression -=================== - -Bayesian regression techniques can be used to include regularization -parameters in the estimation procedure: the regularization parameter is -not set in a hard sense but tuned to the data at hand. - -This can be done by introducing `uninformative priors -`__ -over the hyper parameters of the model. -The :math:`\ell_{2}` regularization used in `Ridge Regression`_ is equivalent -to finding a maximum a-postiori solution under a Gaussian prior over the -parameters :math:`w` with precision :math:`\lambda^-1`. Instead of setting -`\lambda` manually, it is possible to treat it as a random variable to be -estimated from the data. - -To obtain a fully probabilistic model, the output :math:`y` is assumed -to be Gaussian distributed around :math:`X w`: - -.. math:: p(y|X,w,\alpha) = \mathcal{N}(y|X w,\alpha) - -Alpha is again treated as a random variable that is to be estimated from the -data. - -The advantages of Bayesian Regression are: - - - It adapts to the data at hand. - - - It can be used to include regularization parameters in the - estimation procedure. - -The disadvantages of Bayesian regression include: - - - Inference of the model can be time consuming. - - -.. topic:: References - - * A good introduction to Bayesian methods is given in C. Bishop: Pattern - Recognition and Machine learning - - * Original Algorithm is detailed in the book `Bayesian learning for neural - networks` by Radford M. Neal - -.. _bayesian_ridge_regression: - -Bayesian Ridge Regression -------------------------- - -:class:`BayesianRidge` estimates a probabilistic model of the -regression problem as described above. -The prior for the parameter :math:`w` is given by a spherical Gaussian: - -.. math:: p(w|\lambda) = - \mathcal{N}(w|0,\lambda^{-1}\bold{I_{p}}) - -The priors over :math:`\alpha` and :math:`\lambda` are chosen to be `gamma -distributions `__, the -conjugate prior for the precision of the Gaussian. - -The resulting model is called *Bayesian Ridge Regression*, and is similar to the -classical :class:`Ridge`. The parameters :math:`w`, :math:`\alpha` and -:math:`\lambda` are estimated jointly during the fit of the model. The -remaining hyperparameters are the parameters of the gamma priors over -:math:`\alpha` and :math:`\lambda`. These are usually chosen to be -*non-informative*. The parameters are estimated by maximizing the *marginal -log likelihood*. - -By default :math:`\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 1.e^{-6}`. - - -.. figure:: ../auto_examples/linear_model/images/plot_bayesian_ridge_001.png - :target: ../auto_examples/linear_model/plot_bayesian_ridge.html - :align: center - :scale: 50% - - -Bayesian Ridge Regression is used for regression:: - - >>> from sklearn import linear_model - >>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]] - >>> Y = [0., 1., 2., 3.] 
- >>> clf = linear_model.BayesianRidge() - >>> clf.fit(X, Y) - BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True, - fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300, - normalize=False, tol=0.001, verbose=False) - -After being fitted, the model can then be used to predict new values:: - - >>> clf.predict ([[1, 0.]]) - array([ 0.50000013]) - - -The weights :math:`w` of the model can be access:: - - >>> clf.coef_ - array([ 0.49999993, 0.49999993]) - -Due to the Bayesian framework, the weights found are slightly different to the -ones found by :ref:`ordinary_least_squares`. However, Bayesian Ridge Regression -is more robust to ill-posed problem. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_bayesian_ridge.py` - -.. topic:: References - - * More details can be found in the article `Bayesian Interpolation - `_ - by MacKay, David J. C. - - - -Automatic Relevance Determination - ARD ---------------------------------------- - -:class:`ARDRegression` is very similar to `Bayesian Ridge Regression`_, -but can lead to sparser weights :math:`w` [1]_ [2]_. -:class:`ARDRegression` poses a different prior over :math:`w`, by dropping the -assumption of the Gaussian being spherical. - -Instead, the distribution over :math:`w` is assumed to be an axis-parallel, -elliptical Gaussian distribution. - -This means each weight :math:`w_{i}` is drawn from a Gaussian distribution, -centered on zero and with a precision :math:`\lambda_{i}`: - -.. math:: p(w|\lambda) = \mathcal{N}(w|0,A^{-1}) - -with :math:`diag \; (A) = \lambda = \{\lambda_{1},...,\lambda_{p}\}`. - -In contrast to `Bayesian Ridge Regression`_, each coordinate of :math:`w_{i}` -has its own standard deviation :math:`\lambda_i`. The prior over all -:math:`\lambda_i` is chosen to be the same gamma distribution given by -hyperparameters :math:`\lambda_1` and :math:`\lambda_2`. - -.. figure:: ../auto_examples/linear_model/images/plot_ard_001.png - :target: ../auto_examples/linear_model/plot_ard.html - :align: center - :scale: 50% - - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_ard.py` - -.. topic:: References: - - .. [1] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1 - -<<<<<<< HEAD - .. [2] David Wipf and Srikantan Nagarajan: `A new view of automatic relevance determination. `_ -======= - .. [2] David Wipf and Srikantan Nagarajan: `A new view of automatic relevance determination. `_ ->>>>>>> origin/master - -.. _Logistic_regression: - -Logistic regression -=================== - -Logistic regression, despite its name, is a linear model for classification -rather than regression. Logistic regression is also known in the literature as -logit regression, maximum-entropy classification (MaxEnt) -or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a `logistic function `_. - -The implementation of logistic regression in scikit-learn can be accessed from -class :class:`LogisticRegression`. This implementation can fit binary, One-vs- -Rest, or multinomial logistic regression with optional L2 or L1 -regularization. - -As an optimization problem, binary class L2 penalized logistic regression -minimizes the following cost function: - -.. math:: \underset{w, c}{min\,} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) . - -Similarly, L1 regularized logistic regression solves the following -optimization problem - -.. 
math:: \underset{w, c}{min\,} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) . - -The solvers implemented in the class :class:`LogisticRegression` -are "liblinear", "newton-cg", "lbfgs" and "sag": - -The solver "liblinear" uses a coordinate descent (CD) algorithm, and relies -on the excellent C++ `LIBLINEAR library -`_, which is shipped with -scikit-learn. However, the CD algorithm implemented in liblinear cannot learn -a true multinomial (multiclass) model; instead, the optimization problem is -decomposed in a "one-vs-rest" fashion so separate binary classifiers are -trained for all classes. This happens under the hood, so -:class:`LogisticRegression` instances using this solver behave as multiclass -classifiers. For L1 penalization :func:`sklearn.svm.l1_min_c` allows to -calculate the lower bound for C in order to get a non "null" (all feature -weights to zero) model. - -The "lbfgs", "sag" and "newton-cg" solvers only support L2 penalization and -are found to converge faster for some high dimensional data. Setting -`multi_class` to "multinomial" with these solvers learns a true multinomial -logistic regression model [3]_, which means that its probability estimates -should be better calibrated than the default "one-vs-rest" setting. The -"lbfgs", "sag" and "newton-cg"" solvers cannot optimize L1-penalized models, -therefore the "multinomial" setting does not learn sparse models. - -The solver "sag" uses a Stochastic Average Gradient descent [4]_. It is faster -than other solvers for large datasets, when both the number of samples and the -number of features are large. - -In a nutshell, one may choose the solver with the following rules: - -================================= ============================= -Case Solver -================================= ============================= -Small dataset or L1 penalty "liblinear" -Multinomial loss or large dataset "lbfgs", "sag" or newton-cg" -Very Large dataset "sag" -================================= ============================= -For large dataset, you may also consider using :class:`SGDClassifier` with 'log' loss. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_logistic_l1_l2_sparsity.py` - - * :ref:`example_linear_model_plot_logistic_path.py` - -.. _liblinear_differences: - -.. topic:: Differences from liblinear: - - There might be a difference in the scores obtained between - :class:`LogisticRegression` with ``solver=liblinear`` - or :class:`LinearSVC` and the external liblinear library directly, - when ``fit_intercept=False`` and the fit ``coef_`` (or) the data to - be predicted are zeroes. This is because for the sample(s) with - ``decision_function`` zero, :class:`LogisticRegression` and :class:`LinearSVC` - predict the negative class, while liblinear predicts the positive class. - Note that a model with ``fit_intercept=False`` and having many samples with - ``decision_function`` zero, is likely to be a underfit, bad model and you are - advised to set ``fit_intercept=True`` and increase the intercept_scaling. - -.. note:: **Feature selection with sparse logistic regression** - - A logistic regression with L1 penalty yields sparse models, and can - thus be used to perform feature selection, as detailed in - :ref:`l1_feature_selection`. - -:class:`LogisticRegressionCV` implements Logistic Regression with builtin -cross-validation to find out the optimal C parameter. "newton-cg", "sag" and -"lbfgs" solvers are found to be faster for high-dimensional dense data, due to -warm-starting. 
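A minimal usage sketch (the iris data and the settings shown are purely
illustrative, not recommendations)::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.linear_model import LogisticRegressionCV
    >>> iris = load_iris()
    >>> clf = LogisticRegressionCV(Cs=10, cv=5, solver='lbfgs')
    >>> clf.fit(iris.data, iris.target)       # doctest: +SKIP
    >>> clf.C_                                # doctest: +SKIP

After fitting, the selected regularization strength(s) are available in the
``C_`` attribute.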
For the multiclass case, if `multi_class` option is set to -"ovr", an optimal C is obtained for each class and if the `multi_class` option -is set to "multinomial", an optimal C is obtained by minimizing the cross- -entropy loss. - -.. topic:: References: - - .. [3] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. `_ - -Stochastic Gradient Descent - SGD -================================= - -Stochastic gradient descent is a simple yet very efficient approach -to fit linear models. It is particularly useful when the number of samples -(and the number of features) is very large. -The ``partial_fit`` method allows only/out-of-core learning. - -The classes :class:`SGDClassifier` and :class:`SGDRegressor` provide -functionality to fit linear models for classification and regression -using different (convex) loss functions and different penalties. -E.g., with ``loss="log"``, :class:`SGDClassifier` -fits a logistic regression model, -while with ``loss="hinge"`` it fits a linear support vector machine (SVM). - -.. topic:: References - - * :ref:`sgd` - -.. _perceptron: - -Perceptron -========== - -The :class:`Perceptron` is another simple algorithm suitable for large scale -learning. By default: - - - It does not require a learning rate. - - - It is not regularized (penalized). - - - It updates its model only on mistakes. - -The last characteristic implies that the Perceptron is slightly faster to -train than SGD with the hinge loss and that the resulting models are -sparser. - -.. _passive_aggressive: - -Passive Aggressive Algorithms -============================= - -The passive-aggressive algorithms are a family of algorithms for large-scale -learning. They are similar to the Perceptron in that they do not require a -learning rate. However, contrary to the Perceptron, they include a -regularization parameter ``C``. - -For classification, :class:`PassiveAggressiveClassifier` can be used with -``loss='hinge'`` (PA-I) or ``loss='squared_hinge'`` (PA-II). For regression, -:class:`PassiveAggressiveRegressor` can be used with -``loss='epsilon_insensitive'`` (PA-I) or -``loss='squared_epsilon_insensitive'`` (PA-II). - -.. topic:: References: - - - * `"Online Passive-Aggressive Algorithms" - `_ - K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006) - - -Robustness regression: outliers and modeling errors -===================================================== - -Robust regression is interested in fitting a regression model in the -presence of corrupt data: either outliers, or error in the model. - -.. figure:: ../auto_examples/linear_model/images/plot_theilsen_001.png - :target: ../auto_examples/linear_model/plot_theilsen.html - :scale: 50% - :align: center - -Different scenario and useful concepts ----------------------------------------- - -There are different things to keep in mind when dealing with data -corrupted by outliers: - -.. |y_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_003.png - :target: ../auto_examples/linear_model/plot_robust_fit.html - :scale: 60% - -.. |X_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_002.png - :target: ../auto_examples/linear_model/plot_robust_fit.html - :scale: 60% - -.. |large_y_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_005.png - :target: ../auto_examples/linear_model/plot_robust_fit.html - :scale: 60% - -* **Outliers in X or in y**? 
- - ==================================== ==================================== - Outliers in the y direction Outliers in the X direction - ==================================== ==================================== - |y_outliers| |X_outliers| - ==================================== ==================================== - -* **Fraction of outliers versus amplitude of error** - - The number of outlying points matters, but also how much they are - outliers. - - ==================================== ==================================== - Small outliers Large outliers - ==================================== ==================================== - |y_outliers| |large_y_outliers| - ==================================== ==================================== - -An important notion of robust fitting is that of breakdown point: the -fraction of data that can be outlying for the fit to start missing the -inlying data. - -Note that in general, robust fitting in high-dimensional setting (large -`n_features`) is very hard. The robust models here will probably not work -in these settings. - - -.. topic:: **Trade-offs: which estimator?** - - Scikit-learn provides 2 robust regression estimators: - :ref:`RANSAC ` and - :ref:`Theil Sen ` - - * :ref:`RANSAC ` is faster, and scales much better - with the number of samples - - * :ref:`RANSAC ` will deal better with large - outliers in the y direction (most common situation) - - * :ref:`Theil Sen ` will cope better with - medium-size outliers in the X direction, but this property will - disappear in large dimensional settings. - - When in doubt, use :ref:`RANSAC ` - -.. _ransac_regression: - -RANSAC: RANdom SAmple Consensus --------------------------------- - -RANSAC (RANdom SAmple Consensus) fits a model from random subsets of -inliers from the complete data set. - -RANSAC is a non-deterministic algorithm producing only a reasonable result with -a certain probability, which is dependent on the number of iterations (see -`max_trials` parameter). It is typically used for linear and non-linear -regression problems and is especially popular in the fields of photogrammetric -computer vision. - -The algorithm splits the complete input sample data into a set of inliers, -which may be subject to noise, and outliers, which are e.g. caused by erroneous -measurements or invalid hypotheses about the data. The resulting model is then -estimated only from the determined inliers. - -.. figure:: ../auto_examples/linear_model/images/plot_ransac_001.png - :target: ../auto_examples/linear_model/plot_ransac.html - :align: center - :scale: 50% - -Details of the algorithm -^^^^^^^^^^^^^^^^^^^^^^^^ - -Each iteration performs the following steps: - -1. Select ``min_samples`` random samples from the original data and check - whether the set of data is valid (see ``is_data_valid``). -2. Fit a model to the random subset (``base_estimator.fit``) and check - whether the estimated model is valid (see ``is_model_valid``). -3. Classify all data as inliers or outliers by calculating the residuals - to the estimated model (``base_estimator.predict(X) - y``) - all data - samples with absolute residuals smaller than the ``residual_threshold`` - are considered as inliers. -4. Save fitted model as best model if number of inlier samples is - maximal. In case the current estimated model has the same number of - inliers, it is only considered as the best model if it has better score. 
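As a rough usage sketch of these steps (synthetic data with a handful of
gross outliers; the parameter values are illustrative only)::

    >>> import numpy as np
    >>> from sklearn import linear_model
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(100, 1)
    >>> y = 3 * X.ravel() + 0.1 * rng.randn(100)
    >>> y[:5] += 10      # corrupt a few responses
    >>> ransac = linear_model.RANSACRegressor(linear_model.LinearRegression(),
    ...                                       residual_threshold=1.0,
    ...                                       random_state=0)
    >>> ransac.fit(X, y)                      # doctest: +SKIP

After fitting, the boolean ``inlier_mask_`` attribute exposes the final
inlier/outlier classification and ``estimator_`` holds the model refitted on
the consensus set.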
- -These steps are performed either a maximum number of times (``max_trials``) or -until one of the special stop criteria are met (see ``stop_n_inliers`` and -``stop_score``). The final model is estimated using all inlier samples (consensus -set) of the previously determined best model. - -The ``is_data_valid`` and ``is_model_valid`` functions allow to identify and reject -degenerate combinations of random sub-samples. If the estimated model is not -needed for identifying degenerate cases, ``is_data_valid`` should be used as it -is called prior to fitting the model and thus leading to better computational -performance. - - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_ransac.py` - * :ref:`example_linear_model_plot_robust_fit.py` - -.. topic:: References: - - * https://en.wikipedia.org/wiki/RANSAC - * `"Random Sample Consensus: A Paradigm for Model Fitting with Applications to - Image Analysis and Automated Cartography" - `_ - Martin A. Fischler and Robert C. Bolles - SRI International (1981) - * `"Performance Evaluation of RANSAC Family" - `_ - Sunglok Choi, Taemin Kim and Wonpil Yu - BMVC (2009) - -.. _theil_sen_regression: - -Theil-Sen estimator: generalized-median-based estimator --------------------------------------------------------- - -The :class:`TheilSenRegressor` estimator uses a generalization of the median in -multiple dimensions. It is thus robust to multivariate outliers. Note however -that the robustness of the estimator decreases quickly with the dimensionality -of the problem. It looses its robustness properties and becomes no -better than an ordinary least squares in high dimension. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_theilsen.py` - * :ref:`example_linear_model_plot_robust_fit.py` - -.. topic:: References: - - * https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator - -Theoretical considerations -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -:class:`TheilSenRegressor` is comparable to the :ref:`Ordinary Least Squares -(OLS) ` in terms of asymptotic efficiency and as an -unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric -method which means it makes no assumption about the underlying -distribution of the data. Since Theil-Sen is a median-based estimator, it -is more robust against corrupted data aka outliers. In univariate -setting, Theil-Sen has a breakdown point of about 29.3% in case of a -simple linear regression which means that it can tolerate arbitrary -corrupted data of up to 29.3%. - -.. figure:: ../auto_examples/linear_model/images/plot_theilsen_001.png - :target: ../auto_examples/linear_model/plot_theilsen.html - :align: center - :scale: 50% - -The implementation of :class:`TheilSenRegressor` in scikit-learn follows a -generalization to a multivariate linear regression model [#f1]_ using the -spatial median which is a generalization of the median to multiple -dimensions [#f2]_. - -In terms of time and space complexity, Theil-Sen scales according to - -.. math:: - \binom{n_{samples}}{n_{subsamples}} - -which makes it infeasible to be applied exhaustively to problems with a -large number of samples and features. Therefore, the magnitude of a -subpopulation can be chosen to limit the time and space complexity by -considering only a random subset of all possible combinations. - -.. topic:: Examples: - - * :ref:`example_linear_model_plot_theilsen.py` - -.. topic:: References: - - .. [#f1] Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang: `Theil-Sen Estimators in a Multiple Linear Regression Model. `_ - - .. [#f2] T. 
Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. `_ - -.. _polynomial_regression: - -Polynomial regression: extending linear models with basis functions -=================================================================== - -.. currentmodule:: sklearn.preprocessing - -One common pattern within machine learning is to use linear models trained -on nonlinear functions of the data. This approach maintains the generally -fast performance of linear methods, while allowing them to fit a much wider -range of data. - -For example, a simple linear regression can be extended by constructing -**polynomial features** from the coefficients. In the standard linear -regression case, you might have a model that looks like this for -two-dimensional data: - -.. math:: \hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 - -If we want to fit a paraboloid to the data instead of a plane, we can combine -the features in second-order polynomials, so that the model looks like this: - -.. math:: \hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2 - -The (sometimes surprising) observation is that this is *still a linear model*: -to see this, imagine creating a new variable - -.. math:: z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2] - -With this re-labeling of the data, our problem can be written - -.. math:: \hat{y}(w, x) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5 - -We see that the resulting *polynomial regression* is in the same class of -linear models we'd considered above (i.e. the model is linear in :math:`w`) -and can be solved by the same techniques. By considering linear fits within -a higher-dimensional space built with these basis functions, the model has the -flexibility to fit a much broader range of data. - -Here is an example of applying this idea to one-dimensional data, using -polynomial features of varying degrees: - -.. figure:: ../auto_examples/linear_model/images/plot_polynomial_interpolation_001.png - :target: ../auto_examples/linear_model/plot_polynomial_interpolation.html - :align: center - :scale: 50% - -This figure is created using the :class:`PolynomialFeatures` preprocessor. -This preprocessor transforms an input data matrix into a new data matrix -of a given degree. It can be used as follows:: - - >>> from sklearn.preprocessing import PolynomialFeatures - >>> import numpy as np - >>> X = np.arange(6).reshape(3, 2) - >>> X - array([[0, 1], - [2, 3], - [4, 5]]) - >>> poly = PolynomialFeatures(degree=2) - >>> poly.fit_transform(X) - array([[ 1., 0., 1., 0., 0., 1.], - [ 1., 2., 3., 4., 6., 9.], - [ 1., 4., 5., 16., 20., 25.]]) - -The features of ``X`` have been transformed from :math:`[x_1, x_2]` to -:math:`[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]`, and can now be used within -any linear model. - -This sort of preprocessing can be streamlined with the -:ref:`Pipeline ` tools. A single object representing a simple -polynomial regression can be created and used as follows:: - - >>> from sklearn.preprocessing import PolynomialFeatures - >>> from sklearn.linear_model import LinearRegression - >>> from sklearn.pipeline import Pipeline - >>> import numpy as np - >>> model = Pipeline([('poly', PolynomialFeatures(degree=3)), - ... 
('linear', LinearRegression(fit_intercept=False))]) - >>> # fit to an order-3 polynomial data - >>> x = np.arange(5) - >>> y = 3 - 2 * x + x ** 2 - x ** 3 - >>> model = model.fit(x[:, np.newaxis], y) - >>> model.named_steps['linear'].coef_ - array([ 3., -2., 1., -1.]) - -The linear model trained on polynomial features is able to exactly recover -the input polynomial coefficients. - -In some cases it's not necessary to include higher powers of any single feature, -but only the so-called *interaction features* -that multiply together at most :math:`d` distinct features. -These can be gotten from :class:`PolynomialFeatures` with the setting -``interaction_only=True``. - -For example, when dealing with boolean features, -:math:`x_i^n = x_i` for all :math:`n` and is therefore useless; -but :math:`x_i x_j` represents the conjunction of two booleans. -This way, we can solve the XOR problem with a linear classifier:: - - >>> from sklearn.linear_model import Perceptron - >>> from sklearn.preprocessing import PolynomialFeatures - >>> import numpy as np - >>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) - >>> y = X[:, 0] ^ X[:, 1] - >>> X = PolynomialFeatures(interaction_only=True).fit_transform(X) - >>> X - array([[ 1., 0., 0., 0.], - [ 1., 0., 1., 0.], - [ 1., 1., 0., 0.], - [ 1., 1., 1., 1.]]) - >>> clf = Perceptron(fit_intercept=False, n_iter=10, shuffle=False).fit(X, y) - >>> clf.score(X, y) - 1.0 - - - diff --git a/doc/modules/neighbors.rst.orig b/doc/modules/neighbors.rst.orig deleted file mode 100644 index ac259272fb076..0000000000000 --- a/doc/modules/neighbors.rst.orig +++ /dev/null @@ -1,694 +0,0 @@ -.. _neighbors: - -================= -Nearest Neighbors -================= - -.. sectionauthor:: Jake Vanderplas - -.. currentmodule:: sklearn.neighbors - -:mod:`sklearn.neighbors` provides functionality for unsupervised and -supervised neighbors-based learning methods. Unsupervised nearest neighbors -is the foundation of many other learning methods, -notably manifold learning and spectral clustering. Supervised neighbors-based -learning comes in two flavors: `classification`_ for data with -discrete labels, and `regression`_ for data with continuous labels. - -The principle behind nearest neighbor methods is to find a predefined number -of training samples closest in distance to the new point, and -predict the label from these. The number of samples can be a user-defined -constant (k-nearest neighbor learning), or vary based -on the local density of points (radius-based neighbor learning). -The distance can, in general, be any metric measure: standard Euclidean -distance is the most common choice. -Neighbors-based methods are known as *non-generalizing* machine -learning methods, since they simply "remember" all of its training data -(possibly transformed into a fast indexing structure such as a -:ref:`Ball Tree ` or :ref:`KD Tree `.). - -Despite its simplicity, nearest neighbors has been successful in a -large number of classification and regression problems, including -handwritten digits or satellite image scenes. Being a non-parametric method, -it is often successful in classification situations where the decision -boundary is very irregular. - -The classes in :mod:`sklearn.neighbors` can handle either Numpy arrays or -`scipy.sparse` matrices as input. For dense matrices, a large number of -possible distance metrics are supported. For sparse matrices, arbitrary -Minkowski metrics are supported for searches. 
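A minimal sketch of the input handling just described, on a toy dataset (the values are illustrative, not taken from the patch): the same estimator accepts either a dense array or a ``scipy.sparse`` matrix, with sparse input served by the brute-force backend::

    >>> import numpy as np
    >>> from scipy.sparse import csr_matrix
    >>> from sklearn.neighbors import NearestNeighbors
    >>> X = np.array([[0., 1.], [1., 0.], [2., 2.]])
    >>> nn = NearestNeighbors(n_neighbors=2).fit(csr_matrix(X))  # sparse fit
    >>> distances, indices = nn.kneighbors(X[:1])                # dense query
    >>> indices.shape
    (1, 2)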
- -There are many learning routines which rely on nearest neighbors at their -core. One example is :ref:`kernel density estimation `, -discussed in the :ref:`density estimation ` section. - - -.. _unsupervised_neighbors: - -Unsupervised Nearest Neighbors -============================== - -:class:`NearestNeighbors` implements unsupervised nearest neighbors learning. -It acts as a uniform interface to three different nearest neighbors -algorithms: :class:`BallTree`, :class:`KDTree`, and a -brute-force algorithm based on routines in :mod:`sklearn.metrics.pairwise`. -The choice of neighbors search algorithm is controlled through the keyword -``'algorithm'``, which must be one of -``['auto', 'ball_tree', 'kd_tree', 'brute']``. When the default value -``'auto'`` is passed, the algorithm attempts to determine the best approach -from the training data. For a discussion of the strengths and weaknesses -of each option, see `Nearest Neighbor Algorithms`_. - - .. warning:: - - Regarding the Nearest Neighbors algorithms, if two - neighbors, neighbor :math:`k+1` and :math:`k`, have identical distances - but different labels, the results will depend on the ordering of the - training data. - -Finding the Nearest Neighbors ------------------------------ -For the simple task of finding the nearest neighbors between two sets of -data, the unsupervised algorithms within :mod:`sklearn.neighbors` can be -used: - - >>> from sklearn.neighbors import NearestNeighbors - >>> import numpy as np - >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) - >>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X) - >>> distances, indices = nbrs.kneighbors(X) - >>> indices # doctest: +ELLIPSIS - array([[0, 1], - [1, 0], - [2, 1], - [3, 4], - [4, 3], - [5, 4]]...) - >>> distances - array([[ 0. , 1. ], - [ 0. , 1. ], - [ 0. , 1.41421356], - [ 0. , 1. ], - [ 0. , 1. ], - [ 0. , 1.41421356]]) - -Because the query set matches the training set, the nearest neighbor of each -point is the point itself, at a distance of zero. - -It is also possible to efficiently produce a sparse graph showing the -connections between neighboring points: - - >>> nbrs.kneighbors_graph(X).toarray() - array([[ 1., 1., 0., 0., 0., 0.], - [ 1., 1., 0., 0., 0., 0.], - [ 0., 1., 1., 0., 0., 0.], - [ 0., 0., 0., 1., 1., 0.], - [ 0., 0., 0., 1., 1., 0.], - [ 0., 0., 0., 0., 1., 1.]]) - -Our dataset is structured such that points nearby in index order are nearby -in parameter space, leading to an approximately block-diagonal matrix of -K-nearest neighbors. Such a sparse graph is useful in a variety of -circumstances which make use of spatial relationships between points for -unsupervised learning: in particular, see :class:`sklearn.manifold.Isomap`, -:class:`sklearn.manifold.LocallyLinearEmbedding`, and -:class:`sklearn.cluster.SpectralClustering`. - -KDTree and BallTree Classes ---------------------------- -Alternatively, one can use the :class:`KDTree` or :class:`BallTree` classes -directly to find nearest neighbors. This is the functionality wrapped by -the :class:`NearestNeighbors` class used above. 
The Ball Tree and KD Tree -have the same interface; we'll show an example of using the KD Tree here: - - >>> from sklearn.neighbors import KDTree - >>> import numpy as np - >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) - >>> kdt = KDTree(X, leaf_size=30, metric='euclidean') - >>> kdt.query(X, k=2, return_distance=False) # doctest: +ELLIPSIS - array([[0, 1], - [1, 0], - [2, 1], - [3, 4], - [4, 3], - [5, 4]]...) - -Refer to the :class:`KDTree` and :class:`BallTree` class documentation -for more information on the options available for neighbors searches, -including specification of query strategies, of various distance metrics, etc. -For a list of available metrics, see the documentation of the -:class:`DistanceMetric` class. - -.. _classification: - -Nearest Neighbors Classification -================================ - -Neighbors-based classification is a type of *instance-based learning* or -*non-generalizing learning*: it does not attempt to construct a general -internal model, but simply stores instances of the training data. -Classification is computed from a simple majority vote of the nearest -neighbors of each point: a query point is assigned the data class which -has the most representatives within the nearest neighbors of the point. - -scikit-learn implements two different nearest neighbors classifiers: -:class:`KNeighborsClassifier` implements learning based on the :math:`k` -nearest neighbors of each query point, where :math:`k` is an integer value -specified by the user. :class:`RadiusNeighborsClassifier` implements learning -based on the number of neighbors within a fixed radius :math:`r` of each -training point, where :math:`r` is a floating-point value specified by -the user. - -The :math:`k`-neighbors classification in :class:`KNeighborsClassifier` -is the more commonly used of the two techniques. The -optimal choice of the value :math:`k` is highly data-dependent: in general -a larger :math:`k` suppresses the effects of noise, but makes the -classification boundaries less distinct. - -In cases where the data is not uniformly sampled, radius-based neighbors -classification in :class:`RadiusNeighborsClassifier` can be a better choice. -The user specifies a fixed radius :math:`r`, such that points in sparser -neighborhoods use fewer nearest neighbors for the classification. For -high-dimensional parameter spaces, this method becomes less effective due -to the so-called "curse of dimensionality". - -The basic nearest neighbors classification uses uniform weights: that is, the -value assigned to a query point is computed from a simple majority vote of -the nearest neighbors. Under some circumstances, it is better to weight the -neighbors such that nearer neighbors contribute more to the fit. This can -be accomplished through the ``weights`` keyword. The default value, -``weights = 'uniform'``, assigns uniform weights to each neighbor. -``weights = 'distance'`` assigns weights proportional to the inverse of the -distance from the query point. Alternatively, a user-defined function of the -distance can be supplied which is used to compute the weights. - - - -.. |classification_1| image:: ../auto_examples/neighbors/images/plot_classification_001.png - :target: ../auto_examples/neighbors/plot_classification.html - :scale: 50 - -.. |classification_2| image:: ../auto_examples/neighbors/images/plot_classification_002.png - :target: ../auto_examples/neighbors/plot_classification.html - :scale: 50 - -.. centered:: |classification_1| |classification_2| - -.. 
topic:: Examples: - - * :ref:`example_neighbors_plot_classification.py`: an example of - classification using nearest neighbors. - -.. _regression: - -Nearest Neighbors Regression -============================ - -Neighbors-based regression can be used in cases where the data labels are -continuous rather than discrete variables. The label assigned to a query -point is computed based the mean of the labels of its nearest neighbors. - -scikit-learn implements two different neighbors regressors: -:class:`KNeighborsRegressor` implements learning based on the :math:`k` -nearest neighbors of each query point, where :math:`k` is an integer -value specified by the user. :class:`RadiusNeighborsRegressor` implements -learning based on the neighbors within a fixed radius :math:`r` of the -query point, where :math:`r` is a floating-point value specified by the -user. - -The basic nearest neighbors regression uses uniform weights: that is, -each point in the local neighborhood contributes uniformly to the -classification of a query point. Under some circumstances, it can be -advantageous to weight points such that nearby points contribute more -to the regression than faraway points. This can be accomplished through -the ``weights`` keyword. The default value, ``weights = 'uniform'``, -assigns equal weights to all points. ``weights = 'distance'`` assigns -weights proportional to the inverse of the distance from the query point. -Alternatively, a user-defined function of the distance can be supplied, -which will be used to compute the weights. - -.. figure:: ../auto_examples/neighbors/images/plot_regression_001.png - :target: ../auto_examples/neighbors/plot_regression.html - :align: center - :scale: 75 - -The use of multi-output nearest neighbors for regression is demonstrated in -:ref:`example_plot_multioutput_face_completion.py`. In this example, the inputs -X are the pixels of the upper half of faces and the outputs Y are the pixels of -the lower half of those faces. - -.. figure:: ../auto_examples/images/plot_multioutput_face_completion_001.png - :target: ../auto_examples/plot_multioutput_face_completion.html - :scale: 75 - :align: center - - -.. topic:: Examples: - - * :ref:`example_neighbors_plot_regression.py`: an example of regression - using nearest neighbors. - - * :ref:`example_plot_multioutput_face_completion.py`: an example of - multi-output regression using nearest neighbors. - - -Nearest Neighbor Algorithms -=========================== - -.. _brute_force: - -Brute Force ------------ - -Fast computation of nearest neighbors is an active area of research in -machine learning. The most naive neighbor search implementation involves -the brute-force computation of distances between all pairs of points in the -dataset: for :math:`N` samples in :math:`D` dimensions, this approach scales -as :math:`O[D N^2]`. Efficient brute-force neighbors searches can be very -competitive for small data samples. -However, as the number of samples :math:`N` grows, the brute-force -approach quickly becomes infeasible. In the classes within -:mod:`sklearn.neighbors`, brute-force neighbors searches are specified -using the keyword ``algorithm = 'brute'``, and are computed using the -routines available in :mod:`sklearn.metrics.pairwise`. - -.. _kd_tree: - -K-D Tree --------- - -To address the computational inefficiencies of the brute-force approach, a -variety of tree-based data structures have been invented. 
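A minimal sketch of switching between these backends via the ``algorithm`` keyword mentioned above (dataset and parameter values are illustrative assumptions): both searches return the same exact neighbors, only the query cost differs::

    >>> import numpy as np
    >>> from sklearn.neighbors import NearestNeighbors
    >>> X = np.random.RandomState(0).rand(500, 3)
    >>> brute = NearestNeighbors(n_neighbors=5, algorithm='brute').fit(X)
    >>> kd = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
    >>> np.allclose(brute.kneighbors(X[:10])[0], kd.kneighbors(X[:10])[0])
    True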
In general, these -structures attempt to reduce the required number of distance calculations -by efficiently encoding aggregate distance information for the sample. -The basic idea is that if point :math:`A` is very distant from point -:math:`B`, and point :math:`B` is very close to point :math:`C`, -then we know that points :math:`A` and :math:`C` -are very distant, *without having to explicitly calculate their distance*. -In this way, the computational cost of a nearest neighbors search can be -reduced to :math:`O[D N \log(N)]` or better. This is a significant -improvement over brute-force for large :math:`N`. - -An early approach to taking advantage of this aggregate information was -the *KD tree* data structure (short for *K-dimensional tree*), which -generalizes two-dimensional *Quad-trees* and 3-dimensional *Oct-trees* -to an arbitrary number of dimensions. The KD tree is a binary tree -structure which recursively partitions the parameter space along the data -axes, dividing it into nested orthotopic regions into which data points -are filed. The construction of a KD tree is very fast: because partitioning -is performed only along the data axes, no :math:`D`-dimensional distances -need to be computed. Once constructed, the nearest neighbor of a query -point can be determined with only :math:`O[\log(N)]` distance computations. -Though the KD tree approach is very fast for low-dimensional (:math:`D < 20`) -neighbors searches, it becomes inefficient as :math:`D` grows very large: -this is one manifestation of the so-called "curse of dimensionality". -In scikit-learn, KD tree neighbors searches are specified using the -keyword ``algorithm = 'kd_tree'``, and are computed using the class -:class:`KDTree`. - - -.. topic:: References: - - * `"Multidimensional binary search trees used for associative searching" - `_, - Bentley, J.L., Communications of the ACM (1975) - - -.. _ball_tree: - -Ball Tree ---------- - -To address the inefficiencies of KD Trees in higher dimensions, the *ball tree* -data structure was developed. Where KD trees partition data along -Cartesian axes, ball trees partition data in a series of nesting -hyper-spheres. This makes tree construction more costly than that of the -KD tree, but -results in a data structure which can be very efficient on highly-structured -data, even in very high dimensions. - -A ball tree recursively divides the data into -nodes defined by a centroid :math:`C` and radius :math:`r`, such that each -point in the node lies within the hyper-sphere defined by :math:`r` and -:math:`C`. The number of candidate points for a neighbor search -is reduced through use of the *triangle inequality*: - -.. math:: |x+y| \leq |x| + |y| - -With this setup, a single distance calculation between a test point and -the centroid is sufficient to determine a lower and upper bound on the -distance to all points within the node. -Because of the spherical geometry of the ball tree nodes, it can out-perform -a *KD-tree* in high dimensions, though the actual performance is highly -dependent on the structure of the training data. -In scikit-learn, ball-tree-based -neighbors searches are specified using the keyword ``algorithm = 'ball_tree'``, -and are computed using the class :class:`sklearn.neighbors.BallTree`. -Alternatively, the user can work with the :class:`BallTree` class directly. - -.. 
topic:: References: - - * `"Five balltree construction algorithms" - `_, - Omohundro, S.M., International Computer Science Institute - Technical Report (1989) - -Choice of Nearest Neighbors Algorithm -------------------------------------- -The optimal algorithm for a given dataset is a complicated choice, and -depends on a number of factors: - -* number of samples :math:`N` (i.e. ``n_samples``) and dimensionality - :math:`D` (i.e. ``n_features``). - - * *Brute force* query time grows as :math:`O[D N]` - * *Ball tree* query time grows as approximately :math:`O[D \log(N)]` - * *KD tree* query time changes with :math:`D` in a way that is difficult - to precisely characterise. For small :math:`D` (less than 20 or so) - the cost is approximately :math:`O[D\log(N)]`, and the KD tree - query can be very efficient. - For larger :math:`D`, the cost increases to nearly :math:`O[DN]`, and - the overhead due to the tree - structure can lead to queries which are slower than brute force. - - For small data sets (:math:`N` less than 30 or so), :math:`\log(N)` is - comparable to :math:`N`, and brute force algorithms can be more efficient - than a tree-based approach. Both :class:`KDTree` and :class:`BallTree` - address this through providing a *leaf size* parameter: this controls the - number of samples at which a query switches to brute-force. This allows both - algorithms to approach the efficiency of a brute-force computation for small - :math:`N`. - -* data structure: *intrinsic dimensionality* of the data and/or *sparsity* - of the data. Intrinsic dimensionality refers to the dimension - :math:`d \le D` of a manifold on which the data lies, which can be linearly - or non-linearly embedded in the parameter space. Sparsity refers to the - degree to which the data fills the parameter space (this is to be - distinguished from the concept as used in "sparse" matrices. The data - matrix may have no zero entries, but the **structure** can still be - "sparse" in this sense). - - * *Brute force* query time is unchanged by data structure. - * *Ball tree* and *KD tree* query times can be greatly influenced - by data structure. In general, sparser data with a smaller intrinsic - dimensionality leads to faster query times. Because the KD tree - internal representation is aligned with the parameter axes, it will not - generally show as much improvement as ball tree for arbitrarily - structured data. - - Datasets used in machine learning tend to be very structured, and are - very well-suited for tree-based queries. - -* number of neighbors :math:`k` requested for a query point. - - * *Brute force* query time is largely unaffected by the value of :math:`k` - * *Ball tree* and *KD tree* query time will become slower as :math:`k` - increases. This is due to two effects: first, a larger :math:`k` leads - to the necessity to search a larger portion of the parameter space. - Second, using :math:`k > 1` requires internal queueing of results - as the tree is traversed. - - As :math:`k` becomes large compared to :math:`N`, the ability to prune - branches in a tree-based query is reduced. In this situation, Brute force - queries can be more efficient. - -* number of query points. Both the ball tree and the KD Tree - require a construction phase. The cost of this construction becomes - negligible when amortized over many queries. If only a small number of - queries will be performed, however, the construction can make up - a significant fraction of the total cost. 
If very few query points - will be required, brute force is better than a tree-based method. - -Currently, ``algorithm = 'auto'`` selects ``'kd_tree'`` if :math:`k < N/2` -and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of -``'kd_tree'``. It selects ``'ball_tree'`` if :math:`k < N/2` and the -``'effective_metric_'`` is not in the ``'VALID_METRICS'`` list of -``'kd_tree'``. It selects ``'brute'`` if :math:`k >= N/2`. This choice is based on the assumption that the number of query points is at least the -same order as the number of training points, and that ``leaf_size`` is -close to its default value of ``30``. - -Effect of ``leaf_size`` ------------------------ -As noted above, for small sample sizes a brute force search can be more -efficient than a tree-based query. This fact is accounted for in the ball -tree and KD tree by internally switching to brute force searches within -leaf nodes. The level of this switch can be specified with the parameter -``leaf_size``. This parameter choice has many effects: - -**construction time** - A larger ``leaf_size`` leads to a faster tree construction time, because - fewer nodes need to be created - -**query time** - Both a large or small ``leaf_size`` can lead to suboptimal query cost. - For ``leaf_size`` approaching 1, the overhead involved in traversing - nodes can significantly slow query times. For ``leaf_size`` approaching - the size of the training set, queries become essentially brute force. - A good compromise between these is ``leaf_size = 30``, the default value - of the parameter. - -**memory** - As ``leaf_size`` increases, the memory required to store a tree structure - decreases. This is especially important in the case of ball tree, which - stores a :math:`D`-dimensional centroid for each node. The required - storage space for :class:`BallTree` is approximately ``1 / leaf_size`` times - the size of the training set. - -``leaf_size`` is not referenced for brute force queries. - -.. _nearest_centroid_classifier: - -Nearest Centroid Classifier -=========================== - -The :class:`NearestCentroid` classifier is a simple algorithm that represents -each class by the centroid of its members. In effect, this makes it -similar to the label updating phase of the :class:`sklearn.KMeans` algorithm. -It also has no parameters to choose, making it a good baseline classifier. It -does, however, suffer on non-convex classes, as well as when classes have -drastically different variances, as equal variance in all dimensions is -assumed. See Linear Discriminant Analysis (:class:`sklearn.discriminant_analysis.LinearDiscriminantAnalysis`) -and Quadratic Discriminant Analysis (:class:`sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis`) -for more complex methods that do not make this assumption. Usage of the default -:class:`NearestCentroid` is simple: - - >>> from sklearn.neighbors.nearest_centroid import NearestCentroid - >>> import numpy as np - >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) - >>> y = np.array([1, 1, 1, 2, 2, 2]) - >>> clf = NearestCentroid() - >>> clf.fit(X, y) - NearestCentroid(metric='euclidean', shrink_threshold=None) - >>> print(clf.predict([[-0.8, -1]])) - [1] - - -Nearest Shrunken Centroid -------------------------- - -The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter, -which implements the nearest shrunken centroid classifier. In effect, the value -of each feature for each centroid is divided by the within-class variance of -that feature. 
The feature values are then reduced by ``shrink_threshold``. Most -notably, if a particular feature value crosses zero, it is set -to zero. In effect, this removes the feature from affecting the classification. -This is useful, for example, for removing noisy features. - -In the example below, using a small shrink threshold increases the accuracy of -the model from 0.81 to 0.82. - -.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_001.png - :target: ../auto_examples/neighbors/plot_nearest_centroid.html - :scale: 50 - -.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_002.png - :target: ../auto_examples/neighbors/plot_nearest_centroid.html - :scale: 50 - -.. centered:: |nearest_centroid_1| |nearest_centroid_2| - -.. topic:: Examples: - - * :ref:`example_neighbors_plot_nearest_centroid.py`: an example of - classification using nearest centroid with different shrink thresholds. - -.. _approximate_nearest_neighbors: - -Approximate Nearest Neighbors -============================= - -There are many efficient exact nearest neighbor search algorithms for low -dimensions :math:`d` (approximately 50). However these algorithms perform poorly -with respect to space and query time when :math:`d` increases. These algorithms -are not any better than comparing query point to each point from the database in -a high dimension (see :ref:`brute_force`). This is a well-known consequence of -the phenomenon called “The Curse of Dimensionality”. - -There are certain applications where we do not need the exact nearest neighbors -but having a “good guess” would suffice. When answers do not have to be exact, -the :class:`LSHForest` class implements an approximate nearest neighbor search. -Approximate nearest neighbor search methods have been designed to try to speedup -query time with high dimensional data. These techniques are useful when the aim -is to characterize the neighborhood rather than identifying the exact neighbors -themselves (eg: k-nearest neighbors classification and regression). Some of the -most popular approximate nearest neighbor search techniques are locality -sensitive hashing, best bin fit and balanced box-decomposition tree based -search. - -Locality Sensitive Hashing Forest ---------------------------------- - -The vanilla implementation of locality sensitive hashing has a hyper-parameter -that is hard to tune in practice, therefore scikit-learn implements a variant -called :class:`LSHForest` that has more reasonable hyperparameters. -Both methods use internally random hyperplanes to index the samples into -buckets and actual cosine similarities are only computed for samples that -collide with the query hence achieving sublinear scaling. -(see :ref:`Mathematical description of Locality Sensitive -Hashing `). - -:class:`LSHForest` has two main hyper-parameters: ``n_estimators`` and -``n_candidates``. The accuracy of queries can be controlled using these -parameters as demonstrated in the following plots: - -.. figure:: ../auto_examples/neighbors/images/plot_approximate_nearest_neighbors_hyperparameters_001.png - :target: ../auto_examples/neighbors/plot_approximate_nearest_neighbors_hyperparameters.html - :align: center - :scale: 50 - -.. 
figure:: ../auto_examples/neighbors/images/plot_approximate_nearest_neighbors_hyperparameters_002.png - :target: ../auto_examples/neighbors/plot_approximate_nearest_neighbors_hyperparameters.html - :align: center - :scale: 50 - -As a rule of thumb, a user can set ``n_estimators`` to a large enough value -(e.g. between 10 and 50) and then adjust ``n_candidates`` to trade off accuracy -for query time. - -For small data sets, the brute force method for exact nearest neighbor search -can be faster than LSH Forest. However LSH Forest has a sub-linear query time -scalability with the index size. The exact break even point where LSH Forest -queries become faster than brute force depends on the dimensionality, structure -of the dataset, required level of precision, characteristics of the runtime -environment such as availability of BLAS optimizations, number of CPU cores and -size of the CPU caches. Following graphs depict scalability of LSHForest queries -with index size. - -.. figure:: ../auto_examples/neighbors/images/plot_approximate_nearest_neighbors_scalability_001.png - :target: ../auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html - :align: center - :scale: 50 - -.. figure:: ../auto_examples/neighbors/images/plot_approximate_nearest_neighbors_scalability_002.png - :target: ../auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html - :align: center - :scale: 50 - -.. figure:: ../auto_examples/neighbors/images/plot_approximate_nearest_neighbors_scalability_003.png - :target: ../auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html - :align: center - :scale: 50 - -For fixed :class:`LSHForest` parameters, the accuracy of queries tends to slowly -decrease with larger datasets. The error bars on the previous plots represent -standard deviation across different queries. - -.. topic:: Examples: - - * :ref:`example_neighbors_plot_approximate_nearest_neighbors_hyperparameters.py`: an example of - the behavior of hyperparameters of approximate nearest neighbor search using LSH Forest. - - * :ref:`example_neighbors_plot_approximate_nearest_neighbors_scalability.py`: an example of - scalability of approximate nearest neighbor search using LSH Forest. - -.. _mathematical_description_of_lsh: - -Mathematical description of Locality Sensitive Hashing ------------------------------------------------------- - -Locality sensitive hashing (LSH) techniques have been used in many areas where -nearest neighbor search is performed in high dimensions. The main concept -behind LSH is to hash each data point in the database using multiple -(often simple) hash functions to form a digest (also called a *hash*). At this -point the probability of collision - where two objects have similar digests -- is much higher for the points which are close to each other than that of the -distant points. We describe the requirements for a hash function family to be -locality sensitive as follows. - -A family :math:`H` of functions from a domain :math:`S` to a range :math:`U` -is called :math:`(r, e , p1 , p2 )`-sensitive, with :math:`r, e > 0`, -:math:`p_1 > p_2 > 0`, if for any :math:`p, q \in S`, the following conditions -hold (:math:`D` is the distance function): - -* If :math:`D(p,q) <= r` then :math:`P_H[h(p) = h(q)] >= p_1`, -* If :math:`D(p,q) > r(1 + e)` then :math:`P_H[h(p) = h(q)] <= p_2`. - -As defined, nearby points within a distance of :math:`r` to each other are -likely to collide with probability :math:`p_1`. 
In contrast, distant points -which are located with the distance more than :math:`r(1 + e)` have a small -probability of :math:`p_2` of collision. Suppose there is a family of LSH -function :math:`H`. An *LSH index* is built as follows: - -1. Choose :math:`k` functions :math:`h_1, h_2, … h_k` uniformly at - random (with replacement) from :math:`H`. For any :math:`p \in S`, place - :math:`p` in the bucket with label - :math:`g(p) = (h_1(p), h_2(p), … h_k(p))`. Observe that if - each :math:`h_i` outputs one “digit”, each bucket has a k-digit label. - -2. Independently perform step 1 :math:`l` times to construct :math:`l` - separate estimators, with hash functions :math:`g_1, g_2, … g_l`. - -The reason to concatenate hash functions in the step 1 is to decrease the -probability of the collision of distant points as much as possible. The -probability drops from :math:`p_2` to :math:`p_2^k` which is negligibly -small for large :math:`k`. The choice of :math:`k` is strongly dependent on -the data set size and structure and is therefore hard to tune in practice. -There is a side effect of having a large :math:`k`; it has the potential of -decreasing the chance of nearby points getting collided. To address this -issue, multiple estimators are constructed in step 2. - -The requirement to tune :math:`k` for a given dataset makes classical LSH -cumbersome to use in practice. The LSH Forest variant has benn designed to -alleviate this requirement by automatically adjusting the number of digits -used to hash the samples. - -LSH Forest is formulated with prefix trees with each leaf of -a tree corresponding to an actual data point in the database. There are -:math:`l` such trees which compose the forest and they are constructed using -independently drawn random sequence of hash functions from :math:`H`. In this -implementation, "Random Projections" is being used as the LSH technique which -is an approximation for the cosine distance. The length of the sequence of -hash functions is kept fixed at 32. Moreover, a prefix tree is implemented -using sorted arrays and binary search. - -There are two phases of tree traversals used in order to answer a query to find -the :math:`m` nearest neighbors of a point :math:`q`. First, a top-down -traversal is performed using a binary search to identify the leaf having the -longest prefix match (maximum depth) with :math:`q`'s label after subjecting -:math:`q` to the same hash functions. :math:`M >> m` points (total candidates) -are extracted from the forest, moving up from the previously found maximum -depth towards the root synchronously across all trees in the bottom-up -traversal. `M` is set to :math:`cl` where :math:`c`, the number of candidates -extracted from each tree, is a constant. Finally, the similarity of each of -these :math:`M` points against point :math:`q` is calculated and the top -:math:`m` points are returned as the nearest neighbors of :math:`q`. Since -most of the time in these queries is spent calculating the distances to -candidates, the speedup compared to brute force search is approximately -:math:`N/M`, where :math:`N` is the number of points in database. - -.. topic:: References: - - * `"Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in - High Dimensions" - `_, - Alexandr, A., Indyk, P., Foundations of Computer Science, 2006. FOCS - '06. 
47th Annual IEEE Symposium - - * `“LSH Forest: Self-Tuning Indexes for Similarity Search” -<<<<<<< HEAD - `_, -======= - `_, ->>>>>>> origin/master - Bawa, M., Condie, T., Ganesan, P., WWW '05 Proceedings of the 14th - international conference on World Wide Web Pages 651-660 diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst.orig b/doc/tutorial/statistical_inference/unsupervised_learning.rst.orig deleted file mode 100644 index 166efc553c135..0000000000000 --- a/doc/tutorial/statistical_inference/unsupervised_learning.rst.orig +++ /dev/null @@ -1,327 +0,0 @@ -============================================================ -Unsupervised learning: seeking representations of the data -============================================================ - -Clustering: grouping observations together -============================================ - -.. topic:: The problem solved in clustering - - Given the iris dataset, if we knew that there were 3 types of iris, but - did not have access to a taxonomist to label them: we could try a - **clustering task**: split the observations into well-separated group - called *clusters*. - -.. - >>> # Set the PRNG - >>> import numpy as np - >>> np.random.seed(1) - -K-means clustering -------------------- - -Note that there exist a lot of different clustering criteria and associated -algorithms. The simplest clustering algorithm is -:ref:`k_means`. - -.. image:: ../../auto_examples/cluster/images/plot_cluster_iris_002.png - :target: ../../auto_examples/cluster/plot_cluster_iris.html - :scale: 70 - :align: right - - -:: - - >>> from sklearn import cluster, datasets - >>> iris = datasets.load_iris() - >>> X_iris = iris.data - >>> y_iris = iris.target - - >>> k_means = cluster.KMeans(n_clusters=3) - >>> k_means.fit(X_iris) # doctest: +ELLIPSIS - KMeans(copy_x=True, init='k-means++', ... - >>> print(k_means.labels_[::10]) - [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2] - >>> print(y_iris[::10]) - [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2] - -.. |k_means_iris_bad_init| image:: ../../auto_examples/cluster/images/plot_cluster_iris_003.png - :target: ../../auto_examples/cluster/plot_cluster_iris.html - :scale: 63 - -.. |k_means_iris_8| image:: ../../auto_examples/cluster/images/plot_cluster_iris_001.png - :target: ../../auto_examples/cluster/plot_cluster_iris.html - :scale: 63 - -.. |cluster_iris_truth| image:: ../../auto_examples/cluster/images/plot_cluster_iris_004.png - :target: ../../auto_examples/cluster/plot_cluster_iris.html - :scale: 63 - -.. warning:: - - There is absolutely no guarantee of recovering a ground truth. First, - choosing the right number of clusters is hard. Second, the algorithm - is sensitive to initialization, and can fall into local minima, - although scikit-learn employs several tricks to mitigate this issue. - - .. list-table:: - :class: centered - - * - - - |k_means_iris_bad_init| - - - |k_means_iris_8| - - - |cluster_iris_truth| - - * - - - **Bad initialization** - - - **8 clusters** - - - **Ground truth** - - **Don't over-interpret clustering results** - -.. |face| image:: ../../auto_examples/cluster/images/plot_face_compress_001.png - :target: ../../auto_examples/cluster/plot_face_compress.html - :scale: 60 - -.. |face_regular| image:: ../../auto_examples/cluster/images/plot_face_compress_002.png - :target: ../../auto_examples/cluster/plot_face_compress.html - :scale: 60 - -.. |face_compressed| image:: ../../auto_examples/cluster/images/plot_face_compress_003.png - :target: ../../auto_examples/cluster/plot_face_compress.html - :scale: 60 - -.. 
|face_histogram| image:: ../../auto_examples/cluster/images/plot_face_compress_004.png - :target: ../../auto_examples/cluster/plot_face_compress.html - :scale: 60 - -.. topic:: **Application example: vector quantization** - - Clustering in general and KMeans, in particular, can be seen as a way - of choosing a small number of exemplars to compress the information. -<<<<<<< HEAD - The problem is sometimes known as - `vector quantization `_. -======= - The problem is sometimes known as - `vector quantization `_. ->>>>>>> origin/master - For instance, this can be used to posterize an image:: - - >>> import scipy as sp - >>> try: - ... face = sp.face(gray=True) - ... except AttributeError: - ... from scipy import misc - ... face = misc.face(gray=True) - >>> X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array - >>> k_means = cluster.KMeans(n_clusters=5, n_init=1) - >>> k_means.fit(X) # doctest: +ELLIPSIS - KMeans(copy_x=True, init='k-means++', ... - >>> values = k_means.cluster_centers_.squeeze() - >>> labels = k_means.labels_ - >>> face_compressed = np.choose(labels, values) - >>> face_compressed.shape = face.shape - - .. list-table:: - :class: centered - - * - - |face| - - - |face_compressed| - - - |face_regular| - - - |face_histogram| - - * - - - Raw image - - - K-means quantization - - - Equal bins - - - Image histogram - - -Hierarchical agglomerative clustering: Ward ---------------------------------------------- - -A :ref:`hierarchical_clustering` method is a type of cluster analysis -that aims to build a hierarchy of clusters. In general, the various approaches -of this technique are either: - - * **Agglomerative** - bottom-up approaches: each observation starts in its - own cluster, and clusters are iterativelly merged in such a way to - minimize a *linkage* criterion. This approach is particularly interesting - when the clusters of interest are made of only a few observations. When - the number of clusters is large, it is much more computationally efficient - than k-means. - - * **Divisive** - top-down approaches: all observations start in one - cluster, which is iteratively split as one moves down the hierarchy. - For estimating large numbers of clusters, this approach is both slow (due - to all observations starting as one cluster, which it splits recursively) - and statistically ill-posed. - -Connectivity-constrained clustering -..................................... - -With agglomerative clustering, it is possible to specify which samples can be -clustered together by giving a connectivity graph. Graphs in the scikit -are represented by their adjacency matrix. Often, a sparse matrix is used. -This can be useful, for instance, to retrieve connected regions (sometimes -also referred to as connected components) when -clustering an image: - -.. image:: ../../auto_examples/cluster/images/plot_face_ward_segmentation_001.png - :target: ../../auto_examples/cluster/plot_face_ward_segmentation.html - :scale: 40 - :align: right - -.. literalinclude:: ../../auto_examples/cluster/plot_face_ward_segmentation.py - :lines: 21-45 - -.. - >>> from sklearn.feature_extraction.image import grid_to_graph - >>> connectivity = grid_to_graph(*face.shape) - - -Feature agglomeration -...................... - -We have seen that sparsity could be used to mitigate the curse of -dimensionality, *i.e* an insufficient amount of observations compared to the -number of features. Another approach is to merge together similar -features: **feature agglomeration**. 
This approach can be implemented by -clustering in the feature direction, in other words clustering the -transposed data. - -.. image:: ../../auto_examples/cluster/images/plot_digits_agglomeration_001.png - :target: ../../auto_examples/cluster/plot_digits_agglomeration.html - :align: right - :scale: 57 - -:: - - >>> digits = datasets.load_digits() - >>> images = digits.images - >>> X = np.reshape(images, (len(images), -1)) - >>> connectivity = grid_to_graph(*images[0].shape) - - >>> agglo = cluster.FeatureAgglomeration(connectivity=connectivity, - ... n_clusters=32) - >>> agglo.fit(X) # doctest: +ELLIPSIS - FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',... - >>> X_reduced = agglo.transform(X) - - >>> X_approx = agglo.inverse_transform(X_reduced) - >>> images_approx = np.reshape(X_approx, images.shape) - -.. topic:: ``transform`` and ``inverse_transform`` methods - - Some estimators expose a ``transform`` method, for instance to reduce - the dimensionality of the dataset. - -Decompositions: from a signal to components and loadings -=========================================================== - -.. topic:: **Components and loadings** - - If X is our multivariate data, then the problem that we are trying to solve - is to rewrite it on a different observational basis: we want to learn - loadings L and a set of components C such that *X = L C*. - Different criteria exist to choose the components - -Principal component analysis: PCA ------------------------------------ - -:ref:`PCA` selects the successive components that -explain the maximum variance in the signal. - -.. |pca_3d_axis| image:: ../../auto_examples/decomposition/images/plot_pca_3d_001.png - :target: ../../auto_examples/decomposition/plot_pca_3d.html - :scale: 70 - -.. |pca_3d_aligned| image:: ../../auto_examples/decomposition/images/plot_pca_3d_002.png - :target: ../../auto_examples/decomposition/plot_pca_3d.html - :scale: 70 - -.. rst-class:: centered - - |pca_3d_axis| |pca_3d_aligned| - -The point cloud spanned by the observations above is very flat in one -direction: one of the three univariate features can almost be exactly -computed using the other two. PCA finds the directions in which the data is -not *flat* - -When used to *transform* data, PCA can reduce the dimensionality of the -data by projecting on a principal subspace. - -.. np.random.seed(0) - -:: - - >>> # Create a signal with only 2 useful dimensions - >>> x1 = np.random.normal(size=100) - >>> x2 = np.random.normal(size=100) - >>> x3 = x1 + x2 - >>> X = np.c_[x1, x2, x3] - - >>> from sklearn import decomposition - >>> pca = decomposition.PCA() - >>> pca.fit(X) - PCA(copy=True, n_components=None, whiten=False) - >>> print(pca.explained_variance_) # doctest: +SKIP - [ 2.18565811e+00 1.19346747e+00 8.43026679e-32] - - >>> # As we can see, only the 2 first components are useful - >>> pca.n_components = 2 - >>> X_reduced = pca.fit_transform(X) - >>> X_reduced.shape - (100, 2) - -.. Eigenfaces here? - -Independent Component Analysis: ICA -------------------------------------- - -:ref:`ICA` selects components so that the distribution of their loadings carries -a maximum amount of independent information. It is able to recover -**non-Gaussian** independent signals: - -.. image:: ../../auto_examples/decomposition/images/plot_ica_blind_source_separation_001.png - :target: ../../auto_examples/decomposition/plot_ica_blind_source_separation.html - :scale: 70 - :align: center - -.. 
np.random.seed(0) - -:: - - >>> # Generate sample data - >>> time = np.linspace(0, 10, 2000) - >>> s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal - >>> s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal - >>> S = np.c_[s1, s2] - >>> S += 0.2 * np.random.normal(size=S.shape) # Add noise - >>> S /= S.std(axis=0) # Standardize data - >>> # Mix data - >>> A = np.array([[1, 1], [0.5, 2]]) # Mixing matrix - >>> X = np.dot(S, A.T) # Generate observations - - >>> # Compute ICA - >>> ica = decomposition.FastICA() - >>> S_ = ica.fit_transform(X) # Get the estimated sources - >>> A_ = ica.mixing_.T - >>> np.allclose(X, np.dot(S_, A_) + ica.mean_) - True diff --git a/sklearn/metrics/classification.py.orig b/sklearn/metrics/classification.py.orig deleted file mode 100644 index 3b40c5f566437..0000000000000 --- a/sklearn/metrics/classification.py.orig +++ /dev/null @@ -1,1827 +0,0 @@ -"""Metrics to assess performance on classification task given class prediction - -Functions named as ``*_score`` return a scalar value to maximize: the higher -the better - -Function named as ``*_error`` or ``*_loss`` return a scalar value to minimize: -the lower the better -""" - -# Authors: Alexandre Gramfort -# Mathieu Blondel -# Olivier Grisel -# Arnaud Joly -# Jochen Wersdorfer -# Lars Buitinck -# Joel Nothman -# Noel Dawe -# Jatin Shah -# Saurabh Jha -# Bernardo Stein -# License: BSD 3 clause - -from __future__ import division - -import warnings -import numpy as np - -from scipy.sparse import coo_matrix -from scipy.sparse import csr_matrix - -from ..preprocessing import LabelBinarizer, label_binarize -from ..preprocessing import LabelEncoder -from ..utils import check_array -from ..utils import check_consistent_length -from ..utils import column_or_1d -from ..utils.multiclass import unique_labels -from ..utils.multiclass import type_of_target -from ..utils.validation import _num_samples -from ..utils.sparsefuncs import count_nonzero -from ..utils.fixes import bincount -from ..exceptions import UndefinedMetricWarning - - -def _check_targets(y_true, y_pred): - """Check that y_true and y_pred belong to the same classification task - - This converts multiclass or binary types to a common shape, and raises a - ValueError for a mix of multilabel and multiclass targets, a mix of - multilabel formats, for the presence of continuous-valued or multioutput - targets, or for targets of different lengths. - - Column vectors are squeezed to 1d, while multilabel formats are returned - as CSR sparse label indicators. 
- - Parameters - ---------- - y_true : array-like - - y_pred : array-like - - Returns - ------- - type_true : one of {'multilabel-indicator', 'multiclass', 'binary'} - The type of the true target data, as output by - ``utils.multiclass.type_of_target`` - - y_true : array or indicator matrix - - y_pred : array or indicator matrix - """ - check_consistent_length(y_true, y_pred) - type_true = type_of_target(y_true) - type_pred = type_of_target(y_pred) - - y_type = set([type_true, type_pred]) - if y_type == set(["binary", "multiclass"]): - y_type = set(["multiclass"]) - - if len(y_type) > 1: - raise ValueError("Can't handle mix of {0} and {1}" - "".format(type_true, type_pred)) - - # We can't have more than one value on y_type => The set is no more needed - y_type = y_type.pop() - - # No metrics support "multiclass-multioutput" format - if (y_type not in ["binary", "multiclass", "multilabel-indicator"]): - raise ValueError("{0} is not supported".format(y_type)) - - if y_type in ["binary", "multiclass"]: - y_true = column_or_1d(y_true) - y_pred = column_or_1d(y_pred) - - if y_type.startswith('multilabel'): - y_true = csr_matrix(y_true) - y_pred = csr_matrix(y_pred) - y_type = 'multilabel-indicator' - - return y_type, y_true, y_pred - - -def _weighted_sum(sample_score, sample_weight, normalize=False): - if normalize: - return np.average(sample_score, weights=sample_weight) - elif sample_weight is not None: - return np.dot(sample_score, sample_weight) - else: - return sample_score.sum() - - -def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None): - """Accuracy classification score. - - In multilabel classification, this function computes subset accuracy: - the set of labels predicted for a sample must *exactly* match the - corresponding set of labels in y_true. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) labels. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Predicted labels, as returned by a classifier. - - normalize : bool, optional (default=True) - If ``False``, return the number of correctly classified samples. - Otherwise, return the fraction of correctly classified samples. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - score : float - If ``normalize == True``, return the correctly classified samples - (float), else it returns the number of correctly classified samples - (int). - - The best performance is 1 with ``normalize == True`` and the number - of samples with ``normalize == False``. - - See also - -------- - jaccard_similarity_score, hamming_loss, zero_one_loss - - Notes - ----- - In binary and multiclass classification, this function is equal - to the ``jaccard_similarity_score`` function. 
- - Examples - -------- - >>> import numpy as np - >>> from sklearn.metrics import accuracy_score - >>> y_pred = [0, 2, 1, 3] - >>> y_true = [0, 1, 2, 3] - >>> accuracy_score(y_true, y_pred) - 0.5 - >>> accuracy_score(y_true, y_pred, normalize=False) - 2 - - In the multilabel case with binary label indicators: - >>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2))) - 0.5 - """ - - # Compute accuracy for each possible representation - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - if y_type.startswith('multilabel'): - differing_labels = count_nonzero(y_true - y_pred, axis=1) - score = differing_labels == 0 - else: - score = y_true == y_pred - - return _weighted_sum(score, sample_weight, normalize) - - -def confusion_matrix(y_true, y_pred, labels=None, sample_weight=None): - """Compute confusion matrix to evaluate the accuracy of a classification - - By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}` - is equal to the number of observations known to be in group :math:`i` but - predicted to be in group :math:`j`. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : array, shape = [n_samples] - Ground truth (correct) target values. - - y_pred : array, shape = [n_samples] - Estimated targets as returned by a classifier. - - labels : array, shape = [n_classes], optional - List of labels to index the matrix. This may be used to reorder - or select a subset of labels. - If none is given, those that appear at least once - in ``y_true`` or ``y_pred`` are used in sorted order. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - C : array, shape = [n_classes, n_classes] - Confusion matrix - - References - ---------- - .. [1] `Wikipedia entry for the Confusion matrix - `_ - - Examples - -------- - >>> from sklearn.metrics import confusion_matrix - >>> y_true = [2, 0, 2, 2, 0, 1] - >>> y_pred = [0, 0, 2, 2, 0, 2] - >>> confusion_matrix(y_true, y_pred) - array([[2, 0, 0], - [0, 0, 1], - [1, 0, 2]]) - - >>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"] - >>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"] - >>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]) - array([[2, 0, 0], - [0, 0, 1], - [1, 0, 2]]) - - """ - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - if y_type not in ("binary", "multiclass"): - raise ValueError("%s is not supported" % y_type) - - if labels is None: - labels = unique_labels(y_true, y_pred) - else: - labels = np.asarray(labels) - - if sample_weight is None: - sample_weight = np.ones(y_true.shape[0], dtype=np.int) - else: - sample_weight = np.asarray(sample_weight) - - check_consistent_length(sample_weight, y_true, y_pred) - - n_labels = labels.size - label_to_ind = dict((y, x) for x, y in enumerate(labels)) - # convert yt, yp into index - y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred]) - y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true]) - - # intersect y_pred, y_true with labels, eliminate items not in labels - ind = np.logical_and(y_pred < n_labels, y_true < n_labels) - y_pred = y_pred[ind] - y_true = y_true[ind] - # also eliminate weights of eliminated items - sample_weight = sample_weight[ind] - - CM = coo_matrix((sample_weight, (y_true, y_pred)), - shape=(n_labels, n_labels) - ).toarray() - - return CM - - -def cohen_kappa_score(y1, y2, labels=None): - """Cohen's kappa: a statistic that measures inter-annotator agreement. 
- - This function computes Cohen's kappa [1], a score that expresses the level - of agreement between two annotators on a classification problem. It is - defined as - - .. math:: - \kappa = (p_o - p_e) / (1 - p_e) - - where :math:`p_o` is the empirical probability of agreement on the label - assigned to any sample (the observed agreement ratio), and :math:`p_e` is - the expected agreement when both annotators assign labels randomly. - :math:`p_e` is estimated using a per-annotator empirical prior over the - class labels [2]. - - Parameters - ---------- - y1 : array, shape = [n_samples] - Labels assigned by the first annotator. - - y2 : array, shape = [n_samples] - Labels assigned by the second annotator. The kappa statistic is - symmetric, so swapping ``y1`` and ``y2`` doesn't change the value. - - labels : array, shape = [n_classes], optional - List of labels to index the matrix. This may be used to select a - subset of labels. If None, all labels that appear at least once in - ``y1`` or ``y2`` are used. - - Returns - ------- - kappa : float - The kappa statistic, which is a number between -1 and 1. The maximum - value means complete agreement; zero or lower means chance agreement. - - References - ---------- - .. [1] J. Cohen (1960). "A coefficient of agreement for nominal scales". - Educational and Psychological Measurement 20(1):37-46. - doi:10.1177/001316446002000104. - .. [2] R. Artstein and M. Poesio (2008). "Inter-coder agreement for - computational linguistics". Computational Linguistic 34(4):555-596. - """ - confusion = confusion_matrix(y1, y2, labels=labels) - P = confusion / float(confusion.sum()) - p_observed = np.trace(P) - p_expected = np.dot(P.sum(axis=0), P.sum(axis=1)) - return (p_observed - p_expected) / (1 - p_expected) - - -def jaccard_similarity_score(y_true, y_pred, normalize=True, - sample_weight=None): - """Jaccard similarity coefficient score - - The Jaccard index [1], or Jaccard similarity coefficient, defined as - the size of the intersection divided by the size of the union of two label - sets, is used to compare set of predicted labels for a sample to the - corresponding set of labels in ``y_true``. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) labels. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Predicted labels, as returned by a classifier. - - normalize : bool, optional (default=True) - If ``False``, return the sum of the Jaccard similarity coefficient - over the sample set. Otherwise, return the average of Jaccard - similarity coefficient. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - score : float - If ``normalize == True``, return the average Jaccard similarity - coefficient, else it returns the sum of the Jaccard similarity - coefficient over the sample set. - - The best performance is 1 with ``normalize == True`` and the number - of samples with ``normalize == False``. - - See also - -------- - accuracy_score, hamming_loss, zero_one_loss - - Notes - ----- - In binary and multiclass classification, this function is equivalent - to the ``accuracy_score``. It differs in the multilabel classification - problem. - - References - ---------- - .. 
[1] `Wikipedia entry for the Jaccard index - `_ - - - Examples - -------- - >>> import numpy as np - >>> from sklearn.metrics import jaccard_similarity_score - >>> y_pred = [0, 2, 1, 3] - >>> y_true = [0, 1, 2, 3] - >>> jaccard_similarity_score(y_true, y_pred) - 0.5 - >>> jaccard_similarity_score(y_true, y_pred, normalize=False) - 2 - - In the multilabel case with binary label indicators: - - >>> jaccard_similarity_score(np.array([[0, 1], [1, 1]]),\ - np.ones((2, 2))) - 0.75 - """ - - # Compute accuracy for each possible representation - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - if y_type.startswith('multilabel'): - with np.errstate(divide='ignore', invalid='ignore'): - # oddly, we may get an "invalid" rather than a "divide" error here - pred_or_true = count_nonzero(y_true + y_pred, axis=1) - pred_and_true = count_nonzero(y_true.multiply(y_pred), axis=1) - score = pred_and_true / pred_or_true - - # If there is no label, it results in a Nan instead, we set - # the jaccard to 1: lim_{x->0} x/x = 1 - # Note with py2.6 and np 1.3: we can't check safely for nan. - score[pred_or_true == 0.0] = 1.0 - else: - score = y_true == y_pred - - return _weighted_sum(score, sample_weight, normalize) - - -def matthews_corrcoef(y_true, y_pred, sample_weight=None): - """Compute the Matthews correlation coefficient (MCC) for binary classes - - The Matthews correlation coefficient is used in machine learning as a - measure of the quality of binary (two-class) classifications. It takes into - account true and false positives and negatives and is generally regarded as - a balanced measure which can be used even if the classes are of very - different sizes. The MCC is in essence a correlation coefficient value - between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 - an average random prediction and -1 an inverse prediction. The statistic - is also known as the phi coefficient. [source: Wikipedia] - - Only in the binary case does this relate to information about true and - false positives and negatives. See references below. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : array, shape = [n_samples] - Ground truth (correct) target values. - - y_pred : array, shape = [n_samples] - Estimated targets as returned by a classifier. - - sample_weight : array-like of shape = [n_samples], default None - Sample weights. - - Returns - ------- - mcc : float - The Matthews correlation coefficient (+1 represents a perfect - prediction, 0 an average random prediction and -1 and inverse - prediction). - - References - ---------- - .. [1] `Baldi, Brunak, Chauvin, Andersen and Nielsen, (2000). Assessing the - accuracy of prediction algorithms for classification: an overview - `_ - - .. [2] `Wikipedia entry for the Matthews Correlation Coefficient - `_ - - Examples - -------- - >>> from sklearn.metrics import matthews_corrcoef - >>> y_true = [+1, +1, +1, -1] - >>> y_pred = [+1, -1, +1, +1] - >>> matthews_corrcoef(y_true, y_pred) # doctest: +ELLIPSIS - -0.33... 
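As a sanity check on the example above (not an additional doctest), the binary MCC coincides with the Pearson correlation of the two label vectors, which is what the centred-covariance implementation below computes::

    import numpy as np
    from sklearn.metrics import matthews_corrcoef

    y_true = [+1, +1, +1, -1]
    y_pred = [+1, -1, +1, +1]

    # For binary labels, MCC is the Pearson correlation of the two vectors.
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    assert np.isclose(matthews_corrcoef(y_true, y_pred), pearson)  # both ~ -0.33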
- - """ - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - - if y_type != "binary": - raise ValueError("%s is not supported" % y_type) - - lb = LabelEncoder() - lb.fit(np.hstack([y_true, y_pred])) - y_true = lb.transform(y_true) - y_pred = lb.transform(y_pred) - mean_yt = np.average(y_true, weights=sample_weight) - mean_yp = np.average(y_pred, weights=sample_weight) - - y_true_u_cent = y_true - mean_yt - y_pred_u_cent = y_pred - mean_yp - - cov_ytyp = np.average(y_true_u_cent * y_pred_u_cent, weights=sample_weight) - var_yt = np.average(y_true_u_cent ** 2, weights=sample_weight) - var_yp = np.average(y_pred_u_cent ** 2, weights=sample_weight) - - mcc = cov_ytyp / np.sqrt(var_yt * var_yp) - - if np.isnan(mcc): - return 0. - else: - return mcc - - -def zero_one_loss(y_true, y_pred, normalize=True, sample_weight=None): - """Zero-one classification loss. - - If normalize is ``True``, return the fraction of misclassifications - (float), else it returns the number of misclassifications (int). The best - performance is 0. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) labels. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Predicted labels, as returned by a classifier. - - normalize : bool, optional (default=True) - If ``False``, return the number of misclassifications. - Otherwise, return the fraction of misclassifications. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - loss : float or int, - If ``normalize == True``, return the fraction of misclassifications - (float), else it returns the number of misclassifications (int). - - Notes - ----- - In multilabel classification, the zero_one_loss function corresponds to - the subset zero-one loss: for each sample, the entire set of labels must be - correctly predicted, otherwise the loss for that sample is equal to one. - - See also - -------- - accuracy_score, hamming_loss, jaccard_similarity_score - - Examples - -------- - >>> from sklearn.metrics import zero_one_loss - >>> y_pred = [1, 2, 3, 4] - >>> y_true = [2, 2, 3, 4] - >>> zero_one_loss(y_true, y_pred) - 0.25 - >>> zero_one_loss(y_true, y_pred, normalize=False) - 1 - - In the multilabel case with binary label indicators: - - >>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2))) - 0.5 - """ - score = accuracy_score(y_true, y_pred, - normalize=normalize, - sample_weight=sample_weight) - - if normalize: - return 1 - score - else: - if sample_weight is not None: - n_samples = np.sum(sample_weight) - else: - n_samples = _num_samples(y_true) - return n_samples - score - - -def f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', - sample_weight=None): - """Compute the F1 score, also known as balanced F-score or F-measure - - The F1 score can be interpreted as a weighted average of the precision and - recall, where an F1 score reaches its best value at 1 and worst score at 0. - The relative contribution of precision and recall to the F1 score are - equal. The formula for the F1 score is:: - - F1 = 2 * (precision * recall) / (precision + recall) - - In the multi-class and multi-label case, this is the weighted average of - the F1 score of each class. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. 
- - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - labels : list, optional - The set of labels to include when ``average != 'binary'``, and their - order if ``average is None``. Labels present in the data can be - excluded, for example to calculate a multiclass average ignoring a - majority negative class, while labels not present in the data will - result in 0 components in a macro average. For multilabel targets, - labels are column indices. By default, all labels in ``y_true`` and - ``y_pred`` are used in sorted order. - - .. versionchanged:: 0.17 - parameter *labels* improved for multiclass problem. - - pos_label : str or int, 1 by default - The class to report if ``average='binary'``. Until version 0.18 it is - necessary to set ``pos_label=None`` if seeking to use another averaging - method over binary targets. - - average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \ - 'weighted'] - This parameter is required for multiclass/multilabel targets. - If ``None``, the scores for each class are returned. Otherwise, this - determines the type of averaging performed on the data: - - ``'binary'``: - Only report results for the class specified by ``pos_label``. - This is applicable only if targets (``y_{true,pred}``) are binary. - ``'micro'``: - Calculate metrics globally by counting the total true positives, - false negatives and false positives. - ``'macro'``: - Calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - ``'weighted'``: - Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This - alters 'macro' to account for label imbalance; it can result in an - F-score that is not between precision and recall. - ``'samples'``: - Calculate metrics for each instance, and find their average (only - meaningful for multilabel classification where this differs from - :func:`accuracy_score`). - - Note that if ``pos_label`` is given in binary classification with - `average != 'binary'`, only that positive class is reported. This - behavior is deprecated and will change in version 0.18. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - f1_score : float or array of float, shape = [n_unique_labels] - F1 score of the positive class in binary classification or weighted - average of the F1 scores of each class for the multiclass task. - - References - ---------- - .. [1] `Wikipedia entry for the F1-score - <https://en.wikipedia.org/wiki/F1_score>`_ - - Examples - -------- - >>> from sklearn.metrics import f1_score - >>> y_true = [0, 1, 2, 0, 1, 2] - >>> y_pred = [0, 2, 1, 0, 0, 1] - >>> f1_score(y_true, y_pred, average='macro') # doctest: +ELLIPSIS - 0.26... - >>> f1_score(y_true, y_pred, average='micro') # doctest: +ELLIPSIS - 0.33... - >>> f1_score(y_true, y_pred, average='weighted') # doctest: +ELLIPSIS - 0.26... - >>> f1_score(y_true, y_pred, average=None) - array([ 0.8, 0. , 0.
]) - - - """ - return fbeta_score(y_true, y_pred, 1, labels=labels, - pos_label=pos_label, average=average, - sample_weight=sample_weight) - - -def fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1, - average='binary', sample_weight=None): - """Compute the F-beta score - - The F-beta score is the weighted harmonic mean of precision and recall, - reaching its optimal value at 1 and its worst value at 0. - - The `beta` parameter determines the weight of precision in the combined - score. ``beta < 1`` lends more weight to precision, while ``beta > 1`` - favors recall (``beta -> 0`` considers only precision, ``beta -> inf`` - only recall). - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - beta: float - Weight of precision in harmonic mean. - - labels : list, optional - The set of labels to include when ``average != 'binary'``, and their - order if ``average is None``. Labels present in the data can be - excluded, for example to calculate a multiclass average ignoring a - majority negative class, while labels not present in the data will - result in 0 components in a macro average. For multilabel targets, - labels are column indices. By default, all labels in ``y_true`` and - ``y_pred`` are used in sorted order. - - .. versionchanged:: 0.17 - parameter *labels* improved for multiclass problem. - - pos_label : str or int, 1 by default - The class to report if ``average='binary'``. Until version 0.18 it is - necessary to set ``pos_label=None`` if seeking to use another averaging - method over binary targets. - - average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \ - 'weighted'] - This parameter is required for multiclass/multilabel targets. - If ``None``, the scores for each class are returned. Otherwise, this - determines the type of averaging performed on the data: - - ``'binary'``: - Only report results for the class specified by ``pos_label``. - This is applicable only if targets (``y_{true,pred}``) are binary. - ``'micro'``: - Calculate metrics globally by counting the total true positives, - false negatives and false positives. - ``'macro'``: - Calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - ``'weighted'``: - Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This - alters 'macro' to account for label imbalance; it can result in an - F-score that is not between precision and recall. - ``'samples'``: - Calculate metrics for each instance, and find their average (only - meaningful for multilabel classification where this differs from - :func:`accuracy_score`). - - Note that if ``pos_label`` is given in binary classification with - `average != 'binary'`, only that positive class is reported. This - behavior is deprecated and will change in version 0.18. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - fbeta_score : float (if average is not None) or array of float, shape =\ - [n_unique_labels] - F-beta score of the positive class in binary classification or weighted - average of the F-beta score of each class for the multiclass task. - - References - ---------- - .. [1] R. Baeza-Yates and B. Ribeiro-Neto (2011). 
- Modern Information Retrieval. Addison Wesley, pp. 327-328. - - .. [2] `Wikipedia entry for the F1-score - `_ - - Examples - -------- - >>> from sklearn.metrics import fbeta_score - >>> y_true = [0, 1, 2, 0, 1, 2] - >>> y_pred = [0, 2, 1, 0, 0, 1] - >>> fbeta_score(y_true, y_pred, average='macro', beta=0.5) - ... # doctest: +ELLIPSIS - 0.23... - >>> fbeta_score(y_true, y_pred, average='micro', beta=0.5) - ... # doctest: +ELLIPSIS - 0.33... - >>> fbeta_score(y_true, y_pred, average='weighted', beta=0.5) - ... # doctest: +ELLIPSIS - 0.23... - >>> fbeta_score(y_true, y_pred, average=None, beta=0.5) - ... # doctest: +ELLIPSIS - array([ 0.71..., 0. , 0. ]) - - """ - _, _, f, _ = precision_recall_fscore_support(y_true, y_pred, - beta=beta, - labels=labels, - pos_label=pos_label, - average=average, - warn_for=('f-score',), - sample_weight=sample_weight) - return f - - -def _prf_divide(numerator, denominator, metric, modifier, average, warn_for): - """Performs division and handles divide-by-zero. - - On zero-division, sets the corresponding result elements to zero - and raises a warning. - - The metric, modifier and average arguments are used only for determining - an appropriate warning. - """ - result = numerator / denominator - mask = denominator == 0.0 - if not np.any(mask): - return result - - # remove infs - result[mask] = 0.0 - - # build appropriate warning - # E.g. "Precision and F-score are ill-defined and being set to 0.0 in - # labels with no predicted samples" - axis0 = 'sample' - axis1 = 'label' - if average == 'samples': - axis0, axis1 = axis1, axis0 - - if metric in warn_for and 'f-score' in warn_for: - msg_start = '{0} and F-score are'.format(metric.title()) - elif metric in warn_for: - msg_start = '{0} is'.format(metric.title()) - elif 'f-score' in warn_for: - msg_start = 'F-score is' - else: - return result - - msg = ('{0} ill-defined and being set to 0.0 {{0}} ' - 'no {1} {2}s.'.format(msg_start, modifier, axis0)) - if len(mask) == 1: - msg = msg.format('due to') - else: - msg = msg.format('in {0}s with'.format(axis1)) - warnings.warn(msg, UndefinedMetricWarning, stacklevel=2) - return result - - -def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, - pos_label=1, average=None, - warn_for=('precision', 'recall', - 'f-score'), - sample_weight=None): - """Compute precision, recall, F-measure and support for each class - - The precision is the ratio ``tp / (tp + fp)`` where ``tp`` is the number of - true positives and ``fp`` the number of false positives. The precision is - intuitively the ability of the classifier not to label as positive a sample - that is negative. - - The recall is the ratio ``tp / (tp + fn)`` where ``tp`` is the number of - true positives and ``fn`` the number of false negatives. The recall is - intuitively the ability of the classifier to find all the positive samples. - - The F-beta score can be interpreted as a weighted harmonic mean of - the precision and recall, where an F-beta score reaches its best - value at 1 and worst score at 0. - - The F-beta score weights recall more than precision by a factor of - ``beta``. ``beta == 1.0`` means recall and precision are equally important. - - The support is the number of occurrences of each class in ``y_true``. - - If ``pos_label is None`` and in binary classification, this function - returns the average precision, recall and F-measure if ``average`` - is one of ``'micro'``, ``'macro'``, ``'weighted'`` or ``'samples'``. - - Read more in the :ref:`User Guide `. 
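A minimal sketch of the harmonic-mean relationship described above: the per-class F-beta values can be recomputed from per-class precision and recall (the toy labels are the ones used in the docstring examples)::

    import numpy as np
    from sklearn.metrics import fbeta_score, precision_score, recall_score

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]
    beta = 0.5

    p = precision_score(y_true, y_pred, average=None)
    r = recall_score(y_true, y_pred, average=None)
    with np.errstate(divide='ignore', invalid='ignore'):
        f_by_hand = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    f_by_hand = np.nan_to_num(f_by_hand)   # 0/0 -> 0 for classes with no true positives

    assert np.allclose(f_by_hand, fbeta_score(y_true, y_pred, beta=beta, average=None))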
- - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - beta : float, 1.0 by default - The strength of recall versus precision in the F-score. - - labels : list, optional - The set of labels to include when ``average != 'binary'``, and their - order if ``average is None``. Labels present in the data can be - excluded, for example to calculate a multiclass average ignoring a - majority negative class, while labels not present in the data will - result in 0 components in a macro average. For multilabel targets, - labels are column indices. By default, all labels in ``y_true`` and - ``y_pred`` are used in sorted order. - - pos_label : str or int, 1 by default - The class to report if ``average='binary'``. Until version 0.18 it is - necessary to set ``pos_label=None`` if seeking to use another averaging - method over binary targets. - - average : string, [None (default), 'binary', 'micro', 'macro', 'samples', \ - 'weighted'] - If ``None``, the scores for each class are returned. Otherwise, this - determines the type of averaging performed on the data: - - ``'binary'``: - Only report results for the class specified by ``pos_label``. - This is applicable only if targets (``y_{true,pred}``) are binary. - ``'micro'``: - Calculate metrics globally by counting the total true positives, - false negatives and false positives. - ``'macro'``: - Calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - ``'weighted'``: - Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This - alters 'macro' to account for label imbalance; it can result in an - F-score that is not between precision and recall. - ``'samples'``: - Calculate metrics for each instance, and find their average (only - meaningful for multilabel classification where this differs from - :func:`accuracy_score`). - - Note that if ``pos_label`` is given in binary classification with - `average != 'binary'`, only that positive class is reported. This - behavior is deprecated and will change in version 0.18. - - warn_for : tuple or set, for internal use - This determines which warnings will be made in the case that this - function is being used to return only one of its metrics. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - precision: float (if average is not None) or array of float, shape =\ - [n_unique_labels] - - recall: float (if average is not None) or array of float, , shape =\ - [n_unique_labels] - - fbeta_score: float (if average is not None) or array of float, shape =\ - [n_unique_labels] - - support: int (if average is not None) or array of int, shape =\ - [n_unique_labels] - The number of occurrences of each label in ``y_true``. - - References - ---------- - .. [1] `Wikipedia entry for the Precision and recall - `_ - - .. [2] `Wikipedia entry for the F1-score - `_ - - .. [3] `Discriminative Methods for Multi-labeled Classification Advances - in Knowledge Discovery and Data Mining (2004), pp. 
22-30 by Shantanu - Godbole, Sunita Sarawagi - ` - - Examples - -------- - >>> from sklearn.metrics import precision_recall_fscore_support - >>> y_true = np.array(['cat', 'dog', 'pig', 'cat', 'dog', 'pig']) - >>> y_pred = np.array(['cat', 'pig', 'dog', 'cat', 'cat', 'dog']) - >>> precision_recall_fscore_support(y_true, y_pred, average='macro') - ... # doctest: +ELLIPSIS - (0.22..., 0.33..., 0.26..., None) - >>> precision_recall_fscore_support(y_true, y_pred, average='micro') - ... # doctest: +ELLIPSIS - (0.33..., 0.33..., 0.33..., None) - >>> precision_recall_fscore_support(y_true, y_pred, average='weighted') - ... # doctest: +ELLIPSIS - (0.22..., 0.33..., 0.26..., None) - - It is possible to compute per-label precisions, recalls, F1-scores and - supports instead of averaging: - >>> precision_recall_fscore_support(y_true, y_pred, average=None, - ... labels=['pig', 'dog', 'cat']) - ... # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE - (array([ 0. , 0. , 0.66...]), - array([ 0., 0., 1.]), - array([ 0. , 0. , 0.8]), - array([2, 2, 2])) - - """ - average_options = (None, 'micro', 'macro', 'weighted', 'samples') - if average not in average_options and average != 'binary': - raise ValueError('average has to be one of ' + - str(average_options)) - if beta <= 0: - raise ValueError("beta should be >0 in the F-beta score") - - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - present_labels = unique_labels(y_true, y_pred) - - if average == 'binary' and (y_type != 'binary' or pos_label is None): - warnings.warn('The default `weighted` averaging is deprecated, ' - 'and from version 0.18, use of precision, recall or ' - 'F-score with multiclass or multilabel data or ' - 'pos_label=None will result in an exception. ' - 'Please set an explicit value for `average`, one of ' - '%s. In cross validation use, for instance, ' - 'scoring="f1_weighted" instead of scoring="f1".' - % str(average_options), DeprecationWarning, stacklevel=2) - average = 'weighted' - - if y_type == 'binary' and pos_label is not None and average is not None: - if average != 'binary': - warnings.warn('From version 0.18, binary input will not be ' - 'handled specially when using averaged ' - 'precision/recall/F-score. ' - 'Please use average=\'binary\' to report only the ' - 'positive class performance.', DeprecationWarning) - if labels is None or len(labels) <= 2: - if pos_label not in present_labels: - if len(present_labels) < 2: - # Only negative labels - return (0., 0., 0., 0) - else: - raise ValueError("pos_label=%r is not a valid label: %r" % - (pos_label, present_labels)) - labels = [pos_label] - if labels is None: - labels = present_labels - n_labels = None - else: - n_labels = len(labels) - labels = np.hstack([labels, np.setdiff1d(present_labels, labels, - assume_unique=True)]) - - # Calculate tp_sum, pred_sum, true_sum ### - - if y_type.startswith('multilabel'): - sum_axis = 1 if average == 'samples' else 0 - - # All labels are index integers for multilabel. - # Select labels: - if not np.all(labels == present_labels): - if np.max(labels) > np.max(present_labels): - raise ValueError('All labels must be in [0, n labels). ' - 'Got %d > %d' % - (np.max(labels), np.max(present_labels))) - if np.min(labels) < 0: - raise ValueError('All labels must be in [0, n labels). 
' - 'Got %d < 0' % np.min(labels)) - - y_true = y_true[:, labels[:n_labels]] - y_pred = y_pred[:, labels[:n_labels]] - - # calculate weighted counts - true_and_pred = y_true.multiply(y_pred) - tp_sum = count_nonzero(true_and_pred, axis=sum_axis, - sample_weight=sample_weight) - pred_sum = count_nonzero(y_pred, axis=sum_axis, - sample_weight=sample_weight) - true_sum = count_nonzero(y_true, axis=sum_axis, - sample_weight=sample_weight) - - elif average == 'samples': - raise ValueError("Sample-based precision, recall, fscore is " - "not meaningful outside multilabel " - "classification. See the accuracy_score instead.") - else: - le = LabelEncoder() - le.fit(labels) - y_true = le.transform(y_true) - y_pred = le.transform(y_pred) - sorted_labels = le.classes_ - - # labels are now from 0 to len(labels) - 1 -> use bincount - tp = y_true == y_pred - tp_bins = y_true[tp] - if sample_weight is not None: - tp_bins_weights = np.asarray(sample_weight)[tp] - else: - tp_bins_weights = None - - if len(tp_bins): - tp_sum = bincount(tp_bins, weights=tp_bins_weights, - minlength=len(labels)) - else: - # Pathological case - true_sum = pred_sum = tp_sum = np.zeros(len(labels)) - if len(y_pred): - pred_sum = bincount(y_pred, weights=sample_weight, - minlength=len(labels)) - if len(y_true): - true_sum = bincount(y_true, weights=sample_weight, - minlength=len(labels)) - - # Retain only selected labels - indices = np.searchsorted(sorted_labels, labels[:n_labels]) - tp_sum = tp_sum[indices] - true_sum = true_sum[indices] - pred_sum = pred_sum[indices] - - if average == 'micro': - tp_sum = np.array([tp_sum.sum()]) - pred_sum = np.array([pred_sum.sum()]) - true_sum = np.array([true_sum.sum()]) - - # Finally, we have all our sufficient statistics. Divide! # - - beta2 = beta ** 2 - with np.errstate(divide='ignore', invalid='ignore'): - # Divide, and on zero-division, set scores to 0 and warn: - - # Oddly, we may get an "invalid" rather than a "divide" error - # here. - precision = _prf_divide(tp_sum, pred_sum, - 'precision', 'predicted', average, warn_for) - recall = _prf_divide(tp_sum, true_sum, - 'recall', 'true', average, warn_for) - # Don't need to warn for F: either P or R warned, or tp == 0 where pos - # and true are nonzero, in which case, F is well-defined and zero - f_score = ((1 + beta2) * precision * recall / - (beta2 * precision + recall)) - f_score[tp_sum == 0] = 0.0 - - # Average the results - - if average == 'weighted': - weights = true_sum - if weights.sum() == 0: - return 0, 0, 0, None - elif average == 'samples': - weights = sample_weight - else: - weights = None - - if average is not None: - assert average != 'binary' or len(precision) == 1 - precision = np.average(precision, weights=weights) - recall = np.average(recall, weights=weights) - f_score = np.average(f_score, weights=weights) - true_sum = None # return no support - - return precision, recall, f_score, true_sum - - -def precision_score(y_true, y_pred, labels=None, pos_label=1, - average='binary', sample_weight=None): - """Compute the precision - - The precision is the ratio ``tp / (tp + fp)`` where ``tp`` is the number of - true positives and ``fp`` the number of false positives. The precision is - intuitively the ability of the classifier not to label as positive a sample - that is negative. - - The best value is 1 and the worst value is 0. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. 
- - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - labels : list, optional - The set of labels to include when ``average != 'binary'``, and their - order if ``average is None``. Labels present in the data can be - excluded, for example to calculate a multiclass average ignoring a - majority negative class, while labels not present in the data will - result in 0 components in a macro average. For multilabel targets, - labels are column indices. By default, all labels in ``y_true`` and - ``y_pred`` are used in sorted order. - - .. versionchanged:: 0.17 - parameter *labels* improved for multiclass problem. - - pos_label : str or int, 1 by default - The class to report if ``average='binary'``. Until version 0.18 it is - necessary to set ``pos_label=None`` if seeking to use another averaging - method over binary targets. - - average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \ - 'weighted'] - This parameter is required for multiclass/multilabel targets. - If ``None``, the scores for each class are returned. Otherwise, this - determines the type of averaging performed on the data: - - ``'binary'``: - Only report results for the class specified by ``pos_label``. - This is applicable only if targets (``y_{true,pred}``) are binary. - ``'micro'``: - Calculate metrics globally by counting the total true positives, - false negatives and false positives. - ``'macro'``: - Calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - ``'weighted'``: - Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This - alters 'macro' to account for label imbalance; it can result in an - F-score that is not between precision and recall. - ``'samples'``: - Calculate metrics for each instance, and find their average (only - meaningful for multilabel classification where this differs from - :func:`accuracy_score`). - - Note that if ``pos_label`` is given in binary classification with - `average != 'binary'`, only that positive class is reported. This - behavior is deprecated and will change in version 0.18. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - precision : float (if average is not None) or array of float, shape =\ - [n_unique_labels] - Precision of the positive class in binary classification or weighted - average of the precision of each class for the multiclass task. - - Examples - -------- - - >>> from sklearn.metrics import precision_score - >>> y_true = [0, 1, 2, 0, 1, 2] - >>> y_pred = [0, 2, 1, 0, 0, 1] - >>> precision_score(y_true, y_pred, average='macro') # doctest: +ELLIPSIS - 0.22... - >>> precision_score(y_true, y_pred, average='micro') # doctest: +ELLIPSIS - 0.33... - >>> precision_score(y_true, y_pred, average='weighted') - ... # doctest: +ELLIPSIS - 0.22... - >>> precision_score(y_true, y_pred, average=None) # doctest: +ELLIPSIS - array([ 0.66..., 0. , 0. ]) - - """ - p, _, _, _ = precision_recall_fscore_support(y_true, y_pred, - labels=labels, - pos_label=pos_label, - average=average, - warn_for=('precision',), - sample_weight=sample_weight) - return p - - -def recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', - sample_weight=None): - """Compute the recall - - The recall is the ratio ``tp / (tp + fn)`` where ``tp`` is the number of - true positives and ``fn`` the number of false negatives. 
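A small illustration of the ``labels`` parameter documented above, with made-up targets in which class 0 is a majority class that the macro average can be told to ignore::

    from sklearn.metrics import precision_score

    y_true = [0, 0, 0, 0, 1, 2, 1, 2]
    y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

    # Macro average over all three classes, majority class 0 included.
    precision_score(y_true, y_pred, average='macro')                  # ~0.72
    # Macro average restricted to the two minority classes.
    precision_score(y_true, y_pred, labels=[1, 2], average='macro')   # ~0.58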
The recall is - intuitively the ability of the classifier to find all the positive samples. - - The best value is 1 and the worst value is 0. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - labels : list, optional - The set of labels to include when ``average != 'binary'``, and their - order if ``average is None``. Labels present in the data can be - excluded, for example to calculate a multiclass average ignoring a - majority negative class, while labels not present in the data will - result in 0 components in a macro average. For multilabel targets, - labels are column indices. By default, all labels in ``y_true`` and - ``y_pred`` are used in sorted order. - - .. versionchanged:: 0.17 - parameter *labels* improved for multiclass problem. - - pos_label : str or int, 1 by default - The class to report if ``average='binary'``. Until version 0.18 it is - necessary to set ``pos_label=None`` if seeking to use another averaging - method over binary targets. - - average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \ - 'weighted'] - This parameter is required for multiclass/multilabel targets. - If ``None``, the scores for each class are returned. Otherwise, this - determines the type of averaging performed on the data: - - ``'binary'``: - Only report results for the class specified by ``pos_label``. - This is applicable only if targets (``y_{true,pred}``) are binary. - ``'micro'``: - Calculate metrics globally by counting the total true positives, - false negatives and false positives. - ``'macro'``: - Calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - ``'weighted'``: - Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This - alters 'macro' to account for label imbalance; it can result in an - F-score that is not between precision and recall. - ``'samples'``: - Calculate metrics for each instance, and find their average (only - meaningful for multilabel classification where this differs from - :func:`accuracy_score`). - - Note that if ``pos_label`` is given in binary classification with - `average != 'binary'`, only that positive class is reported. This - behavior is deprecated and will change in version 0.18. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - recall : float (if average is not None) or array of float, shape =\ - [n_unique_labels] - Recall of the positive class in binary classification or weighted - average of the recall of each class for the multiclass task. - - Examples - -------- - >>> from sklearn.metrics import recall_score - >>> y_true = [0, 1, 2, 0, 1, 2] - >>> y_pred = [0, 2, 1, 0, 0, 1] - >>> recall_score(y_true, y_pred, average='macro') # doctest: +ELLIPSIS - 0.33... - >>> recall_score(y_true, y_pred, average='micro') # doctest: +ELLIPSIS - 0.33... - >>> recall_score(y_true, y_pred, average='weighted') # doctest: +ELLIPSIS - 0.33... 
- >>> recall_score(y_true, y_pred, average=None) - array([ 1., 0., 0.]) - - - """ - _, r, _, _ = precision_recall_fscore_support(y_true, y_pred, - labels=labels, - pos_label=pos_label, - average=average, - warn_for=('recall',), - sample_weight=sample_weight) - return r - - -def classification_report(y_true, y_pred, labels=None, target_names=None, - sample_weight=None, digits=2): - """Build a text report showing the main classification metrics - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) target values. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Estimated targets as returned by a classifier. - - labels : array, shape = [n_labels] - Optional list of label indices to include in the report. - - target_names : list of strings - Optional display names matching the labels (same order). - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - digits : int - Number of digits for formatting output floating point values - - Returns - ------- - report : string - Text summary of the precision, recall, F1 score for each class. - - Examples - -------- - >>> from sklearn.metrics import classification_report - >>> y_true = [0, 1, 2, 2, 2] - >>> y_pred = [0, 0, 2, 2, 1] - >>> target_names = ['class 0', 'class 1', 'class 2'] - >>> print(classification_report(y_true, y_pred, target_names=target_names)) - precision recall f1-score support - - class 0 0.50 1.00 0.67 1 - class 1 0.00 0.00 0.00 1 - class 2 1.00 0.67 0.80 3 - - avg / total 0.70 0.60 0.61 5 - - - """ - - if labels is None: - labels = unique_labels(y_true, y_pred) - else: - labels = np.asarray(labels) - - last_line_heading = 'avg / total' - - if target_names is None: - target_names = ['%s' % l for l in labels] - name_width = max(len(cn) for cn in target_names) - width = max(name_width, len(last_line_heading), digits) - - headers = ["precision", "recall", "f1-score", "support"] - fmt = '%% %ds' % width # first column: class name - fmt += ' ' - fmt += ' '.join(['% 9s' for _ in headers]) - fmt += '\n' - - headers = [""] + headers - report = fmt % tuple(headers) - report += '\n' - - p, r, f1, s = precision_recall_fscore_support(y_true, y_pred, - labels=labels, - average=None, - sample_weight=sample_weight) - - for i, label in enumerate(labels): - values = [target_names[i]] - for v in (p[i], r[i], f1[i]): - values += ["{0:0.{1}f}".format(v, digits)] - values += ["{0}".format(s[i])] - report += fmt % tuple(values) - - report += '\n' - - # compute averages - values = [last_line_heading] - for v in (np.average(p, weights=s), - np.average(r, weights=s), - np.average(f1, weights=s)): - values += ["{0:0.{1}f}".format(v, digits)] - values += ['{0}'.format(np.sum(s))] - report += fmt % tuple(values) - return report - - -def hamming_loss(y_true, y_pred, classes=None, sample_weight=None): - """Compute the average Hamming loss. - - The Hamming loss is the fraction of labels that are incorrectly predicted. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : 1d array-like, or label indicator array / sparse matrix - Ground truth (correct) labels. - - y_pred : 1d array-like, or label indicator array / sparse matrix - Predicted labels, as returned by a classifier. - - classes : array, shape = [n_labels], optional - Integer array of labels. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. 
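A toy multilabel comparison (indicator arrays made up here) of the Hamming loss with the subset zero-one loss, anticipating the contrast drawn in the Notes below::

    import numpy as np
    from sklearn.metrics import hamming_loss, zero_one_loss

    Y_true = np.array([[0, 1, 1],
                       [1, 0, 1]])
    Y_pred = np.array([[0, 1, 0],
                       [1, 0, 1]])

    # One of the six individual labels is wrong ...
    hamming_loss(Y_true, Y_pred)    # 1/6 ~ 0.166...
    # ... but one of the two label sets is imperfect, so the subset loss is larger.
    zero_one_loss(Y_true, Y_pred)   # 0.5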
- - Returns - ------- - loss : float or int, - Return the average Hamming loss between element of ``y_true`` and - ``y_pred``. - - See Also - -------- - accuracy_score, jaccard_similarity_score, zero_one_loss - - Notes - ----- - In multiclass classification, the Hamming loss correspond to the Hamming - distance between ``y_true`` and ``y_pred`` which is equivalent to the - subset ``zero_one_loss`` function. - - In multilabel classification, the Hamming loss is different from the - subset zero-one loss. The zero-one loss considers the entire set of labels - for a given sample incorrect if it does entirely match the true set of - labels. Hamming loss is more forgiving in that it penalizes the individual - labels. - - The Hamming loss is upperbounded by the subset zero-one loss. When - normalized over samples, the Hamming loss is always between 0 and 1. - - References - ---------- - .. [1] Grigorios Tsoumakas, Ioannis Katakis. Multi-Label Classification: - An Overview. International Journal of Data Warehousing & Mining, - 3(3), 1-13, July-September 2007. - - .. [2] `Wikipedia entry on the Hamming distance - `_ - - Examples - -------- - >>> from sklearn.metrics import hamming_loss - >>> y_pred = [1, 2, 3, 4] - >>> y_true = [2, 2, 3, 4] - >>> hamming_loss(y_true, y_pred) - 0.25 - - In the multilabel case with binary label indicators: - - >>> hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2))) - 0.75 - """ - y_type, y_true, y_pred = _check_targets(y_true, y_pred) - - if classes is None: - classes = unique_labels(y_true, y_pred) - else: - classes = np.asarray(classes) - - if sample_weight is None: - weight_average = 1. - else: - weight_average = np.mean(sample_weight) - - if y_type.startswith('multilabel'): - n_differences = count_nonzero(y_true - y_pred, - sample_weight=sample_weight) - return (n_differences / - (y_true.shape[0] * len(classes) * weight_average)) - - elif y_type in ["binary", "multiclass"]: - return _weighted_sum(y_true != y_pred, sample_weight, normalize=True) - else: - raise ValueError("{0} is not supported".format(y_type)) - - -def log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None): - """Log loss, aka logistic loss or cross-entropy loss. - - This is the loss function used in (multinomial) logistic regression - and extensions of it such as neural networks, defined as the negative - log-likelihood of the true labels given a probabilistic classifier's - predictions. For a single sample with true label yt in {0,1} and - estimated probability yp that yt = 1, the log loss is - - -log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp)) - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : array-like or label indicator matrix - Ground truth (correct) labels for n_samples samples. - - y_pred : array-like of float, shape = (n_samples, n_classes) - Predicted probabilities, as returned by a classifier's - predict_proba method. - - eps : float - Log loss is undefined for p=0 or p=1, so probabilities are - clipped to max(eps, min(1 - eps, p)). - - normalize : bool, optional (default=True) - If true, return the mean loss per sample. - Otherwise, return the sum of the per-sample losses. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - loss : float - - Examples - -------- - >>> log_loss(["spam", "ham", "ham", "spam"], # doctest: +ELLIPSIS - ... [[.1, .9], [.9, .1], [.8, .2], [.35, .65]]) - 0.21616... - - References - ---------- - C.M. Bishop (2006). 
Pattern Recognition and Machine Learning. Springer, - p. 209. - - Notes - ----- - The logarithm used is the natural logarithm (base-e). - """ - lb = LabelBinarizer() - T = lb.fit_transform(y_true) - if T.shape[1] == 1: - T = np.append(1 - T, T, axis=1) - - y_pred = check_array(y_pred, ensure_2d=False) - # Clipping - Y = np.clip(y_pred, eps, 1 - eps) - - # This happens in cases when elements in y_pred have type "str". - if not isinstance(Y, np.ndarray): - raise ValueError("y_pred should be an array of floats.") - - # If y_pred is of single dimension, assume y_true to be binary - # and then check. - if Y.ndim == 1: - Y = Y[:, np.newaxis] - if Y.shape[1] == 1: - Y = np.append(1 - Y, Y, axis=1) - - # Check if dimensions are consistent. - check_consistent_length(T, Y) - T = check_array(T) - Y = check_array(Y) - if T.shape[1] != Y.shape[1]: - raise ValueError("y_true and y_pred have different number of classes " - "%d, %d" % (T.shape[1], Y.shape[1])) - - # Renormalize - Y /= Y.sum(axis=1)[:, np.newaxis] - loss = -(T * np.log(Y)).sum(axis=1) - - return _weighted_sum(loss, sample_weight, normalize) - - -def hinge_loss(y_true, pred_decision, labels=None, sample_weight=None): - """Average hinge loss (non-regularized) - - In binary class case, assuming labels in y_true are encoded with +1 and -1, - when a prediction mistake is made, ``margin = y_true * pred_decision`` is - always negative (since the signs disagree), implying ``1 - margin`` is - always greater than 1. The cumulated hinge loss is therefore an upper - bound of the number of mistakes made by the classifier. - - In multiclass case, the function expects that either all the labels are - included in y_true or an optional labels argument is provided which - contains all the labels. The multilabel margin is calculated according - to Crammer-Singer's method. As in the binary case, the cumulated hinge loss - is an upper bound of the number of mistakes made by the classifier. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : array, shape = [n_samples] - True target, consisting of integers of two values. The positive label - must be greater than the negative label. - - pred_decision : array, shape = [n_samples] or [n_samples, n_classes] - Predicted decisions, as output by decision_function (floats). - - labels : array, optional, default None - Contains all the labels for the problem. Used in multiclass hinge loss. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - Returns - ------- - loss : float - - References - ---------- - .. [1] `Wikipedia entry on the Hinge loss - `_ - - .. [2] Koby Crammer, Yoram Singer. On the Algorithmic - Implementation of Multiclass Kernel-based Vector - Machines. Journal of Machine Learning Research 2, - (2001), 265-292 - - .. [3] `L1 AND L2 Regularization for Multiclass Hinge Loss Models - by Robert C. Moore, John DeNero. - `_ - - Examples - -------- - >>> from sklearn import svm - >>> from sklearn.metrics import hinge_loss - >>> X = [[0], [1]] - >>> y = [-1, 1] - >>> est = svm.LinearSVC(random_state=0) - >>> est.fit(X, y) - LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, - intercept_scaling=1, loss='squared_hinge', max_iter=1000, - multi_class='ovr', penalty='l2', random_state=0, tol=0.0001, - verbose=0) - >>> pred_decision = est.decision_function([[-2], [3], [0.5]]) - >>> pred_decision # doctest: +ELLIPSIS - array([-2.18..., 2.36..., 0.09...]) - >>> hinge_loss([-1, 1, 1], pred_decision) # doctest: +ELLIPSIS - 0.30... 
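Restating the binary example above by hand, with the decision values rounded from the doctest output, shows where the 0.30 comes from::

    import numpy as np

    y_true = np.array([-1, 1, 1])
    pred_decision = np.array([-2.18, 2.36, 0.09])   # rounded values from above

    margin = y_true * pred_decision                 # negative only on mistakes
    losses = np.clip(1 - margin, 0, None)           # good-enough predictions cost nothing
    losses.mean()                                   # ~0.30, matching hinge_loss above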
- - In the multiclass case: - - >>> X = np.array([[0], [1], [2], [3]]) - >>> Y = np.array([0, 1, 2, 3]) - >>> labels = np.array([0, 1, 2, 3]) - >>> est = svm.LinearSVC() - >>> est.fit(X, Y) - LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, - intercept_scaling=1, loss='squared_hinge', max_iter=1000, - multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, - verbose=0) - >>> pred_decision = est.decision_function([[-1], [2], [3]]) - >>> y_true = [0, 2, 3] - >>> hinge_loss(y_true, pred_decision, labels) #doctest: +ELLIPSIS - 0.56... - """ - check_consistent_length(y_true, pred_decision, sample_weight) - pred_decision = check_array(pred_decision, ensure_2d=False) - y_true = column_or_1d(y_true) - y_true_unique = np.unique(y_true) - if y_true_unique.size > 2: - if (labels is None and pred_decision.ndim > 1 and - (np.size(y_true_unique) != pred_decision.shape[1])): - raise ValueError("Please include all labels in y_true " - "or pass labels as third argument") - if labels is None: - labels = y_true_unique - le = LabelEncoder() - le.fit(labels) - y_true = le.transform(y_true) - mask = np.ones_like(pred_decision, dtype=bool) - mask[np.arange(y_true.shape[0]), y_true] = False - margin = pred_decision[~mask] - margin -= np.max(pred_decision[mask].reshape(y_true.shape[0], -1), - axis=1) - - else: - # Handles binary class case - # this code assumes that positive and negative labels - # are encoded as +1 and -1 respectively - pred_decision = column_or_1d(pred_decision) - pred_decision = np.ravel(pred_decision) - - lbin = LabelBinarizer(neg_label=-1) - y_true = lbin.fit_transform(y_true)[:, 0] - - try: - margin = y_true * pred_decision - except TypeError: - raise TypeError("pred_decision should be an array of floats.") - - losses = 1 - margin - # The hinge_loss doesn't penalize good enough predictions. - losses[losses <= 0] = 0 - return np.average(losses, weights=sample_weight) - - -def _check_binary_probabilistic_predictions(y_true, y_prob): - """Check that y_true is binary and y_prob contains valid probabilities""" - check_consistent_length(y_true, y_prob) - - labels = np.unique(y_true) - - if len(labels) != 2: - raise ValueError("Only binary classification is supported. " - "Provided labels %s." % labels) - - if y_prob.max() > 1: - raise ValueError("y_prob contains values greater than 1.") - - if y_prob.min() < 0: - raise ValueError("y_prob contains values less than 0.") - - return label_binarize(y_true, labels)[:, 0] - - -def brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None): - """Compute the Brier score. - - The smaller the Brier score, the better, hence the naming with "loss". - - Across all items in a set N predictions, the Brier score measures the - mean squared difference between (1) the predicted probability assigned - to the possible outcomes for item i, and (2) the actual outcome. - Therefore, the lower the Brier score is for a set of predictions, the - better the predictions are calibrated. Note that the Brier score always - takes on a value between zero and one, since this is the largest - possible difference between a predicted probability (which must be - between zero and one) and the actual outcome (which can take on values - of only 0 and 1). 
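A minimal numeric restatement of this definition, reusing the probabilities that appear in the docstring example below::

    import numpy as np
    from sklearn.metrics import brier_score_loss

    y_true = np.array([0, 1, 1, 0])
    y_prob = np.array([0.1, 0.9, 0.8, 0.3])

    # Mean squared gap between predicted probability and actual outcome.
    by_hand = np.mean((y_prob - y_true) ** 2)       # 0.0375
    assert np.isclose(by_hand, brier_score_loss(y_true, y_prob))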
- - The Brier score is appropriate for binary and categorical outcomes that - can be structured as true or false, but is inappropriate for ordinal - variables which can take on three or more values (this is because the - Brier score assumes that all possible outcomes are equivalently - "distant" from one another). Which label is considered to be the positive - label is controlled via the parameter pos_label, which defaults to 1. - - Read more in the :ref:`User Guide `. - - Parameters - ---------- - y_true : array, shape (n_samples,) - True targets. - - y_prob : array, shape (n_samples,) - Probabilities of the positive class. - - sample_weight : array-like of shape = [n_samples], optional - Sample weights. - - pos_label : int (default: None) - Label of the positive class. If None, the maximum label is used as - positive class - - Returns - ------- - score : float - Brier score - - Examples - -------- - >>> import numpy as np - >>> from sklearn.metrics import brier_score_loss - >>> y_true = np.array([0, 1, 1, 0]) - >>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"]) - >>> y_prob = np.array([0.1, 0.9, 0.8, 0.3]) - >>> brier_score_loss(y_true, y_prob) # doctest: +ELLIPSIS - 0.037... - >>> brier_score_loss(y_true, 1-y_prob, pos_label=0) # doctest: +ELLIPSIS - 0.037... - >>> brier_score_loss(y_true_categorical, y_prob, \ - pos_label="ham") # doctest: +ELLIPSIS - 0.037... - >>> brier_score_loss(y_true, np.array(y_prob) > 0.5) - 0.0 - - References - ---------- - https://en.wikipedia.org/wiki/Brier_score - """ - y_true = column_or_1d(y_true) - y_prob = column_or_1d(y_prob) - if pos_label is None: - pos_label = y_true.max() - y_true = np.array(y_true == pos_label, int) - y_true = _check_binary_probabilistic_predictions(y_true, y_prob) - return np.average((y_true - y_prob) ** 2, weights=sample_weight) From 1f0c36845a0ab25a7b72f504fe272ba127522688 Mon Sep 17 00:00:00 2001 From: Nelson Liu Date: Mon, 22 Feb 2016 02:12:20 -0800 Subject: [PATCH 4/4] doc: address some comments by @ogrisel --- doc/developers/advanced_installation.rst | 6 ++---- doc/modules/feature_selection.rst | 4 ++-- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/doc/developers/advanced_installation.rst b/doc/developers/advanced_installation.rst index 8b5a295675d0a..29e8e54d275d3 100644 --- a/doc/developers/advanced_installation.rst +++ b/doc/developers/advanced_installation.rst @@ -279,10 +279,8 @@ path environment variable. ------------- for 32-bit python it is possible use the standalone installers for -`microsoft visual c++ express 2008 `_ -for python 2 or -`microsoft visual c++ express 2010 `_ -or python 3. +`microsoft visual c++ express 2008 `_ +for python 2 or microsoft visual c++ express 2010 for python 3. once installed you should be able to build scikit-learn without any particular configuration by running the following command in the scikit-learn diff --git a/doc/modules/feature_selection.rst b/doc/modules/feature_selection.rst index 081cdba4cc97e..d28fbacce3ddd 100644 --- a/doc/modules/feature_selection.rst +++ b/doc/modules/feature_selection.rst @@ -265,7 +265,7 @@ of features non zero. * N. Meinshausen, P. Buhlmann, "Stability selection", Journal of the Royal Statistical Society, 72 (2010) - http://arxiv.org/pdf/0809.2932.pdf + http://arxiv.org/abs/0809.2932 * F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap" https://hal.inria.fr/hal-00354771/ @@ -324,4 +324,4 @@ Then, a :class:`sklearn.ensemble.RandomForestClassifier` is trained on the transformed output, i.e. 
using only relevant features. You can, of course, perform similar operations with the other feature selection methods, and also with classifiers that provide a way to evaluate feature importances. -See the :class:`sklearn.pipeline.Pipeline` examples for more details. \ No newline at end of file +See the :class:`sklearn.pipeline.Pipeline` examples for more details.
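One possible concrete version of the pipeline described above is sketched here; the L1-penalised ``LinearSVC`` selector, the iris data and the forest settings are illustrative choices rather than anything prescribed by this section::

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    iris = load_iris()
    X, y = iris.data, iris.target

    clf = Pipeline([
        # Sparse linear model used only to pick informative features.
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        # Forest trained on the reduced feature set.
        ('classification', RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
    clf.fit(X, y)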