From 2684d279e490cf90131b8cb506c8066f4f4c24b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien?= Date: Wed, 3 Jul 2019 11:10:04 +0200 Subject: [PATCH 1/5] get started --- doc/supervised.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/supervised.rst b/doc/supervised.rst index 5520ce8e..6bc884af 100644 --- a/doc/supervised.rst +++ b/doc/supervised.rst @@ -11,8 +11,8 @@ from each other. General API =========== -Supervised Metric Learning Algorithms are the easiest metric-learn algorithms -to use, since they use the same API as ``scikit-learn``. +Supervised metric learning algorithms essentially use the same API as +``scikit-learn``. Input data ---------- From 3a8e063a3267ccdfefd5b9c938c7b0979a0276c7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien?= Date: Wed, 3 Jul 2019 11:26:54 +0200 Subject: [PATCH 2/5] supervised.rst done --- doc/supervised.rst | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/doc/supervised.rst b/doc/supervised.rst index 6bc884af..ed32d8a2 100644 --- a/doc/supervised.rst +++ b/doc/supervised.rst @@ -20,13 +20,14 @@ In order to train a model, you need two `array-like `_ objects, `X` and `y`. `X` should be a 2D array-like of shape `(n_samples, n_features)`, where `n_samples` is the number of points of your dataset and `n_features` is the -number of attributes of each of your points. `y` should be a 1D array-like +number of attributes describing each point. `y` should be a 1D +array-like of shape `(n_samples,)`, containing for each point in `X` the class it belongs to (or the value to regress for this sample, if you use `MLKR` for instance). Here is an example of a dataset of two dogs and one -cat (the classes are 'dog' and 'cat') an animal being being represented by +cat (the classes are 'dog' and 'cat') an animal being represented by two numbers. >>> import numpy as np @@ -83,9 +84,10 @@ array([0.49627072, 3.65287282]) .. note:: - If the metric learner that you use learns a Mahalanobis Matrix (like it is - the case for all algorithms currently in metric-learn), you can get the - plain learned Mahalanobis matrix using `get_mahalanobis_matrix`. + If the metric learner that you use learns a :ref:`Mahalanobis distance + ` (like it is the case for all algorithms + currently in metric-learn), you can get the plain learned Mahalanobis + matrix using `get_mahalanobis_matrix`. >>> nca.get_mahalanobis_matrix() array([[0.43680409, 0.89169412], @@ -96,9 +98,13 @@ array([0.49627072, 3.65287282]) Scikit-learn compatibility -------------------------- -All supervised algorithms are scikit-learn `sklearn.base.Estimators`, and -`sklearn.base.TransformerMixin` so they are compatible with Pipelining and -scikit-learn model selection routines. +All supervised algorithms are scikit-learn estimators +(`sklearn.base.BaseEstimator`) and transformers +(`sklearn.base.TransformerMixin`) so they are compatible with pipelines +(`sklearn.pipeline.Pipeline`) and +scikit-learn model selection routines +(`sklearn.model_selection.cross_val_score`, +`sklearn.model_selection.GridSearchCV`, etc). 
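+
+For instance, a supervised metric learner can be used as a transformer step
+in a pipeline and evaluated with scikit-learn's model selection tools. The
+snippet below is only an illustrative sketch (the choice of `NCA`, the iris
+dataset and 3-fold cross-validation is arbitrary):
+
+>>> from metric_learn import NCA
+>>> from sklearn.datasets import load_iris
+>>> from sklearn.model_selection import cross_val_score
+>>> from sklearn.neighbors import KNeighborsClassifier
+>>> from sklearn.pipeline import make_pipeline
+>>> X, y = load_iris(return_X_y=True)
+>>> pipe = make_pipeline(NCA(), KNeighborsClassifier())  # learn a metric, then classify
+>>> scores = cross_val_score(pipe, X, y, cv=3)  # cross-validate the whole pipeline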
Algorithms ========== @@ -365,14 +371,14 @@ calculating a weighted average of all the training samples: Supervised versions of weakly-supervised algorithms --------------------------------------------------- -Note that each :ref:`weakly-supervised algorithm ` +Each :ref:`weakly-supervised algorithm ` has a supervised version of the form `*_Supervised` where similarity tuples are -generated from the labels information and passed to the underlying algorithm. -These constraints are sampled randomly under the hood. +randomly generated from the labels information and passed to the underlying +algorithm. For pairs learners (see :ref:`learning_on_pairs`), pairs (tuple of two points -from the dataset), and labels (`int` indicating whether the two points are -similar (+1) or dissimilar (-1)), are sampled with the function +from the dataset), and pair labels (`int` indicating whether the two points +are similar (+1) or dissimilar (-1)), are sampled with the function `metric_learn.constraints.positive_negative_pairs`. To sample positive pairs (of label +1), this method will look at all the samples from the same label and sample randomly a pair among them. To sample negative pairs (of label -1), this @@ -383,12 +389,11 @@ of one of those, so forcing `same_length=True` will return both times the minimum of the two lenghts. For using quadruplets learners (see :ref:`learning_on_quadruplets`) in a -supervised way, we will basically sample positive and negative pairs like -before, but we'll just concatenate them, so that we have a 3D array of -quadruplets, where for each quadruplet the two first points are in fact points -from the same class, and the two last points are in fact points from a -different class (so indeed the two last points should be less similar than the -two first points). +supervised way, positive and negative pairs are sampled as above and +concatenated so that we have a 3D array of +quadruplets, where for each quadruplet the two first points are from the same +class, and the two last points are from a different class (so indeed the two +last points should be less similar than the two first points). .. topic:: Example Code: From 3e443fdd67b86214c29964023956a33c7fe01fc7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien?= Date: Wed, 3 Jul 2019 11:28:05 +0200 Subject: [PATCH 3/5] unsupervised.rst done --- doc/unsupervised.rst | 6 +++--- doc/user_guide.rst | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/unsupervised.rst b/doc/unsupervised.rst index 1d5bef43..1191e805 100644 --- a/doc/unsupervised.rst +++ b/doc/unsupervised.rst @@ -2,9 +2,9 @@ Unsupervised Metric Learning ============================ -Unsupervised metric learning algorithms just take as input points `X`. For -now, in metric-learn, there only is `Covariance`, which is a simple -baseline algorithm (see below). +Unsupervised metric learning algorithms only take as input an (unlabeled) +dataset `X`. For now, in metric-learn, there only is `Covariance`, which is a +simple baseline algorithm (see below). 
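+
+A minimal sketch of this use (the toy data below is purely illustrative; any
+2D array-like of points would do):
+
+>>> from metric_learn import Covariance
+>>> import numpy as np
+>>> X = np.array([[0.5, 0.2], [2.3, 1.9], [1.4, 0.7], [0.8, 2.6]])
+>>> cov = Covariance().fit(X)  # unsupervised: no labels are given
+>>> X_e = cov.transform(X)     # embed the points with the learned metric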
Algorithms diff --git a/doc/user_guide.rst b/doc/user_guide.rst index fb7060ce..5472107a 100644 --- a/doc/user_guide.rst +++ b/doc/user_guide.rst @@ -12,4 +12,5 @@ User Guide introduction.rst supervised.rst weakly_supervised.rst + unsupervised.rst preprocessor.rst \ No newline at end of file From 6e67bc1fca4d6020ff84e54146542688d8e6abd6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien?= Date: Wed, 3 Jul 2019 12:06:12 +0200 Subject: [PATCH 4/5] finish addressing comments --- doc/supervised.rst | 8 +- doc/weakly_supervised.rst | 170 +++++++++++++++++++------------------- 2 files changed, 88 insertions(+), 90 deletions(-) diff --git a/doc/supervised.rst b/doc/supervised.rst index ed32d8a2..2511cc69 100644 --- a/doc/supervised.rst +++ b/doc/supervised.rst @@ -12,7 +12,7 @@ General API =========== Supervised metric learning algorithms essentially use the same API as -``scikit-learn``. +scikit-learn. Input data ---------- @@ -168,7 +168,7 @@ indicates :math:`\mathbf{x}_{i}, \mathbf{x}_{j}` belong to different class, :py:class:`NCA ` -------------------------------------- -Neighborhood Components Analysis(:py:class:`NCA `) +Neighborhood Components Analysis (:py:class:`NCA `) `NCA` is a distance metric learning algorithm which aims to improve the accuracy of nearest neighbors classification compared to the standard @@ -232,7 +232,7 @@ the sum of probability of being correctly classified: :py:class:`LFDA ` ----------------------------------------- -Local Fisher Discriminant Analysis(:py:class:`LFDA `) +Local Fisher Discriminant Analysis (:py:class:`LFDA `) `LFDA` is a linear supervised dimensionality reduction method. It is particularly useful when dealing with multi-modality, where one ore more classes @@ -306,7 +306,7 @@ same class are not imposed to be close. :py:class:`MLKR ` ----------------------------------------- -Metric Learning for Kernel Regression(:py:class:`MLKR `) +Metric Learning for Kernel Regression (:py:class:`MLKR `) `MLKR` is an algorithm for supervised metric learning, which learns a distance function by directly minimizing the leave-one-out regression error. diff --git a/doc/weakly_supervised.rst b/doc/weakly_supervised.rst index 7e488ac7..cb83e24b 100644 --- a/doc/weakly_supervised.rst +++ b/doc/weakly_supervised.rst @@ -31,22 +31,21 @@ two points, three points, etc...). The label is some information we have about this set of points (e.g. "these two points are similar"). Note that some information can be contained in the ordering of these tuples (see for instance the section :ref:`learning_on_quadruplets`). For more details about -the specific of each algorithms, refer to the appropriate section: either -:ref:`learning_on_pairs` or :ref:`learning_on_quadruplets`) +specific forms of tuples, refer to the appropriate sections +(:ref:`learning_on_pairs` or :ref:`learning_on_quadruplets`). - -The `tuples` argument is the first argument of every method (like the X +The `tuples` argument is the first argument of every method (like the `X` argument for classical algorithms in scikit-learn). The second argument is the label of the tuple: its semantic depends on the algorithm used. For instance -for pairs learners ``y`` is a label indicating whether the pair is of similar +for pairs learners `y` is a label indicating whether the pair is of similar samples or dissimilar samples. 
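+
+As a small illustrative sketch (the values below are arbitrary), the input of
+a pairs learner could look like this:
+
+>>> import numpy as np
+>>> tuples = np.array([[[1.2, 3.2], [2.3, 5.5]],
+...                    [[4.5, 2.3], [2.1, 2.3]]])
+>>> y = np.array([1, -1])  # first pair is similar, second is dissimilar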
Then one can fit a Weakly Supervised Metric Learner on this tuple, like this: >>> my_algo.fit(tuples, y) -Like in a classical setting we split the points ``X`` between train and test, -here we split the ``tuples`` between train and test. +Like in a classical setting we split the points `X` between train and test, +here we split the `tuples` between train and test. >>> from sklearn.model_selection import train_test_split >>> pairs_train, pairs_test, y_train, y_test = train_test_split(pairs, y) @@ -58,9 +57,9 @@ learn: ^^^^^^^^^^^^^^^^^^ The most intuitive way to represent tuples is to provide the algorithm with a -3D array-like of tuples of shape ``(n_tuples, t, n_features)``, where -``n_tuples`` is the number of tuples, ``tuple_size`` is the number of elements -in a tuple (2 for pairs, 3 for triplets for instance), and ``n_features`` is +3D array-like of tuples of shape `(n_tuples, t, n_features)`, where +`n_tuples` is the number of tuples, `tuple_size` is the number of elements +in a tuple (2 for pairs, 3 for triplets for instance), and `n_features` is the number of features of each point. .. topic:: Example: @@ -91,8 +90,8 @@ the number of features of each point. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Instead of forming each point in each tuple, a more efficient representation -would be to keep the dataset of points ``X`` aside, and just represent tuples -as a collection of tuples of *indices* from the points in ``X``. Since we loose +would be to keep the dataset of points `X` aside, and just represent tuples +as a collection of tuples of *indices* from the points in `X`. Since we loose the feature dimension there, the resulting array is 2D. .. topic:: Example: An equivalent representation of the above pairs would be: @@ -110,7 +109,7 @@ the feature dimension there, the resulting array is 2D. >>> y = np.array([-1, 1, 1, -1]) In order to fit metric learning algorithms with this type of input, we need to -give the original dataset of points ``X`` to the estimator so that it knows +give the original dataset of points `X` to the estimator so that it knows the points the indices refer to. We do this when initializing the estimator, through the argument `preprocessor` (see below :ref:`fit_ws`) @@ -118,7 +117,7 @@ through the argument `preprocessor` (see below :ref:`fit_ws`) .. note:: Instead of an array-like, you can give a callable in the argument - ``preprocessor``, which will go fetch and form the tuples. This allows to + `preprocessor`, which will go fetch and form the tuples. This allows to give more general indicators than just indices from an array (for instance paths in the filesystem, name of records in a database etc...) See section :ref:`preprocessor_section` for more details on how to use the preprocessor. @@ -157,7 +156,7 @@ Here we transform two points in the new embedding space. array([[-3.24667162e+01, 4.62622348e-07, 3.88325421e-08], [-3.61531114e+01, 4.86778289e-07, 2.12654397e-08]]) -Also, as explained before, our metric learners has learn a distance between +Also, as explained before, our metric learner has learned a distance between points. You can use this distance in two main ways: - You can either return the distance between pairs of points using the @@ -178,9 +177,10 @@ array([7.27607365, 0.88853014]) .. note:: - If the metric learner that you use learns a Mahalanobis Matrix (like it is - the case for all algorithms currently in metric-learn), you can get the - plain Mahalanobis matrix using `get_mahalanobis_matrix`. 
+ If the metric learner that you use learns a :ref:`Mahalanobis distance + ` (like it is the case for all algorithms + currently in metric-learn), you can get the plain Mahalanobis matrix using + `get_mahalanobis_matrix`. >>> mmc.get_mahalanobis_matrix() array([[ 0.58603894, -5.69883982, -1.66614919], @@ -190,53 +190,51 @@ array([[ 0.58603894, -5.69883982, -1.66614919], .. TODO: remove the "like it is the case etc..." if it's not the case anymore .. _sklearn_compat_ws: - + +Prediction and scoring +---------------------- + +Since weakly supervised are also able, after being fitted, to predict for a +given tuple what is its label (for pairs) or ordering (for quadruplets). See +the appropriate section for more details, either :ref:`this +one ` for pairs, or :ref:`this one +` for quadruplets. + +They also implement a default scoring method, `score`, that can be +used to evaluate the performance of a metric-learner on a test dataset. See +the appropriate section for more details, either :ref:`this +one ` for pairs, or :ref:`this one ` +for quadruplets. + Scikit-learn compatibility -------------------------- Weakly supervised estimators are compatible with scikit-learn routines for -model selection (grid-search, cross-validation etc). See the scoring section -of the appropriate algorithm (:ref:`pairs learners ` -or :ref:`quadruplets learners `) -for more details on the scoring used in the case of Weakly Supervised Metric -Learning. +model selection (`sklearn.model_selection.cross_val_score`, +`sklearn.model_selection.GridSearchCV`, etc). Example: >>> from metric_learn import MMC +>>> import numpy as np >>> from sklearn.datasets import load_iris >>> from sklearn.model_selection import cross_val_score >>> rng = np.random.RandomState(42) >>> X, _ = load_iris(return_X_y=True) >>> # let's sample 30 random pairs and labels of pairs >>> pairs_indices = rng.randint(X.shape[0], size=(30, 2)) ->>> y = rng.randint(2, size=30) +>>> y = 2 * rng.randint(2, size=30) - 1 >>> mmc = MMC(preprocessor=X) >>> cross_val_score(mmc, pairs_indices, y) -Prediction and scoring ----------------------- - -Since weakly supervised are also able, after being fitted, to predict for a -given tuple what is its label (for pairs) or ordering (for quadruplets). See -the appropriate section for more details, either :ref:`this -one ` for pairs, or :ref:`this one -` for quadruplets. - -They also implement a default scoring method, `score`, that can be -used to evaluate the performance of a metric-learner on a test dataset. See -the appropriate section for more details, either :ref:`this -one ` for pairs, or :ref:`this one ` -for quadruplets. - .. _learning_on_pairs: Learning on pairs ================= Some metric learning algorithms learn on pairs of samples. In this case, one -should provide the algorithm with ``n_samples`` pairs of points, with a -corresponding target containing ``n_samples`` values being either +1 or -1. +should provide the algorithm with `n_samples` pairs of points, with a +corresponding target containing `n_samples` values being either +1 or -1. These values indicate whether the given pairs are similar points or dissimilar points. @@ -262,11 +260,11 @@ each other. .. _pairs_predicting: -Predicting +Prediction ---------- -When a pairs learner is fitted, it is also able to predict, for an -upcoming pair, whether it is a pair of similar or dissimilar points. +When a pairs learner is fitted, it is also able to predict, for an unseen +pair, whether it is a pair of similar or dissimilar points. 
>>> mmc.predict([[[0.6, 1.6], [1.15, 2.75]], ... [[3.2, 1.1], [5.4, 6.1]]]) @@ -274,34 +272,37 @@ array([1, -1]) .. _calibration: -Thresholding ------------- -In order to predict whether a new pair represents similar or dissimilar -samples, we in fact need to set a distance threshold, so that points closer (in -the learned space) than this threshold are predicted as similar, and points -further away are predicted as dissimilar. Several methods are possible for this -thresholding. +Prediction threshold +^^^^^^^^^^^^^^^^^^^^ -- **At fit time**: The threshold is set with `calibrate_threshold` (see - below) on the trainset. You can specify the calibration parameters directly +Predicting whether a new pair represents similar or dissimilar +samples requires to set a threshold on the learned distance, so that points +closer (in the learned space) than this threshold are predicted as similar, +and points further away are predicted as dissimilar. Several methods are +possible for this thresholding. + +- **Calibration at fit time**: The threshold is set with `calibrate_threshold` + (see below) on the training set. You can specify the calibration + parameters directly in the `fit` method with the `threshold_params` parameter (see the documentation of the `fit` method of any metric learner that learns on pairs - of points for more information). This method can cause a little bit of - overfitting. If you want to avoid that, calibrate the threshold after - fitting, on a validation set. + of points for more information). Note that calibrating on the training set + may cause some overfitting. If you want to avoid that, calibrate the + threshold after fitting, on a validation set. >>> mmc.fit(pairs, y) # will fit the threshold automatically after fitting -- **Manual**: calling `set_threshold` will set the threshold to a - particular value. +- **Calibration on validation set**: calling `calibrate_threshold` will + calibrate the threshold to achieve a particular score on a validation set, + the score being among the classical scores for classification (accuracy, f1 + score...). - >>> mmc.set_threshold(0.4) + >>> mmc.calibrate_threshold(pairs, y) -- **Calibration**: calling `calibrate_threshold` will calibrate the - threshold to achieve a particular score on a validation set, the score - being among the classical scores for classification (accuracy, f1 score...). +- **Manual threshold**: calling `set_threshold` will set the threshold to a + particular value. - >>> mmc.calibrate_threshold(pairs, y) + >>> mmc.set_threshold(0.4) See also: `sklearn.calibration`. @@ -310,18 +311,17 @@ See also: `sklearn.calibration`. Scoring ------- -Not only are they able to predict the label of given pairs, they can also -return a `decision_function` for a set of pairs. It is basically the "score" -that will be thresholded to find the prediction for the pair. In fact this -"score" is the opposite of the distance in the new space (higher score means - points are similar, and lower score dissimilar). +Pair metric learners can also return a `decision_function` for a set of pairs. +It is basically the "score" that will be thresholded to find the prediction +for the pair. This score corresponds to the opposite of the distance in the +new space (higher score means points are similar, and lower score dissimilar). >>> mmc.decision_function([[[0.6, 1.6], [1.15, 2.75]], ... 
[[3.2, 1.1], [5.4, 6.1]]])
array([-0.12811124, -0.74750256])

-This allows to return all kinds of estimator scoring usually used in classic
-classification tasks, like `sklearn.metrics.accuracy` for instance, which
+This allows using common scoring functions for binary classification, like
+`sklearn.metrics.accuracy_score` for instance, which
can be used inside cross-validation routines:

>>> from sklearn.model_selection import cross_val_score
>>> ...
array([1., 0., 1.])

Pairs learners also have a default score, which basically
-returns the `sklearn.metrics.roc_auc_score` (therefore is not dependent on
-the threshold).
+returns the `sklearn.metrics.roc_auc_score` (which is threshold-independent).

>>> pairs_test = np.array([[[0.6, 1.6], [1.15, 2.75]],
...                        [[3.2, 1.1], [5.4, 6.1]],
...                        [[7.7, 5.6], [1.23, 8.4]]])
->>> y_test = np.array([-1., 1., -1.])
+>>> y_test = np.array([1., -1., -1.])
>>> mmc.score(pairs_test, y_test)
-0.5
+1.0

.. note::
   See :ref:`fit_ws` for more details on metric learners functions that are
@@ -356,7 +355,7 @@ Algorithms

:py:class:`ITML `
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Information Theoretic Metric Learning(:py:class:`ITML `)
+Information Theoretic Metric Learning (:py:class:`ITML `)

`ITML` minimizes the (differential) relative entropy, aka Kullback–Leibler
divergence, between two multivariate Gaussians subject to constraints on the
@@ -608,11 +607,10 @@ points, while constrains the sum of distances between dissimilar points:
Learning on quadruplets
=======================

-
-
-The goal of weakly-supervised metric-learning algorithms is to transform
-points in a new space, in which the tuple-wise constraints between points
-are respected.
+Some metric learning algorithms learn on quadruplets of samples. In this case,
+one should provide the algorithm with `n_samples` quadruplets of points. The
+semantics of each quadruplet is that the first two points should be closer
+together than the last two points.

Fitting
-------
@@ -659,7 +657,7 @@ last points.

.. _quadruplets_predicting:

-Predicting
+Prediction
----------

When a quadruplets learner is fitted, it is also able to predict, for an
@@ -677,10 +675,10 @@ array([-1., 1.])

Scoring
-------

-Not only are they able to predict the label of given pairs, they can also
-return a `decision_function` for a set of pairs. It is basically the "score"
-which sign will be taken to find the prediction for the pair. In fact this
-"score" is the difference between the distance between the two last points,
+Quadruplet metric learners can also return a `decision_function` for a set
+of quadruplets. This is basically the "score" whose sign will be taken to
+find the prediction for the quadruplet, which corresponds to the difference
+between the distance between the two last points,
and the distance between the two last points of the quadruplet (higher score
means the two last points are more likely to be more dissimilar than the two
first points (i.e.
more likely to have a +1 prediction since it's From 86062c49c6ddc4f2699f03f7834d7006d70c42c2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien?= Date: Wed, 3 Jul 2019 13:46:00 +0200 Subject: [PATCH 5/5] fix links and uniformize refs --- doc/supervised.rst | 38 +++++++++++++++++++------------------- doc/weakly_supervised.rst | 34 ++++++++++++++-------------------- 2 files changed, 33 insertions(+), 39 deletions(-) diff --git a/doc/supervised.rst b/doc/supervised.rst index 2511cc69..3c941b20 100644 --- a/doc/supervised.rst +++ b/doc/supervised.rst @@ -157,11 +157,13 @@ indicates :math:`\mathbf{x}_{i}, \mathbf{x}_{j}` belong to different class, .. topic:: References: - .. [1] `Distance Metric Learning for Large Margin Nearest Neighbor - Classification - `_ Kilian Q. Weinberger, John - Blitzer, Lawrence K. Saul + .. [1] Weinberger et al. `Distance Metric Learning for Large Margin + Nearest Neighbor Classification + `_. + JMLR 2009 + + .. [2] `Wikipedia entry on Large Margin Nearest Neighbor `_ + .. _nca: @@ -219,13 +221,12 @@ the sum of probability of being correctly classified: .. topic:: References: - .. [1] J. Goldberger, G. Hinton, S. Roweis, R. Salakhutdinov. - "Neighbourhood Components Analysis". Advances in Neural Information - Processing Systems. 17, 513-520, 2005. - http://www.cs.nyu.edu/~roweis/papers/ncanips.pdf + .. [1] Goldberger et al. + `Neighbourhood Components Analysis `_. + NIPS 2005 - .. [2] Wikipedia entry on Neighborhood Components Analysis - https://en.wikipedia.org/wiki/Neighbourhood_components_analysis + .. [2] `Wikipedia entry on Neighborhood Components Analysis `_ + .. _lfda: @@ -293,13 +294,13 @@ same class are not imposed to be close. .. topic:: References: - .. [1] `Dimensionality Reduction of Multimodal Labeled Data by Local - Fisher Discriminant Analysis `_ Masashi Sugiyama. + .. [1] Sugiyama. `Dimensionality Reduction of Multimodal Labeled Data by Local + Fisher Discriminant Analysis `_. + JMLR 2007 - .. [2] `Local Fisher Discriminant Analysis on Beer Style Clustering + .. [2] Tang. `Local Fisher Discriminant Analysis on Beer Style Clustering `_ Yuan Tang. + -discriminant-analysis-on-beer-style-clustering.html#>`_. .. _mlkr: @@ -361,9 +362,8 @@ calculating a weighted average of all the training samples: .. topic:: References: - .. [1] `Metric Learning for Kernel Regression `_ Kilian Q. Weinberger, - Gerald Tesauro + .. [1] Weinberger et al. `Metric Learning for Kernel Regression `_. AISTATS 2007 .. _supervised_version: diff --git a/doc/weakly_supervised.rst b/doc/weakly_supervised.rst index cb83e24b..38f08fbe 100644 --- a/doc/weakly_supervised.rst +++ b/doc/weakly_supervised.rst @@ -421,12 +421,9 @@ is the prior distance metric, set to identity matrix by default, .. topic:: References: - .. [1] `Information-theoretic Metric Learning `_ Jason V. Davis, - et al. + .. [1] Jason V. Davis, et al. `Information-theoretic Metric Learning `_. ICML 2007 - .. [2] Adapted from Matlab code at http://www.cs.utexas.edu/users/pjain/ - itml/ + .. [2] Adapted from Matlab code at http://www.cs.utexas.edu/users/pjain/itml/ .. _sdml: @@ -482,10 +479,9 @@ is the off-diagonal L1 norm. .. topic:: References: .. [1] Qi et al. - An efficient sparse metric learning in high-dimensional space via - L1-penalized log-determinant regularization. ICML 2009. - http://lms.comp.nus.edu.sg/sites/default/files/publication-attachments/ - icml09-guojun.pdf + `An efficient sparse metric learning in high-dimensional space via + L1-penalized log-determinant regularization `_. + ICML 2009. .. 
[2] Adapted from https://gist.github.com/kcarnold/5439945 @@ -536,14 +532,13 @@ as the Mahalanobis matrix. .. topic:: References: - .. [1] `Adjustment learning and relevant component analysis + .. [1] Shental et al. `Adjustment learning and relevant component analysis `_ Noam Shental, et al. + &rep=rep1&type=pdf>`_. ECCV 2002 - .. [2] 'Learning distance functions using equivalence relations', ICML 2003 + .. [2] Bar-Hillel et al. `Learning distance functions using equivalence relations `_. ICML 2003 - .. [3]'Learning a Mahalanobis metric from equivalence constraints', JMLR - 2005 + .. [3] Bar-Hillel et al. `Learning a Mahalanobis metric from equivalence constraints `_. JMLR 2005 .. _mmc: @@ -594,12 +589,11 @@ points, while constrains the sum of distances between dissimilar points: .. topic:: References: - .. [1] `Distance metric learning with application to clustering with + .. [1] Xing et al. `Distance metric learning with application to clustering with side-information `_ Xing, Jordan, Russell, Ng. - .. [2] Adapted from Matlab code `here `_. + -with-side-information.pdf>`_. NIPS 2002 + .. [2] Adapted from Matlab code http://www.cs.cmu.edu/%7Eepxing/papers/Old_papers/code_Metric_online.tar.gz .. _learning_on_quadruplets: @@ -800,8 +794,8 @@ by default, :math:`D_{ld}(\mathbf{\cdot, \cdot})` is the LogDet divergence: .. topic:: References: .. [1] Liu et al. - "Metric Learning from Relative Comparisons by Minimizing Squared - Residual". ICDM 2012. http://www.cs.ucla.edu/~weiwang/paper/ICDM12.pdf + `Metric Learning from Relative Comparisons by Minimizing Squared + Residual `_. ICDM 2012 .. [2] Adapted from https://gist.github.com/kcarnold/5439917