
[WIP] New API proposal #85


Closed
wants to merge 35 commits
3e5fbc3
Add new class structure
Feb 26, 2018
0cbf1ae
Put back TransformerMixin in BaseEstimator to inherit Transformer beh…
Feb 26, 2018
300dada
add ConstrainedDataset object
Feb 27, 2018
8615634
simplify constraints to always keep a view on X
Feb 28, 2018
a478baa
add check for input formats
Mar 2, 2018
3744bec
add basic testing to ConstrainedDataset
Mar 2, 2018
214d991
correct asterisk bug
Mar 2, 2018
4f4ce8b
begin work to dissociate classes
Mar 5, 2018
ac00b8b
update MMC with constrained_dataset
Mar 5, 2018
33561ab
Fixes according to review https://github.com/metric-learn/metric-lear…
Mar 6, 2018
7f40c56
make mixins rather than classes hierarchy for inheriting special methods
Mar 6, 2018
402f397
Merge branch 'new_api' into feat/class_dissociation
Mar 6, 2018
47a9372
Make changes according to review https://github.com/metric-learn/metr…
Mar 13, 2018
41dc123
Finalize class dissociation into mixins
Mar 6, 2018
5f63f24
Merge branch 'feat/class_dissociation' into new_api
Mar 19, 2018
fb0d118
separate valid and invalid input testing
Mar 20, 2018
df8a340
correct too long line syntax
Mar 20, 2018
e3e7e0c
clarify definition of variables in tests
Mar 20, 2018
5a9c2e5
simplify unwrap pairs and make it more robust to y dimension
Mar 20, 2018
cf94740
fix bug due to bad refactoring of c_shape
Mar 20, 2018
52f4516
simplify wrap pairs
Mar 20, 2018
079bb13
make QuadrupletsMixin inherit from WeaklySupervisedMixin
Mar 21, 2018
da7c8e7
add NotImplementedError for abstract mixins
Mar 21, 2018
8192d11
put TransformerMixin inline
Mar 21, 2018
2d0f1ca
put random state at top of file
Mar 21, 2018
6c59a1a
add transform, predict, decision_function, and scoring for weakly sup…
Mar 6, 2018
b70163a
Add tests
Mar 19, 2018
a12eb9a
Add documentation
Mar 23, 2018
b1f6c23
fix typo or/of
Mar 30, 2018
b0ec33b
Add tests for sparse matrices, dataframes and lists
Apr 12, 2018
64f5762
Fix Transformer interface (cf. review https://github.com/metric-learn…
Apr 12, 2018
2cf78dd
Do not separate classes if not needed (cf. https://github.com/metric-…
Apr 12, 2018
11a8ff1
Fix ascii invisible character
Apr 12, 2018
a768cbf
Fix test attribute error and numerical problems with new dataset
Apr 12, 2018
335d8f4
Fix unittest hierarchy of classes
Apr 12, 2018
4 changes: 4 additions & 0 deletions doc/conf.py
@@ -9,6 +9,10 @@
'numpydoc',
]

autodoc_default_flags = ['members', 'inherited-members']

default_role='any'

templates_path = ['_templates']
source_suffix = '.rst'
master_doc = 'index'
24 changes: 22 additions & 2 deletions doc/index.rst
@@ -14,17 +14,34 @@ This package contains efficient Python implementations of several popular
metric learning algorithms.

.. toctree::
:caption: Algorithms
:caption: Unsupervised Algorithms
:maxdepth: 1

metric_learn.covariance
metric_learn.lmnn

.. toctree::
:caption: Weakly Supervised algorithms
:maxdepth: 1

metric_learn.weakly_supervised
metric_learn.mmc
metric_learn.itml
metric_learn.sdml
metric_learn.lsml

Note that all Weakly Supervised Metric Learners have a supervised version. See
:ref:`this section<supervised_version>` for more details.


.. toctree::
:caption: Supervised algorithms
:maxdepth: 1

metric_learn.lmnn
metric_learn.nca
metric_learn.lfda
metric_learn.rca
metric_learn.mlkr

Each metric supports the following methods:

@@ -34,6 +51,9 @@ Each metric supports the following methods:
data matrix :math:`X \in \mathbb{R}^{n \times d}` to the
:math:`D`-dimensional learned metric space :math:`X L^{\top}`,
in which standard Euclidean distances may be used.

.. _transform_ml:

- ``transform(X)``, which applies the aforementioned transformation.
- ``metric()``, which returns a Mahalanobis matrix
:math:`M = L^{\top}L` such that distance between vectors ``x`` and
1 change: 1 addition & 0 deletions doc/metric_learn.base_metric.rst
@@ -5,3 +5,4 @@ metric_learn.base_metric module
:members:
:undoc-members:
:show-inheritance:

8 changes: 8 additions & 0 deletions doc/metric_learn.constrained_dataset.rst
@@ -0,0 +1,8 @@
ConstrainedDataset
==================

.. autoclass:: metric_learn.constraints.ConstrainedDataset
:members:
:undoc-members:
:show-inheritance:

220 changes: 220 additions & 0 deletions doc/metric_learn.weakly_supervised.rst
@@ -0,0 +1,220 @@
.. _wsml:

Weakly Supervised Learning (General information)
================================================

Introduction
------------

In Distance Metric Learning, we are interested in learning a metric between
points that takes into account some supervised information about the
similarity between those points. If each point has a class, we can use this
information by saying that all intra-class points are similar, and inter-class
points are dissimilar.

However, sometimes we do not have a class for each sample. Instead, we may have
pairs of points, each with a label saying whether the points in the pair are
similar or not. Indeed, for a hand-labeled dataset of images with a huge number
of classes, it is easier for a human to say whether two images are similar than
to pick, among all those classes, the one a given image belongs to. We can also
have a dataset of triplets of points where we know the first sample is more
similar to the second than to the third, or quadruplets of points where the
first two points are more similar to each other than the last two are. In fact,
some metric learning algorithms are designed to use this kind of data. These
are Weakly Supervised Metric Learners. For instance, `ITML`, `MMC` and `SDML`
work on labeled pairs, and `LSML` works on unlabeled quadruplets.

In the ``metric-learn`` package, we use an object called `ConstrainedDataset`
to store these kinds of datasets, where each sample/line is a tuple of points
from an initial dataset. Unlike a 3D numpy array where each line would
duplicate ``t`` points from the initial dataset, `ConstrainedDataset` is
memory efficient: it does not duplicate points in the underlying memory.
Instead, it stores the indices of the points involved in every tuple, along
with the initial dataset. It also supports slicing on tuples, which makes it
compatible with scikit-learn's cross-validation utilities (see
:ref:`performance_ws`).

See the documentation of `ConstrainedDataset` for more information.
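The memory layout described above can be sketched with a few lines of plain
numpy. This is a simplified illustration of the idea only, not the actual
`ConstrainedDataset` implementation from this PR; the class name
``IndexedTuples`` is hypothetical:

```python
import numpy as np

# Sketch of the idea behind ConstrainedDataset (hypothetical class, not the
# actual implementation): store the original points once, plus an index
# array, instead of materializing one (t, n_features) block per tuple.
class IndexedTuples:
    def __init__(self, X, c):
        self.X = np.asarray(X)  # (n_samples, n_features), stored once
        self.c = np.asarray(c)  # (n_constraints, t) indices into X

    def __getitem__(self, item):
        # Slicing acts on tuples, not on samples, so cross-validation
        # utilities can split it like any other dataset.
        return IndexedTuples(self.X, self.c[item])

    def toarray(self):
        # Materialize the (n_constraints, t, n_features) view on demand.
        return self.X[self.c]

X = np.arange(10).reshape(5, 2)         # 5 samples, 2 features
c = np.array([[0, 1], [0, 2], [3, 4]])  # 3 pairs, reusing sample 0
pairs = IndexedTuples(X, c)
print(pairs.toarray().shape)  # (3, 2, 2)
```

Note that sample 0 appears in two pairs but is stored only once in ``X``.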



.. _workflow_ws:

Basic workflow
--------------

Let us see how we can use weakly supervised metric learners in a basic
scikit-learn like workflow with ``fit``, ``predict``, ``transform``,
``score`` etc.

- Fitting

Let's say we have a dataset of samples, and for some pairs of them we also
know whether they are similar or dissimilar. We want to fit a metric learner
on this data. First, we recognize that this data is made of labeled pairs. We
therefore need to build a `ConstrainedDataset` from the points ``X`` (an array
of shape ``(n_samples, n_features)``) and the constraints ``c`` (an array of
shape ``(n_constraints, 2)`` of indices of pairs). We also need a vector
``y_constraints`` of shape ``(n_constraints,)`` where each ``y_constraints_i``
is 1 if sample ``X[c[i, 0]]`` is similar to sample ``X[c[i, 1]]`` and 0 if
they are dissimilar.
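For concreteness, toy inputs with these shapes might look like this (the data
here is made up purely for illustration):

```python
import numpy as np

# Hypothetical toy data illustrating the expected shapes.
X = np.random.randn(6, 3)            # (n_samples, n_features)
c = np.array([[0, 1],
              [2, 3],
              [4, 5]])               # (n_constraints, 2): indices of pairs in X
y_constraints = np.array([1, 0, 1])  # (n_constraints,): 1 = similar, 0 = dissimilar
```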

.. code:: python

from metric_learn import ConstrainedDataset
X_constrained = ConstrainedDataset(X, c)

Then we can fit a Weakly Supervised Metric Learner (here, one that inherits
from `PairsMixin`) on this data (let's use `MMC` for instance):

.. code:: python

from metric_learn import MMC
mmc = MMC()
mmc.fit(X_constrained, y_constraints)

.. _transform_ws:

- Transforming

Weakly supervised metric learners can also be used as transformers. Say we
have a fitted estimator. At ``transform`` time, it can be used on arrays of
samples as well as on `ConstrainedDataset`s: it returns transformed samples
and thus only needs input samples (any information on constraints in the
input is ignored). The transformed samples are the new points in an embedded
space. See :ref:`this section<transform_ml>` for more details about this
transformation.

.. code:: python

mmc.transform(X)

- Predicting

Weakly Supervised Metric Learners work on lines of data where each line is a
tuple of points from an original dataset. For some of them, we should also
have a label for each line (for instance, when learning on pairs, each label
``y_constraints_i`` tells whether the pair in line ``i`` is similar or
dissimilar). For these algorithms, applying ``predict`` to an input
`ConstrainedDataset` will predict, for each tuple, a scalar related to this
task. For instance, in the case of pairs, ``predict`` will return for each
input pair a float measuring the similarity between the samples in the pair.

See the API documentation for `WeaklySupervisedMixin`'s children
(`PairsMixin`, `TripletsMixin`, `QuadrupletsMixin`) for the particular
prediction functions of each type of Weakly Supervised Metric Learner.

.. code:: python

mmc.predict(X_constrained)

- Scoring

We can also call the default scoring function of the Weakly Supervised
Metric Learner we use:

.. code:: python

mmc.score(X_constrained, y_constraints)

The type of score depends on the type of Weakly Supervised Metric Learner
used. See the API documentation for `WeaklySupervisedMixin`'s children
(`PairsMixin`, `TripletsMixin`, `QuadrupletsMixin`) for the particular
default scoring functions of each type of estimator.

See also :ref:`performance_ws`, for how to use scikit-learn's
cross-validation routines with Weakly Supervised Metric Learners.


.. _supervised_version:

Supervised Version
------------------

Weakly Supervised Metric Learners can also be used in a supervised way: the
corresponding supervised algorithm will create a `ConstrainedDataset`
``X_constrained`` and labels ``y_constraints`` of tuples from a supervised
dataset with labels. For instance, to use the algorithm `MMC` on a dataset of
points and labels (``X`` and ``y``), we should use ``MMC_Supervised``: the
underlying code will create pairs of samples from the same class with labels
saying they are similar, and pairs of samples from different classes with
labels saying they are dissimilar, before calling `MMC`.
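The pair-generation step can be sketched as follows. This is a simplified
illustration of the idea; ``make_pairs`` is a hypothetical helper, not
metric-learn's actual constraint-generation code:

```python
import numpy as np

def make_pairs(y, n_constraints, rng):
    """Sketch of the idea: draw random index pairs and label each one 1 if
    the two samples share a class, 0 otherwise (hypothetical helper, not
    metric-learn's actual code)."""
    y = np.asarray(y)
    c = rng.randint(0, len(y), size=(n_constraints, 2))
    y_constraints = (y[c[:, 0]] == y[c[:, 1]]).astype(int)
    return c, y_constraints

rng = np.random.RandomState(0)
y = np.array([0, 0, 1, 1, 2, 2])       # class labels for 6 samples
c, y_constraints = make_pairs(y, 10, rng)
```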

Example:

.. code:: python

from sklearn.datasets import make_classification

from metric_learn import MMC_Supervised

X, y = make_classification()
mmc_supervised = MMC_Supervised()
mmc_supervised.fit_transform(X, y)


.. _performance_ws:

Evaluating the performance of weakly supervised metric learning algorithms
--------------------------------------------------------------------------

To evaluate the performance of a classical supervised algorithm that takes an
input dataset ``X`` and labels ``y``, we can compute a cross-validation score.
However, weakly supervised algorithms cannot ``predict`` on a single sample,
so we cannot split on samples to make a training set and a test set the way we
do with usual estimators. Instead, metric learning algorithms output a score
on a **tuple** of samples: for instance, a similarity score on pairs of
samples. So cross-validation scoring for metric learning algorithms requires
splitting on **tuples** of samples. Fortunately, `ConstrainedDataset` lets us
do so naturally.
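Splitting on tuples can be illustrated with scikit-learn's ``KFold`` applied
directly to the constraint array (a sketch of the idea using plain arrays,
independent of `ConstrainedDataset` itself):

```python
import numpy as np
from sklearn.model_selection import KFold

# Splitting on tuples rather than samples: each fold indexes whole rows of
# the constraint array, so a split never breaks a pair apart (though both
# folds may still reference the same underlying points).
c = np.array([[0, 1], [0, 2], [1, 3], [2, 3]])  # 4 pairs over 4 points
y_constraints = np.array([1, 0, 1, 0])

folds = list(KFold(n_splits=2).split(c))
for train_idx, test_idx in folds:
    c_train, c_test = c[train_idx], c[test_idx]
    print(c_train.shape, c_test.shape)  # each fold holds whole pairs
```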

Here is how we would get the cross-validation score for the ``MMC`` algorithm:

.. code:: python

from sklearn.model_selection import cross_val_score
cross_val_score(mmc, X_constrained, y_constraints)


Pipelining
----------

Weakly Supervised Metric Learners can also be embedded in scikit-learn
pipelines. However, they can only be combined with Transformers: the
supervision already comes from the constraints, and we cannot add the extra
supervision that scikit-learn's supervised estimators would require.

For instance, you can combine them with another transformer like PCA or KMeans:

.. code:: python

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

pipe_pca = make_pipeline(MMC(), PCA())
pipe_pca.fit(X_constrained, y_constraints)
pipe_clustering = make_pipeline(MMC(), KMeans())
pipe_clustering.fit(X_constrained, y_constraints)

There are also some other things to keep in mind:

- The ``X`` input of the pipeline should be a `ConstrainedDataset` when
fitting, but at ``transform`` or ``predict`` time it can also be an array of
samples. Therefore, all the following lines are valid:

.. code:: python

pipe_pca.transform(X_constrained)
pipe_pca.fit_transform(X_constrained)
pipe_pca.transform(X_constrained.X)

- You should not try to cross-validate these pipelines with scikit-learn's
cross-validation functions: their input data is a `ConstrainedDataset`, and
splitting it can leave the same points (though of course not the same tuples
of points) in both the train and test sets.
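The caveat above can be seen with a small example: cutting the pairs in half
still leaves some underlying points on both sides of the split.

```python
import numpy as np

# Splitting tuples does not split points: after cutting the pairs in half,
# points 1 and 2 still appear on both sides of the split.
c = np.array([[0, 1], [0, 2], [1, 3], [2, 3]])
train_pairs, test_pairs = c[:2], c[2:]
shared = set(train_pairs.ravel()) & set(test_pairs.ravel())  # shared == {1, 2}
```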

2 changes: 1 addition & 1 deletion metric_learn/__init__.py
@@ -1,6 +1,6 @@
from __future__ import absolute_import

from .constraints import Constraints
from .constraints import Constraints, ConstrainedDataset
from .covariance import Covariance
from .itml import ITML, ITML_Supervised
from .lmnn import LMNN