
[WIP] New API proposal #85


Status: Closed · 35 commits

Conversation

@wdevazelhes (Member) commented Feb 26, 2018:

This PR proposes a new API for the package that would allow scikit-learn's tools for model selection and pipelining to be used not only with supervised algorithms but also with weakly supervised ones. For instance, we would like to take a dataset of pairs and an algorithm that works on labeled pairs (i.e. pairwise constraints), and plug it into, say, scikit-learn's 'roc_auc' cross-validation scoring. This cross-validation scoring would split on the pairs and return the score of a classification of pairs. However, we do not want to build an explicit dataset of pairs: if we duplicated each point as many times as it is involved in a pair, the dataset would be huge. Instead, we build an object that acts as if it were a list of pairs but in fact stores only the initial dataset together with indices. The behaviour of such an object and basic use cases can be found in this Binder notebook.
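To make this concrete, here is a minimal sketch of such an object. It is a simplification for illustration: the name ConstrainedDataset and the attributes X and c match the code reviewed below, while the methods shown are assumptions about the intended behaviour, not this PR's exact implementation.

import numpy as np

class ConstrainedDataset(object):
    """Acts like an array of pairs, but stores only X and pair indices."""

    def __init__(self, X, c):
        self.X = np.asarray(X)  # the original dataset, kept by reference
        self.c = np.asarray(c)  # constraint indices, shape (n_constraints, 2)

    def toarray(self):
        # materialize the pairs only on demand
        return self.X[self.c]   # shape (n_constraints, 2, n_features)

    def __getitem__(self, item):
        # slicing acts on the constraints; X is never copied or pruned
        return ConstrainedDataset(self.X, self.c[item])

    def __len__(self):
        return self.c.shape[0]

# usage: 5 points, 3 pairwise constraints, no duplication of points
X = np.random.randn(5, 2)
c = np.array([[0, 1], [1, 2], [3, 4]])
cd = ConstrainedDataset(X, c)
print(len(cd), cd.toarray().shape)  # 3 (3, 2, 2)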

This PR will be complete when the following is done:

  • Keep the same API but introduce the objects that will be used in the new API
  • Create a ConstrainedDataset object that allows the behaviour presented in the description
  • Modify existing algorithms to use this object. The API will then change (only for algorithms that take labeled pairs of samples as input instead of labeled samples).
  • Write the appropriate documentation

William de Vazelhes added 2 commits February 26, 2018 15:07
@perimosocordiae (Contributor) left a comment:

Looks good so far. I'd be fine with merging this without the rest of the implementation, as it doesn't break anything or mess with compatibility.



-class Covariance(BaseMetricLearner):
+class Covariance(SupervisedMetricLearner):
Contributor:

I understand why this was chosen, but this particular base class made me stop and consider a moment. We may eventually want a base class for unsupervised methods as well.

Member:

I agree, this simple covariance method should not be a SupervisedMetricLearner (as it is completely unsupervised). Whether we will really need an unsupervised class in the long run is unclear, but maybe the best for now is to create an UnsupervisedMetricLearner class which takes only X in fit.
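A minimal sketch of what that base class could look like (hypothetical; only the idea of a fit taking X alone comes from this discussion):

class UnsupervisedMetricLearner(BaseMetricLearner):
    def fit(self, X):
        # unsupervised learners receive only the data matrix, no labels
        raise NotImplementedError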

@wdevazelhes (Member Author):

Thanks for the review! I agree; I had not noticed it, but Covariance is indeed unsupervised, so I will change this in a follow-up PR.

@bellet (Member) left a comment:

Looks good so far. Some remarks:

1/ It may be useful to add ways to update a ConstrainedDataset object, although I am not sure how far we should go (append new constraints? append new points in X? ...). I think it would probably be useful to be able to add new constraint indices in c without modifying X, and maybe to merge two ConstrainedDataset objects (by concatenating the two X's and re-indexing the second set of constraints; see the sketch below). A potential use case to keep in mind is to create constraints from a small set of labeled points and then complement this with constraints on a different/larger set of points.

2/ Do we want to consider a ConstrainedDataset object to be characterized primarily by (a) a fixed dataset X along with some constraints on X, or (b) by a set of constraints, with X giving a memory-efficient representation? In (b), one could remove points in X which are not used in the constraints (at creation of the object, when slicing, etc.). I think (a) makes more sense, and probably simplifies the implementation of the features in 1/.
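A possible sketch of the merge operation mentioned in 1/ (an editorial illustration, not code from this PR; merge_constrained_datasets is a hypothetical helper, and the X and c attributes follow the sketch above):

import numpy as np

def merge_constrained_datasets(cd1, cd2):
    # concatenate the two datasets, then shift the second set of
    # constraint indices by the length of the first X so they still
    # point at the right rows
    X = np.vstack([cd1.X, cd2.X])
    c = np.vstack([cd1.c, cd2.c + len(cd1.X)])
    return ConstrainedDataset(X, c)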

        return self.X[self.c]

    @staticmethod
    def _check_index(length, indices):
Member:

Maybe this could also check for potential duplicates? It could simply show a warning when this is the case. (One could also remove them, but this might create problems later when constraint labels are used.)
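A possible sketch of such a duplicate check (hypothetical helper; it warns rather than removing, as the comment suggests):

import warnings
import numpy as np

def _warn_on_duplicate_constraints(c):
    # np.unique with axis=0 treats each row (pair of indices) as one item
    # (the axis argument requires numpy >= 1.13)
    if len(np.unique(c, axis=0)) < len(c):
        warnings.warn("Duplicate constraints found; they are kept as-is.")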

@wdevazelhes (Member Author) commented Mar 5, 2018:

I agree, I will implement it in an upcoming commit.

William de Vazelhes added 2 commits March 5, 2018 09:15
move pairs wrapper and unwrapper functions in constraints.py, and update itml with constrained_dataset

update sdml to use constrained_dataset

update sdml doc

update LSML with ConstrainedDataset API
@wdevazelhes (Member Author) commented Mar 5, 2018:

> 1/ It may be useful to add ways to update a ConstrainedDataset object, although I am not sure how far we should go (append new constraints? append new points in X? ...). I think it would probably be useful to be able to add new constraint indices in c without modifying X, and maybe to merge two ConstrainedDataset objects (by concatenating the two X's and re-indexing the second set of constraints). A potential use case to keep in mind is to create constraints from a small set of labeled points and then complement this with constraints on a different/larger set of points.

I agree, I will implement it in an upcoming commit.

> 2/ Do we want to consider a ConstrainedDataset object to be characterized primarily by (a) a fixed dataset X along with some constraints on X, or (b) by a set of constraints, with X giving a memory-efficient representation? In (b), one could remove points in X which are not used in the constraints (at creation of the object, when slicing, etc.). I think (a) makes more sense, and probably simplifies the implementation of the features in 1/.

Yes indeed. While the original notebook implements (b), (a) (the solution of this PR) has a lot of advantages: it is simpler (I agree it will simplify the implementation of 1/), and the behaviour is consistent between creation and slicing (in the original notebook, X was pruned at slicing, but we had to keep the full X at creation to support pipelining transformers (e.g. make_pipeline(MetricLearner(), PCA())), because they are applied to points and not constraints). Moreover, with (b), if we want to prune X when slicing, we will usually have to copy the samples of X that we keep: we can only get a view of X when taking a genuine slice of it, since fancy indexing of a numpy array returns a copy (e.g. X[[5, 6, 7, 8]] copies, whereas the equivalent X[5:9] is a view). Therefore, (a) lets us keep only a reference to the initial dataset, which is memory efficient.
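A quick demonstration of the view-versus-copy point, using standard numpy behaviour:

import numpy as np

X = np.random.randn(20, 5)

view = X[5:9]                     # basic slicing returns a view
copy = X[[5, 6, 7, 8]]            # fancy indexing returns a copy

print(np.shares_memory(X, view))  # True: no extra memory used
print(np.shares_memory(X, copy))  # False: the rows were duplicated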

@wdevazelhes (Member Author) commented Mar 5, 2018:

I just committed some API changes in the algorithms, making the weakly supervised algorithms use a ConstrainedDataset and y as input to fit. For now I just wrapped the previous behaviour inside fit with some helper functions (unwrap_pairs, wrap_pairs, unwrap_to_graph) as a proof of concept, and made the tests pass, but I will make a cleaner implementation later on.
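As an illustration, here is a sketch of what wrap_pairs could look like (a hypothetical reconstruction based on the snippets reviewed below; it assumes the old-API four-tuple (a, b, c, d) of equal-length index arrays, with label 0 marking the first group of pairs):

import numpy as np

def wrap_pairs(X, constraints):
    # constraints is the old-API four-tuple (a, b, c, d) of index arrays
    a, b, c, d = [np.asarray(idx) for idx in constraints]
    pairs = np.vstack([np.column_stack([a, b]),
                       np.column_stack([c, d])])
    y = np.concatenate([np.zeros(len(a)), np.ones(len(c))])
    return ConstrainedDataset(X, pairs), y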

@bellet (Member) left a comment:

The unwrap_pairs helper function is nice as an intermediary step, but in the future we should aim to avoid unpacking the ConstrainedDataset object unless really necessary.

wrap_pairs is useful to turn input data from the previous API into a format compatible with the new one. If we want, we could probably use this to provide backward compatibility in the next release (with deprecation warnings)?

@@ -59,7 +59,7 @@ def fit(self, X, y):

 class WeaklySupervisedMetricLearner(BaseMetricLearner):

-    def fit(self, X, constraints):
+    def fit(self, constrained_dataset, y):
Member:

Maybe think of a more concise naming convention instead of constrained_dataset, as it is going to be used all over the place. X_constrained perhaps?

@wdevazelhes (Member Author):

I agree, X_constrained seems good

    negative pairs
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
Member:

should be n_constraints instead of n

@wdevazelhes (Member Author):

Yes, indeed, will change this

    dissimilar pairs
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
Member:

same as above

@wdevazelhes (Member Author):

will do

    connectivity graph, with +1 for positive pairs and -1 for negative
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
Member:

same as above

@wdevazelhes (Member Author):

will do

William de Vazelhes added 3 commits March 6, 2018 13:36
change constrained_dataset into X_constrained

correct shape of y for weakly supervised algorithms
# Conflicts:
#	metric_learn/base_metric.py
#	metric_learn/itml.py
#	metric_learn/lsml.py
#	metric_learn/mmc.py
#	metric_learn/sdml.py
return LSML.fit(self, X, pairs, weights=self.weights)
X_constrained = ConstrainedDataset(X, np.hstack([pairs[i][:, None]
                                                 for i in range(4)]))
Contributor:

np.column_stack(pairs) seems to be what you want here.
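For the record, the two forms are equivalent (a quick check, assuming pairs is a 4-tuple of equal-length 1-D index arrays):

import numpy as np

pairs = tuple(np.random.randint(0, 10, size=5) for _ in range(4))
stacked = np.hstack([pairs[i][:, None] for i in range(4)])
# column_stack turns each 1-D array into a column and joins them
assert np.array_equal(stacked, np.column_stack(pairs))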

@wdevazelhes (Member Author):

Great, thanks!

-from .constraints import Constraints
+from .base_metric import PairsMetricLearner, SupervisedMetricLearner
+from .constraints import Constraints, ConstrainedDataset, unwrap_pairs, \
+    wrap_pairs
Contributor:

Style nitpick: Use parens to break up long import lists:

from .constraints import (
    Constraints, ConstrainedDataset, unwrap_pairs, wrap_pairs)

@wdevazelhes (Member Author):

Thanks!

        self.check_indexing(item)

    def test_repr(self):
        assert repr(cd) == repr(X[c])
Contributor:

Use self.assertEqual(...) instead of bare asserts.

@wdevazelhes (Member Author):

Thanks!

simplify columns concatenation

better syntax for long imports

change to better syntax for asserting equalities
        np.testing.assert_array_equal(cd[idx].toarray(), X[c][idx])

    def test_inputs(self):
        # test the allowed and forbidden ways to create a ConstrainedDataset
Contributor:

I'd split this into two separate tests, test_allowed_inputs and test_invalid_inputs.

@wdevazelhes (Member Author):

I agree, will do

out_of_range_constraints = [[1, 2], [0, 1]]
msg = "ConstrainedDataset cannot be created: the length of " \
      "the dataset is 2, so index 2 is out of " \
      "range."
Contributor:

Similar to imports, use parentheses here:

msg = ("First part of string. "
       "Second part of string.")

@wdevazelhes (Member Author):

thanks, will do

y = np.random.randint(0, 2, c.shape[0])
group = np.random.randint(0, 3, c.shape[0])

c_shape = c.shape[0]
Contributor:

This is only used by test_getitem, so I'd move it there. (And probably name it num_constraints)

@bellet (Member) commented Mar 14, 2018:

Actually this is also used in test_shape (although not through this variable). It could be defined before y and group, since they also use c.shape[0].

@wdevazelhes (Member Author):

I agree; I will rename it and move it before the definitions that use it.


def unwrap_pairs(X_constrained, y):
    a = X_constrained.c[(y == 0)[:, 0]][:, 0]
    b = X_constrained.c[(y == 0)[:, 0]][:, 1]
Contributor:

This seems redundant. How about:

y_zero = (y == 0).ravel()
a, b = X_constrained[y_zero].T
c, d = X_constrained[~y_zero].T

@wdevazelhes (Member Author) commented Mar 19, 2018:

yes, will do

c = np.array(constraints[2])
d = np.array(constraints[3])
constraints = np.vstack([np.hstack([a[:, None], b[:, None]]),
                         np.hstack([c[:, None], d[:, None]])])
Contributor:

a, b, c, d = constraints
constraints = np.vstack((np.column_stack((a, b)), np.column_stack((c, d))))
# or if we have numpy 1.13+
constraints = np.block([[a, b], [c, d]])

@wdevazelhes (Member Author):

thanks, will do

from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.utils.testing import assert_raise_message

X = np.random.randn(20, 5)
Contributor:

It would be nice to test sparse X as well. To do so, I'd write something like this:

class _BaseTestConstrainedDataset(unittest.TestCase):
    # everything currently under TestConstrainedDataset, but using self.X instead of X

class TestDenseConstrainedDataset(_BaseTestConstrainedDataset):
    def setUp(self):
        self.X = np.random.randn(20, 5)
        self.c = ...  # and so on

class TestSparseConstrainedDataset(_BaseTestConstrainedDataset):
    # similar, but setUp creates a sparse X
Member:

Maybe it would make sense to also test other data types, like lists instead of numpy arrays?

@wdevazelhes (Member Author):

I agree, I will add these tests in an upcoming commit.

William de Vazelhes added 2 commits March 19, 2018 11:22
fix WeaklySupervisedMixin not imported

make lower level mixins inherit from higher level ones and remove unnecessary imports

make link between shogun lmnn and python lmnn more coherent with the new class architecture

update lmnn testing
# Conflicts:
#	metric_learn/lsml.py
#	metric_learn/mmc.py
@wdevazelhes (Member Author) commented Mar 19, 2018:

Hi, I just pushed a modification of the class structure. As discussed with @nvauquie and @bellet, the following was problematic: MMC_Supervised, for instance, inherited from both SupervisedMetricLearner (through MMC inheritance) and WeaklySupervisedMetricLearner. This was confusing and could also cause bugs, for instance if I want to put an abstract predict in WeaklySupervisedMetricLearner. Indeed, when testing MMC_Supervised with scikit-learn's check_estimator, scikit-learn lists all the functions implemented by MMC_Supervised, and since it implements predict (through WeaklySupervisedMetricLearner inheritance), it would run some tests on it with a synthetic dataset. But this predict is not a real scikit-learn one, as it takes a ConstrainedDataset rather than a numpy array as input, so those tests would fail.
One way to solve this problem is to replace the class hierarchy with mixins, as is done in many of scikit-learn's estimators: see for instance KNeighborsClassifier https://github.com/scikit-learn/scikit-learn/blob/788a458bba353c2cf3cfa5a15d6f68315149ef9e/sklearn/neighbors/classification.py#L23 and KNeighborsRegressor https://github.com/scikit-learn/scikit-learn/blob/788a458bba353c2cf3cfa5a15d6f68315149ef9e/sklearn/neighbors/regression.py#L19. The same is done for LinearRegression: https://github.com/scikit-learn/scikit-learn/blob/788a458bba353c2cf3cfa5a15d6f68315149ef9e/sklearn/linear_model/logistic.py#L952 https://github.com/scikit-learn/scikit-learn/blob/788a458bba353c2cf3cfa5a15d6f68315149ef9e/sklearn/linear_model/base.py#L284
For now this architecture will allow us to develop the next features (like the predict function in weakly supervised metric learners), but it can be changed later.
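A rough illustration of the mixin pattern described above (the class names follow the diffs below, but the composition shown is a hypothetical example, not the exact hierarchy of this PR):

class BaseMetricLearner(object):
    # common machinery shared by all learners (metric, transformer, ...)
    pass

class WeaklySupervisedMixin(object):
    def fit(self, X_constrained, y):
        raise NotImplementedError

    def predict(self, X_constrained):
        # e.g. classify pairs; only meaningful for weakly supervised learners
        raise NotImplementedError

class PairsMixin(WeaklySupervisedMixin):
    # pair-specific helpers and scoring
    pass

# an estimator opts in to exactly the capabilities it supports,
# instead of inheriting a predict it cannot honour:
class SomePairsLearner(BaseMetricLearner, PairsMixin):
    pass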

from sklearn.utils.validation import check_array


class TransformerMixin(object):
Contributor:

Does any class other than BaseMetricLearner use this mixin? If not, I'd just inline it into BaseMetricLearner.

@wdevazelhes (Member Author):

Done

# TODO: introduce specific scoring functions etc


class QuadrupletsMixin(UnsupervisedMixin):
Contributor:

Quadruplets imply weak supervision.

@wdevazelhes (Member Author):

I agree
Done

return NotImplementedError


class WeaklySupervisedMixin(object):
Contributor:

Add NotImplementedError for the __init__ methods of this and the above abstract mixins.

@wdevazelhes (Member Author):

Done

@@ -34,3 +34,8 @@ def fit(self, X, y=None):
         else:
             self.M_ = np.linalg.inv(self.M_)
         return self
+
+class Covariance(_Covariance, UnsupervisedMixin):
Contributor:

If nothing else will use _Covariance, I'd prefer to eliminate it and use the UnsupervisedMixin directly.

Contributor:

In general, I don't think it's worth separating out the algorithm implementation unless/until we have >1 class that needs it.

X : array-like of shape [n_samples, n_features], or ConstrainedDataset
    Training set.

y : numpy array of shape [n_samples] or 4-tuple of arrays
Member:

with the new API y should only be constraint labels or None, shouldn't it?

@wdevazelhes (Member Author):

Thanks, indeed I forgot to update the 4-tuple part.
Originally I wanted to allow fit_transform to be used in the classical way (with labels as y) but also in the weakly supervised way (with constraint labels), and y could be None in both cases. So something like:

y : None or numpy array of shape [n_samples] or [n_constraints]
    Target values, or constraint labels

However, now I am not sure it is a good idea: maybe we should make this fit_transform specific to weakly supervised learning (so with y being constraint labels or None, as you say) and inline it into WeaklySupervisedMixin rather than BaseEstimator, and make SupervisedMixin inherit from scikit-learn's TransformerMixin to get the classical version of fit_transform for supervised estimators, like this:

class WeaklySupervisedMixin(object):
    def fit_transform(self, X_constrained, y=None, **fit_params):
        """
        [...]
        y : None, or numpy array of shape [n_constraints]
            Constraint labels.
        """

class SupervisedMixin(TransformerMixin):

@wdevazelhes (Member Author) commented:
I just pushed a commit that allows weakly supervised learners to make predictions and compute decision functions. I added some testing, but we cannot run check_estimator on them (as they train on ConstrainedDatasets rather than array-like data). So I tried to take a relevant subset of the scikit-learn tests that are run by check_estimator and adapt it to WeaklySupervisedMetricLearners. Some may be more than we need for now, but I thought they might be useful later when developing new algorithms. Let me know if they add confusion to the project and should rather be removed.
Note that some tests fail for now due to numerical errors, but they will be fixed in the future (I think the synthetic random dataset I used for testing is not really learnable, so I will replace it with a better one).

In another branch, I am also trying to use scikit-learn's check_estimator anyway, by wrapping weakly supervised metric learners into estimators compatible with check_estimator (but not through supervised metric learners, which is already tested in test_sklearn_compat). This would allow testing the prediction, scoring, etc. behaviour of weakly supervised metric learners through scikit-learn's check_estimator. However, this is not easy, because the check_estimator routines launch a series of tests with classical array-like data as input, so I am trying to write a wrapper estimator that creates a synthetic ConstrainedDataset from an array-like input, forcing y to be boolean so that it looks like the label of a pairwise constraint. I am not sure about this... The branch is here https://github.com/wdevazelhes/metric-learn/commits/feat/predictions_more_tests
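A rough sketch of the wrapper idea (entirely hypothetical; the class name and the consecutive-sample pairing scheme are illustrative, not taken from the branch):

import numpy as np
from sklearn.base import BaseEstimator

class ArrayToConstraintsWrapper(BaseEstimator):
    """Adapts a pairs learner to the (X, y) interface check_estimator expects."""

    def __init__(self, pairs_learner=None):
        self.pairs_learner = pairs_learner

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # build one synthetic pair per consecutive sample...
        c = np.column_stack([np.arange(len(X) - 1), np.arange(1, len(X))])
        # ...labeled 1 when the two samples share a class, 0 otherwise,
        # so that y looks like boolean constraint labels
        y_pairs = (y[c[:, 0]] == y[c[:, 1]]).astype(int)
        self.pairs_learner.fit(ConstrainedDataset(X, c), y_pairs)
        return self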

@wdevazelhes (Member Author) commented:
Hi, I just added some documentation. Most of it is in the Weakly Supervised page, but there is also some in the docstrings of ConstrainedDataset and those of PairsMixin, TripletsMixin and QuadrupletsMixin.

wdevazelhes pushed a commit to wdevazelhes/metric-learn that referenced this pull request May 22, 2018
Basically these are the tests from PR scikit-learn-contrib#85, but reformatted to use pytest and operating on tuples instead of ConstrainedDatasets.
wdevazelhes pushed a commit to wdevazelhes/metric-learn that referenced this pull request May 24, 2018
- Make PairsClassifierMixin and QuadrupletsClassifierMixin classes, to implement scoring functions
- Implement a new API for supervised wrappers of weakly supervised learning estimators, through the use of base classes (e.g. BaseMMC) from which child classes inherit (e.g. MMC and MMC_Supervised), which is the same idea as in PR scikit-learn-contrib#85
- Delete tests that use tuples learners as transformers (we do not want to support this behaviour anymore: it is too complicated to allow such different input types (tuples or points) for the same estimator)
@bellet (Member) commented Nov 15, 2018:

For clarity I will close this one since it is about a previous API attempt

@bellet closed this Nov 15, 2018