[WIP] New API proposal #85

Conversation
…aviour even for WeaklySupervisedMetricLearner
Looks good so far. I'd be fine with merging this without the rest of the implementation, as it doesn't break anything or mess with compatibility.
metric_learn/covariance.py (outdated)
```diff
-class Covariance(BaseMetricLearner):
+class Covariance(SupervisedMetricLearner):
```
I understand why this was chosen, but this particular base class made me stop and consider a moment. We may eventually want a base class for unsupervised methods as well.
I agree, this simple covariance method should not be a SupervisedMetricLearner (it is completely unsupervised). Whether we will really need an unsupervised class in the long run is unclear, but maybe the best option for now is to create an UnsupervisedMetricLearner class which takes only X in fit.
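For illustration, a minimal sketch of what such a base class could look like (the class name follows the comment above; the Covariance body is only a rough paraphrase of the existing estimator, not the actual implementation):

```python
import numpy as np
from sklearn.utils.validation import check_array

from .base_metric import BaseMetricLearner  # existing base class


class UnsupervisedMetricLearner(BaseMetricLearner):
    """Sketch of a base class for learners that take only X in fit."""

    def fit(self, X):
        raise NotImplementedError


class Covariance(UnsupervisedMetricLearner):
    def fit(self, X):
        # Rough paraphrase of the covariance estimator: learn a Mahalanobis
        # matrix as the (pseudo-)inverse of the data covariance.
        X = check_array(X)
        self.M_ = np.linalg.pinv(np.cov(X, rowvar=False))
        return self
```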
Thanks for the review! I agree, I had not noticed, but indeed Covariance is unsupervised, so I will change this in a follow-up PR.
Looks good so far. Some remarks:
1/ It may be useful to add ways to update a ConstrainedDataset object, although I am not sure how far we should go (append new constraints? append new points to X? ...). I think it would probably be useful to be able to add new constraint indices to c without modifying X, and maybe to merge two ConstrainedDataset objects (by concatenating the two X's, and the two constraint sets after re-indexing); see the sketch after this list. A potential use case to keep in mind is creating constraints from a small set of labeled points and then complementing them with constraints on a different/larger set of points.
2/ Do we want to consider a ConstrainedDataset object to be characterized primarily by (a) a fixed dataset X, along with some constraints on X, or (b) by a set of constraints, with X giving a memory-efficient representation? In (b), one could remove points of X which are not used in the constraints (at creation of the object, when slicing, etc.). I think (a) makes more sense, and probably simplifies the implementation of the ideas in 1/.
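As a rough sketch of the merge idea in 1/ (assuming the X and c attributes used elsewhere in this PR; the helper name is made up):

```python
import numpy as np


def merge_constrained_datasets(cd1, cd2):
    """Hypothetical merge of two ConstrainedDataset objects."""
    # Concatenate the two X's; shift the second set of constraint indices
    # by the length of the first X so they index into the stacked X.
    X = np.vstack((cd1.X, cd2.X))
    c = np.vstack((cd1.c, cd2.c + cd1.X.shape[0]))
    return ConstrainedDataset(X, c)
```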
metric_learn/constraints.py (outdated)
```python
        return self.X[self.c]

    @staticmethod
    def _check_index(length, indices):
```
Maybe this could also check for potential duplicates? It could simply show a warning when this is the case. (One could also remove them, but this might create problems later when constraint labels are used.)
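For instance, the duplicate check could be a small helper like this (a sketch; the name and warning text are made up):

```python
import warnings


def _warn_on_duplicate_constraints(constraints):
    """Hypothetical check: warn if an index pair appears more than once."""
    if len(set(map(tuple, constraints))) < len(constraints):
        # Only warn rather than deduplicate: removing constraints here
        # could desynchronize them from constraint labels used later.
        warnings.warn("Duplicate constraints in ConstrainedDataset.")
```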
I agree, I will implement it in a next commit
move pairs wrapper and unwrapper functions in constraints.py, and update itml with constrained_dataset
update sdml to use constrained_dataset
update sdml doc
update LSML with ConstrainedDataset API
Yes indeed. While in the original notebook (b) is implemented, (a) (the solution of this PR) has a lot of advantages: it is simpler (I agree it will simplify the implementation of 1/), and the behaviour is consistent between creation and slicing (in the original notebook X was pruned at slicing, but we had to keep the full X at creation in the case of pipelining transformers, e.g. …
I just committed some API changes in the algorithms, making the Weakly Supervised algorithms use ConstrainedDataset and y as an input in …
The unwrap_pairs helper function is nice as an intermediary step, but in the future we should aim at avoiding unpacking the ConstrainedDataset object as much as possible, unless really necessary. wrap_pairs is useful to turn input data from the previous API into a format compatible with the new one. If we want, we could probably use it to provide backward compatibility in the next release (with deprecation warnings)?
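A backward-compatibility shim might look roughly like this (a sketch only; it assumes wrap_pairs(X, constraints) returns a (ConstrainedDataset, y) pair, which may not match the final signature):

```python
import warnings


def fit(self, X, constraints=None, y=None):
    """Hypothetical backward-compatible fit for a pairs learner."""
    if constraints is not None:
        warnings.warn("Passing (X, constraints) to fit is deprecated; pass "
                      "a ConstrainedDataset and constraint labels y instead.",
                      DeprecationWarning)
        X, y = wrap_pairs(X, constraints)
    # X is now a ConstrainedDataset; delegate to the new-style fit logic
    # (_fit is a stand-in name for the real implementation).
    return self._fit(X, y)
```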
metric_learn/base_metric.py (outdated)
```diff
@@ -59,7 +59,7 @@ def fit(self, X, y):
 class WeaklySupervisedMetricLearner(BaseMetricLearner):

-    def fit(self, X, constraints):
+    def fit(self, constrained_dataset, y):
```
Maybe think of a more concise naming convention instead of constrained_dataset, as it is going to be used all over the place. X_constrained, perhaps?
I agree, X_constrained seems good.
metric_learn/itml.py (outdated)
```
... negative pairs
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
```
should be n_constraints instead of n
Yes, indeed, will change this
metric_learn/mmc.py (outdated)
```
... dissimilar pairs
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
```
same as above
will do
metric_learn/sdml.py (outdated)
```
... connectivity graph, with +1 for positive pairs and -1 for negative
constrained_dataset : ConstrainedDataset
    with constraints being an array of shape [n_constraints, 2]
y : array-like, shape (n x 1)
```
same as above
will do
change constrained_dataset into X_constrained
correct shape of y for weakly supervised algorithms
# Conflicts:
#   metric_learn/base_metric.py
#   metric_learn/itml.py
#   metric_learn/lsml.py
#   metric_learn/mmc.py
#   metric_learn/sdml.py
metric_learn/lsml.py (outdated)
```python
        return LSML.fit(self, X, pairs, weights=self.weights)
    X_constrained = ConstrainedDataset(X, np.hstack([pairs[i][:, None]
                                                     for i in
                                                     range(4)]))
```
np.column_stack(pairs) seems to be what you want here.
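(For reference, a toy check of the equivalence, assuming pairs is a 4-tuple of 1-D index arrays as in the snippet above: np.column_stack stacks them as columns, giving the same (n, 4) array as the hstack expression.)

```python
import numpy as np

pairs = tuple(np.random.randint(10, size=5) for _ in range(4))  # toy data
stacked = np.column_stack(pairs)  # one column per index array
assert stacked.shape == (5, 4)
```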
Great, thanks!
metric_learn/mmc.py
Outdated
from .constraints import Constraints | ||
from .base_metric import PairsMetricLearner, SupervisedMetricLearner | ||
from .constraints import Constraints, ConstrainedDataset, unwrap_pairs, \ | ||
wrap_pairs |
Style nitpick: use parens to break up long import lists:
```python
from .constraints import (
    Constraints, ConstrainedDataset, unwrap_pairs, wrap_pairs)
```
Thanks!
test/test_constrained_dataset.py (outdated)
```python
        self.check_indexing(item)

    def test_repr(self):
        assert repr(cd) == repr(X[c])
```
Use self.assertEqual(...) instead of bare asserts.
Thanks!
simplify columns concatenation
better syntax for long imports
change to better syntax for asserting equalities
test/test_constrained_dataset.py (outdated)
```python
        np.testing.assert_array_equal(cd[idx].toarray(), X[c][idx])

    def test_inputs(self):
        # test the allowed and forbidden ways to create a ConstrainedDataset
```
I'd split this into two separate tests, test_allowed_inputs and test_invalid_inputs.
I agree, will do
test/test_constrained_dataset.py
Outdated
out_of_range_constraints = [[1, 2], [0, 1]] | ||
msg = "ConstrainedDataset cannot be created: the length of " \ | ||
"the dataset is 2, so index 2 is out of " \ | ||
"range." |
Similar to imports, use parentheses here:
```python
msg = ("First part of string. "
       "Second part of string.")
```
thanks, will do
test/test_constrained_dataset.py
Outdated
y = np.random.randint(0, 2, c.shape[0]) | ||
group = np.random.randint(0, 3, c.shape[0]) | ||
|
||
c_shape = c.shape[0] |
This is only used by test_getitem, so I'd move it there. (And probably name it num_constraints.)
Actually this is also used in test_shape (although not through this variable). It could be defined before y and group, since they also use c.shape[0].
I agree, I will rename it and put it before the definitions that use it.
metric_learn/constraints.py (outdated)
```python
def unwrap_pairs(X_constrained, y):
    a = X_constrained.c[(y == 0)[:, 0]][:, 0]
    b = X_constrained.c[(y == 0)[:, 0]][:, 1]
```
This seems redundant. How about:
```python
y_zero = (y == 0).ravel()
a, b = X_constrained.c[y_zero].T
c, d = X_constrained.c[~y_zero].T
```
yes, will do
metric_learn/constraints.py (outdated)
```python
c = np.array(constraints[2])
d = np.array(constraints[3])
constraints = np.vstack([np.hstack([a[:, None], b[:, None]]),
                         np.hstack([c[:, None], d[:, None]])])
```
```python
a, b, c, d = constraints
constraints = np.vstack((np.column_stack((a, b)), np.column_stack((c, d))))
# or, if we have numpy 1.13+ (the index arrays must be stacked as columns):
constraints = np.block([[a[:, None], b[:, None]], [c[:, None], d[:, None]]])
```
thanks, will do
test/test_constrained_dataset.py
Outdated
from sklearn.model_selection import StratifiedKFold, KFold | ||
from sklearn.utils.testing import assert_raise_message | ||
|
||
X = np.random.randn(20, 5) |
It would be nice to test sparse X as well. To do so, I'd write something like this:
```python
class _BaseTestConstrainedDataset(unittest.TestCase):
    # everything currently under TestConstrainedDataset, but using self.X
    # instead of X
    ...

class TestDenseConstrainedDataset(_BaseTestConstrainedDataset):
    def setUp(self):
        self.X = np.random.randn(20, 5)
        self.c = ...  # and so on

class TestSparseConstrainedDataset(_BaseTestConstrainedDataset):
    ...  # similar, but setUp creates a sparse X
```
Maybe it would make sense to also test other data types, like lists instead of numpy arrays?
I agree, I will add these tests in a next commit
fix WeaklySupervisedMixin not imported
make lower level mixins inherit from higher level ones and remove unnecessary imports
make link between shogun lmnn and python lmnn more coherent with the new class architecture
update lmnn testing
# Conflicts:
#   metric_learn/lsml.py
#   metric_learn/mmc.py
Hi, I just pushed a modification of the class structure. As discussed with @nvauquie and @bellet, the following fact was problematic: …
metric_learn/base_metric.py (outdated)
```python
from sklearn.utils.validation import check_array


class TransformerMixin(object):
```
Does any class other than BaseMetricLearner use this mixin? If not, I'd just inline it into BaseMetricLearner.
Done
metric_learn/base_metric.py
Outdated
# TODO: introduce specific scoring functions etc | ||
|
||
|
||
class QuadrupletsMixin(UnsupervisedMixin): |
Quadruplets imply weak supervision.
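That is, presumably something like this (a sketch of the suggested change):

```python
class QuadrupletsMixin(WeaklySupervisedMixin):
    # quadruplets carry weak supervision, so inherit from the weakly
    # supervised mixin rather than from UnsupervisedMixin
    pass
```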
I agree
Done
```python
        return NotImplementedError


class WeaklySupervisedMixin(object):
```
Add NotImplementedError for the __init__ methods of this and the above abstract mixins.
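Presumably along these lines (a sketch of the suggestion; the error message is made up):

```python
class WeaklySupervisedMixin(object):
    def __init__(self, *args, **kwargs):
        # Abstract mixin: not meant to be instantiated directly.
        raise NotImplementedError("WeaklySupervisedMixin is abstract; "
                                  "instantiate a concrete subclass instead.")
```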
Done
metric_learn/covariance.py (outdated)
```diff
@@ -34,3 +34,8 @@ def fit(self, X, y=None):
     else:
       self.M_ = np.linalg.inv(self.M_)
     return self
+
+class Covariance(_Covariance, UnsupervisedMixin):
```
If nothing else will use _Covariance, I'd prefer to eliminate it and use UnsupervisedMixin directly.
In general, I don't think it's worth separating out the algorithm implementation unless/until we have >1 class that needs it.
…ervised algorithms
metric_learn/base_metric.py
Outdated
X : array-like of shape [n_samples, n_features], or ConstrainedDataset | ||
Training set. | ||
|
||
y : numpy array of shape [n_samples] or 4-tuple of arrays |
With the new API, y should only be constraint labels or None, shouldn't it?
Thanks, indeed I forgot to update the 4-tuple part.
Originally I wanted to allow fit_transform to be used in a classical way (with labels as y) but also in the weakly supervised way (with constraint labels), and it could be None in both cases. So something like:
```
y : None, or numpy array of shape [n_samples] or [n_constraints]
    Target values, or constraint labels.
```
However, now I am not sure it is a good idea: maybe we should make this fit_transform specific to weakly supervised learning (so with y being constraint labels or None, as you say) and inline it into WeaklySupervisedMixin rather than BaseEstimator, and make SupervisedMixin inherit from scikit-learn's TransformerMixin to have the classical version of fit_transform for supervised estimators, like this:
```python
class WeaklySupervisedMixin(object):
    def fit_transform(self, X_constrained, y=None, **fit_params):
        """
        [...]
        y : None, or numpy array of shape [n_constraints]
            Constraints labels.
        """


class SupervisedMixin(TransformerMixin):
```
I just pushed a commit that allows doing predictions and decision functions with weakly supervised learners. I added some testing, but we cannot do …

In another branch, I am also trying to use scikit-learn's …
Hi, I just added some documentation. Most of it is in the Weakly Supervised page, but there is also some in the docstrings of …
Basically these are the tests from PR scikit-learn-contrib#85, but reformatted to use pytest, and using tuples instead of ConstrainedDatasets.
- Make PairsClassifierMixin and QuadrupletsClassifierMixin classes, to implement scoring functions
- Implement a new API for supervised wrappers of weakly supervised learning estimators, through the use of base classes (e.g. BaseMMC) from which child classes inherit (e.g. MMC and MMC_Supervised), which is the same idea as in PR scikit-learn-contrib#85
- Delete tests that use tuples learners as transformers, as we do not want to support this behaviour anymore: it is too complicated to allow such different input types (tuples or points) for the same estimator
For clarity I will close this one since it is about a previous API attempt.
This PR intends to provide a new API for the package that would allow using scikit-learn's tools for model selection and pipelining, not only for supervised algorithms but also for weakly supervised ones. For instance, we would like to be able to take a dataset of pairs and an algorithm that works on labeled pairs (i.e. pairwise constraints), and plug it into, say, scikit-learn's 'roc_auc' cross-validation scoring. This cross-validation scoring would split on the pairs and return the score of a classification of pairs. However, we don't want to build an explicit dataset of pairs: if we duplicated a point as many times as it is involved in a pair, the dataset would be huge. Instead, we build an object that acts as if it were a list of pairs but in fact stores only the initial dataset along with indices. The behaviour of such an object and basic use cases can be found in this binder notebook.
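To make this concrete, here is a minimal sketch of what such an object could look like (the attribute names X and c follow the ones used in this thread; the actual class in this PR has more validation and sparse-matrix support):

```python
import numpy as np


class ConstrainedDataset(object):
    """Acts like the explicit array of pairs X[c] without building it."""

    def __init__(self, X, c):
        self.X = np.asarray(X)  # (n_samples, n_features), stored only once
        self.c = np.asarray(c)  # (n_constraints, 2) indices into X
        if self.c.max() >= self.X.shape[0]:
            raise ValueError("constraint index out of range")

    def __getitem__(self, idx):
        # Slicing selects constraints; X is kept whole (option (a) above).
        return ConstrainedDataset(self.X, self.c[idx])

    def __len__(self):
        return self.c.shape[0]

    def toarray(self):
        # Materialize the (n_constraints, 2, n_features) array of pairs.
        return self.X[self.c]
```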
This PR will be complete when the following is done: