[MRG] Add preprocessor option #117

wdevazelhes · 2018-08-21T16:13:35Z

This PR adds an argument preprocessor to weakly supervised algorithms initialization: an option that allows them to accept two types of input: either 3D arrays of tuples of points as before, or 2D arrays of indices/identifiers of points. In the latter case, the preprocessor given as input to the Metric Learner will allow to retreive points from identifiers. The preprocessor is basically a callable that is called on a list of identifiers (or just on one identifier ? we'll have to decide), and that returns the points associated to this identifiers.
If instead of a callable the user provides an array X, it will automatically create the "indexing" preprocessor (i.e. a preprocessor such that preprocessor(id) returns X[id]). We could also imagine other shortcuts like this (for instance the user could provide a string of a root folder containing images and then it would create a preprocessor such that preprocessor("img2.png") would return the vector associated to the image located at "rootfolder/img2.png")

Note: This PR branched from MahalanobisMixin PR, so as soon as MahalanobisMixin is merged the diff should become more readable.

TODO:

Add comments and more info to this PR
Add tests: in progress: add tests to check that the output format of check_input is as specified
- ~~[ ] Add documentation~~ -> probably in another PR
Make it simpler with unified check_input function
Refactor the check_input function and its tests to be cleaner
Write docstrings
Refactor tests to have the list of metric learners at one place
Fix the linalg bug

- Make them inherit from Mahalanobis Mixin, and implement the metric_ property - Improve metric_ property by checking if it exists and raising the appropriate warning if not - Make tests work, by replacing metric() with metric_

# Conflicts: # metric_learn/base_metric.py # metric_learn/covariance.py # metric_learn/itml.py # metric_learn/lfda.py # metric_learn/lmnn.py # metric_learn/lsml.py # metric_learn/mlkr.py # metric_learn/mmc.py # metric_learn/nca.py # metric_learn/rca.py # metric_learn/sdml.py

… new check_no_attributes_set_in_init" This new function was introduced through PR scikit-learn/scikit-learn#9450 in scikit-learn. It also allows to pass tests that would otherwise not pass: indeed having abstract attributes as properties threw an error. But the new test functions handles well this property inheritance.

…scikit-learn#9450 Indeed, in the PR this function is modified to support python 2. This should solve the CI error.

…() in MahalanobisMixin

… computed from the transformer

… and metric()

- Make MahalanobisMixin inherit from BaseMetricLearner to give a concrete implementation of score_pairs - Use score_pairs to compute more easily predict - Add docstring - TST: for every algorithm: - test that using score_pairs pairwise returns an euclidean distance matrix - test that score_pairs works for 3D arrays of several pairs as well as 2D arrays of one pair (and there returns only a scalar) - test that score_pairs always returns a finite output

- add the function and docstring - use it for score_pairs - TST : - should be finite - have right output dimension - embedding should be linear - should work on a toy example

- remove unusual s to test functions - remove redundant parenthesis

…t algorithms

- replace embed by transform and add always the input X in calling the function - mutualize _transformer_from_metric not to be overwritten in MMC - improve test_mahalanobis_mixin.test_score_pairs_pairwise according to scikit-learn-contrib#96 (comment) - improve test_mahalanobis_mixin.check_is_distance_matrix - correct typos and nitpicks

…ure_min_features

The previous error message would have said "[...], shape=(shape_of_the_2D_array_extracted_from_3D)" But it is clearer to print the shape of the actual 3D initial array (tuples in input)

…xt in custom functions but to give name to check_array

…nts should follow expression'

…est_same_with_or_without_preprocessor_classic

wdevazelhes · 2018-12-06T13:04:09Z

Hi,
I think the PR should be good to merge now !
I addressed most of the comments, except those remaining ones:

[MRG] Add preprocessor option #117 (review) : preprocessor can be a class, which is indeed a callable but not a function (fn) anymore, plus it can also be an X (array-like). But maybe in tensorflow too they are not necessarily functions ? @terrytang how would you give an input which is just a numpy array in tf.Estimators ? with input_fn=X or creating an input function from X with something like input_fn=make_an_input_function(X) ?
[MRG] Add preprocessor option #117 (comment): I assume I fixed it by adding a comment on why to return two times X but I'm not sure it's what you refered to @bellet ? (Note that I recently refactored build_pairs so also maybe the comment is outdated)
[MRG] Add preprocessor option #117 (comment) : I went for a detailed error message for now, but maybe in the future we could refactor with enums ?

terrytangyuan · 2018-12-07T21:36:13Z

@wdevazelhes Basically input_fn is a function that returns a dictionary of features and targets. It is exposed to users so that they can transform their batches of data to avoid feeding the entire dataset at once and to speed up (asynchronous) training with (distributed) data pipeline. It's a function that gets called that provides batches of data to each fit step so that more flexible data types can be fed in, e.g. iterators or streaming input. For input that is not a function, there are utility methods such as numpy_input_fn and pandas_input_fn that can be used as input_fn, e.g.

age = np.arange(4) * 1.0
height = np.arange(32, 36)
x = {'age': age, 'height': height}
y = np.arange(-32, -28)

with tf.Session() as session:
  input_fn = numpy_input_fn(
      x, y, batch_size=2, shuffle=False, num_epochs=1)

Basically numpy_input_fn facilitates the process of creating a function/callable from numpy arrays.

Regarding "I went for a detailed error message for now, but maybe in the future we could refactor with enums ?" - sure, sounds good.

bellet

LGTM!

bellet · 2018-12-10T16:57:00Z

metric_learn/mlkr.py

    """
    self.num_dims = num_dims
    self.A0 = A0
    self.epsilon = epsilon
    self.alpha = alpha
    self.max_iter = max_iter
+    super(MLKR, self).__init__(preprocessor)

  def _process_inputs(self, X, y):


Looks like you forgot to remove this function after putting things in fit

That's right, thanks :p Done

bellet · 2018-12-11T12:47:13Z

I would say that this is ready to merge, unless @perimosocordiae has comments?

perimosocordiae

I have a few very minor comments, but overall LGTM as well.

This is a massive feature, great job @wdevazelhes !

perimosocordiae · 2018-12-12T02:29:34Z

metric_learn/_util.py

-  Equivalent of `check_array` in scikit-learn.
+def check_input(input_data, y=None, preprocessor=None,
+                type_of_inputs='classic', tuple_size=None, accept_sparse=False,
+                dtype="numeric", order=None,


Super nitpick: use single quotes for dtype='numeric' to keep consistency with the rest of the file.

I agree, thanks
Done

perimosocordiae · 2018-12-12T02:31:31Z

metric_learn/_util.py

+                               args_for_sk_checks['ensure_min_features'],
+                               context))
+  #  normally we don't need to check_tuple_size too because tuple_size
+  # should'nt be able to be modified by any preprocessor


typo: shouldn't

Thanks, done

perimosocordiae · 2018-12-12T02:32:20Z

metric_learn/_util.py

+  #  normally we don't need to check_tuple_size too because tuple_size
+  # should'nt be able to be modified by any preprocessor
+  if input_data.ndim != 3:  # we have to ensure this because check_array
+    # above does not


This is a weird comment line-break. I'd prefer moving the whole comment either above or below the if input_data.ndim != 3: line.

I agree, done

perimosocordiae · 2018-12-12T02:35:02Z

metric_learn/_util.py

+def preprocess_points(points, preprocessor):
+  """form points if there is a preprocessor else keep them as such (assumes
+  that check_points has already been called)"""
+  print("Preprocessing points...")


Remove this print call.

I agree it is not very useful (and not well formatted)

perimosocordiae · 2018-12-12T02:41:02Z

metric_learn/lsml.py

-                              ensure_2d=False, allow_nd=True)
-    quadruplets = check_tuples(quadruplets)
+    quadruplets = self._prepare_inputs(quadruplets,
+                                       type_of_inputs='tuples')


It seems that _prepare_inputs isn't user-facing. So long as users won't need to (commonly) pass an enum as an argument in the public API, I'm +1 for using an enum.

In any case, we should definitely raise an error if an unexpected value is passed.

wdevazelhes · 2018-12-12T15:45:19Z

Thanks @perimosocordiae ! I addressed your comments.

Regarding _prepare_inputs, indeed it is not user-centric for now because the arguments are not copied from check_input (there is *args and **kwargs instead, and no docstring consequently), and it is private. check_input is for now the more user centric function. But in the meantime, _prepare_input would be the main method to use for developers/users who want to create their own custom estimator. So I'm wondering if we shouldn't replace _prepare_input by prepare_input, a public method where the arguments of check_input and their docstring would be copied ? And therefore keep using a string as a flag too ('classic'/'tuples') ?

bellet · 2018-12-12T17:00:45Z

This may be a stupid question but why do we need 2 functions?

wdevazelhes · 2018-12-13T08:50:33Z

I guess maybe we don't necessarily need 2 functions indeed. Originally I thought of check_input as supposedly similar to check_array and check_X_y, and these are functions that can be used (and tested), without any Estimator instantiation. _prepare_input on the contrary is a BaseMetricLearner method that at the same time does self.check_preprocessor, and calls check_input filled with all the parameters that can be infered from knowing we are in some BaseMetricLearner (by the way for now we still have to specify 'type_of_input' but a simple refactoring distinguishing SupervisedMetricLearner and WeaklySupervisedMetricLearner would allow to also fill this argument)
But after all maybe we don't really need check_input on its own (I am not sure users would really want to use it as such, and for the tests we can always create some fake Metric Learners) ?

Having only the method _prepare_input would simplify the code indeed (not having an intermediate layers of arguments to pass, and knowing that we necessarily are inside a BaseMetricLearner would also simplify the error messages (and their testing) since for now we can pass a string or None into check_estimator(estimator=...), and probably other simplifications).

I guess the questions is: what is best: the "feature" (maybe not so useful) to be able to use check_input on its own without any Estimator, or the simplification of the code ?

In any case this is quite a concentrated part of the code so it could be refactored at any time

bellet · 2018-12-13T10:54:56Z

OK, maybe it makes sense to have two functions then. check_input may indeed be used by users while _prepare_input is for developers so should probably remain private?

perimosocordiae · 2018-12-14T03:08:20Z

In it goes!

perimosocordiae · 2018-12-14T03:10:12Z

@wdevazelhes this merged into the new_api_design branch. Can you make a PR to merge that branch back into master?

bellet · 2018-12-14T07:24:40Z

Shouldn't we write the doc first (#133)? This branch introduces quite a few compatibility breaks

perimosocordiae · 2018-12-14T14:50:50Z

Ah, I didn't notice that gh-133 was off of the new_api_design branch as well. Typically I prefer to do all PRs against master to avoid lots of conflict resolution down the road.

Re: compatibility breaks, users building from master should be ready for breaking changes from time to time. We'll need to make sure the docs are sufficient before cutting the next release, though.

wdevazelhes · 2018-12-17T09:49:20Z

OK, maybe it makes sense to have two functions then. check_input may indeed be used by users while _prepare_input is for developers so should probably remain private?
--

Yes that's true, let's keep this for now and see later how it goes maybe

wdevazelhes · 2018-12-17T09:58:22Z

Ah, I didn't notice that gh-133 was off of the new_api_design branch as well. Typically I prefer to do all PRs against master to avoid lots of conflict resolution down the road.
Re: compatibility breaks, users building from master should be ready for breaking changes from time to time. We'll need to make sure the docs are sufficient before cutting the next release, though.

There's also the fact that the API is quite changed so users would really need some doc informations on how to use the preprocessor etc
But maybe we can try to merge the doc soon even if it is not totally complete with all the examples ? This way we would benefit from users building from source that would try to use the new API and file bugs etc

perimosocordiae · 2018-12-17T14:52:06Z

Yes, I'd prefer to get master up to date sooner than later. Docs are important, and we definitely need good examples before we cut a release, but I want to make sure we're stable on master for a reasonable amount of time before releasing a new version.

wdevazelhes · 2018-12-19T15:27:03Z

I just pushed a minimal version of the documentation in PR #133 (see comment #133 (comment)), I think this could be merged into branch new_api before the end of the week maybe, so that we could merge new_api into master then

William de Vazelhes added 23 commits May 25, 2018 16:25

WIP create MahalanobisMixin

7b3d739

ENH Update algorithms with Mahalanobis Mixin:

f21cc85

- Make them inherit from Mahalanobis Mixin, and implement the metric_ property - Improve metric_ property by checking if it exists and raising the appropriate warning if not - Make tests work, by replacing metric() with metric_

FIX: add missing import

f9e3c82

FIX: take function _get_args from scikit-learn's PR scikit-learn/…

d0f5019

…scikit-learn#9450 Indeed, in the PR this function is modified to support python 2. This should solve the CI error.

ENH: add transformer_ attribute and improve docstring

eba2a60

WIP: move transform() in BaseMetricLearner to transformer_from_metric…

b5d966f

…() in MahalanobisMixin

WIP: refactor metric to original formulation: a function, with result…

ee0d1bd

… computed from the transformer

WIP: make all Mahalanobis Metric Learner algorithms have transformer_…

6b5a3b5

… and metric()

TST add test on toy example for score_pairs

35ece36

ENH Add embed function

dca6838

- add the function and docstring - use it for score_pairs - TST : - should be finite - have right output dimension - embedding should be linear - should work on a toy example

FIX fix error in slicing of quadruplets

3254ce3

FIX minor corrections

e209b21

FIX minor corrections

abea7de

- remove unusual s to test functions - remove redundant parenthesis

FIX fix PEP8 errors

65e794a

FIX remove possible one-sample scoring from docstring for now

12b5429

REF rename n_features_out to num_dims to be more coherent with curren…

eff278e

…t algorithms

ENH: Add check_tuples

585b5d2

FIX: fix parenthesis

af0a3ac

ENH: First commit adding a preprocessor

f1dd4c2

wdevazelhes mentioned this pull request Aug 22, 2018

[MRG] Create new Mahalanobis mixin #96

Merged

7 tasks

William de Vazelhes added 6 commits August 23, 2018 12:02

ENH: Improve check_tuples with more comments and deal better with ens…

47fbf46

…ure_min_features

STY: remove unexpected spaces

8eb419a

FIX: Raise more appropriate error message

033da60

The previous error message would have said "[...], shape=(shape_of_the_2D_array_extracted_from_3D)" But it is clearer to print the shape of the actual 3D initial array (tuples in input)

FIX: fix string formatting and refactor name and context to use conte…

6fae262

…xt in custom functions but to give name to check_array

FIX: only allow 2D if preprocessor and 3D if not preprocessor

758d4cc

FIX: put format arguments in the right order

deb6d5d

William de Vazelhes added 7 commits December 5, 2018 14:35

FIX: fix build_pairs and build_quadruplets because 'only named argume…

0b3e58a

…nts should follow expression'

STY: fix PEP8 error

fbd7242

MAINT: mututalize test_same_with_or_without_preprocessor_tuples and t…

148012e

…est_same_with_or_without_preprocessor_classic

TST: give better names in test_same_with_or_without_preprocessor

8c5675b

MAINT: refactor list_estimators into metric_learners

30061e4

TST: uniformize names input_data - tuples, labels - y

e241c28

FIX: fix build_pairs and build_quadruplets

b6d7de7

wdevazelhes mentioned this pull request Dec 5, 2018

Avoid replicated tests for lists of algorithms #136

Open

3 tasks

wdevazelhes changed the title ~~[WIP] Add preprocessor option~~ [MRG] Add preprocessor option Dec 6, 2018

bellet approved these changes Dec 10, 2018

View reviewed changes

MAINT: remove forgotten code duplication

a44e29a

perimosocordiae approved these changes Dec 12, 2018

View reviewed changes

MAINT: address scikit-learn-contrib#117 (review)

2db6410

perimosocordiae merged commit 010b34a into scikit-learn-contrib:new_api_design Dec 14, 2018

wdevazelhes mentioned this pull request Dec 19, 2018

Add documentation for the new API #133

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Add preprocessor option #117

[MRG] Add preprocessor option #117

wdevazelhes commented Aug 21, 2018 •

edited

Loading

wdevazelhes commented Dec 6, 2018

terrytangyuan commented Dec 7, 2018

bellet left a comment

bellet Dec 10, 2018

wdevazelhes Dec 11, 2018

bellet commented Dec 11, 2018

perimosocordiae left a comment

perimosocordiae Dec 12, 2018

wdevazelhes Dec 12, 2018

perimosocordiae Dec 12, 2018

wdevazelhes Dec 12, 2018

perimosocordiae Dec 12, 2018

wdevazelhes Dec 12, 2018

perimosocordiae Dec 12, 2018

wdevazelhes Dec 12, 2018

perimosocordiae Dec 12, 2018

wdevazelhes commented Dec 12, 2018

bellet commented Dec 12, 2018

wdevazelhes commented Dec 13, 2018 •

edited

Loading

bellet commented Dec 13, 2018 •

edited

Loading

perimosocordiae commented Dec 14, 2018

perimosocordiae commented Dec 14, 2018

bellet commented Dec 14, 2018

perimosocordiae commented Dec 14, 2018

wdevazelhes commented Dec 17, 2018

wdevazelhes commented Dec 17, 2018

perimosocordiae commented Dec 17, 2018

wdevazelhes commented Dec 19, 2018 •

edited

Loading

[MRG] Add preprocessor option #117

[MRG] Add preprocessor option #117

Conversation

wdevazelhes commented Aug 21, 2018 • edited Loading

TODO:

wdevazelhes commented Dec 6, 2018

terrytangyuan commented Dec 7, 2018

bellet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bellet commented Dec 11, 2018

perimosocordiae left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wdevazelhes commented Dec 12, 2018

bellet commented Dec 12, 2018

wdevazelhes commented Dec 13, 2018 • edited Loading

bellet commented Dec 13, 2018 • edited Loading

perimosocordiae commented Dec 14, 2018

perimosocordiae commented Dec 14, 2018

bellet commented Dec 14, 2018

perimosocordiae commented Dec 14, 2018

wdevazelhes commented Dec 17, 2018

wdevazelhes commented Dec 17, 2018

perimosocordiae commented Dec 17, 2018

wdevazelhes commented Dec 19, 2018 • edited Loading

wdevazelhes commented Aug 21, 2018 •

edited

Loading

wdevazelhes commented Dec 13, 2018 •

edited

Loading

bellet commented Dec 13, 2018 •

edited

Loading

wdevazelhes commented Dec 19, 2018 •

edited

Loading