[MRG] Add preprocessor option #117


Merged

Conversation

wdevazelhes
Member

@wdevazelhes wdevazelhes commented Aug 21, 2018

This PR adds a preprocessor argument to the initialization of weakly supervised algorithms: an option that allows them to accept two types of input: either 3D arrays of tuples of points, as before, or 2D arrays of indices/identifiers of points. In the latter case, the preprocessor given as input to the metric learner allows it to retrieve points from identifiers. The preprocessor is basically a callable that is called on a list of identifiers (or just on one identifier? we'll have to decide) and returns the points associated with these identifiers.
If instead of a callable the user provides an array X, the "indexing" preprocessor will be created automatically (i.e. a preprocessor such that preprocessor(id) returns X[id]). We could also imagine other shortcuts like this (for instance, the user could provide a string path to a root folder containing images, and a preprocessor would be created such that preprocessor("img2.png") returns the vector associated with the image located at "rootfolder/img2.png")
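The "indexing" preprocessor described above can be sketched as follows (a hypothetical illustration of the idea, not the actual implementation in this PR; ArrayIndexer is an invented name):

```python
import numpy as np

class ArrayIndexer:
    """Hypothetical sketch of an 'indexing' preprocessor: wraps an
    array X so that calling it on indices returns the rows X[indices]."""
    def __init__(self, X):
        self.X = np.asarray(X)

    def __call__(self, indices):
        # numpy fancy indexing: a (n_tuples, tuple_size) index array
        # yields a (n_tuples, tuple_size, n_features) array of points
        return self.X[indices]

X = np.arange(12.).reshape(4, 3)   # 4 points in 3 dimensions
preprocessor = ArrayIndexer(X)
pairs_of_indices = np.array([[0, 2], [1, 3]])
# tuples of points recovered from tuples of indices:
pairs_of_points = preprocessor(pairs_of_indices)  # shape (2, 2, 3)
```

A metric learner could then work on the lightweight 2D index array and only materialize the actual points through the preprocessor when needed.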

Note: This PR branched from MahalanobisMixin PR, so as soon as MahalanobisMixin is merged the diff should become more readable.

TODO:

  • Add comments and more info to this PR
  • Add tests (in progress): check that the output format of check_input is as specified
    - [ ] Add documentation -> probably in another PR
  • Make it simpler with unified check_input function
  • Refactor the check_input function and its tests to be cleaner
  • Write docstrings
  • Refactor tests to have the list of metric learners at one place
  • Fix the linalg bug

William de Vazelhes added 23 commits May 25, 2018 16:25
- Make them inherit from Mahalanobis Mixin, and implement the metric_ property
- Improve metric_ property by checking if it exists and raising the appropriate warning if not
- Make tests work, by replacing metric() with metric_
# Conflicts:
#	metric_learn/base_metric.py
#	metric_learn/covariance.py
#	metric_learn/itml.py
#	metric_learn/lfda.py
#	metric_learn/lmnn.py
#	metric_learn/lsml.py
#	metric_learn/mlkr.py
#	metric_learn/mmc.py
#	metric_learn/nca.py
#	metric_learn/rca.py
#	metric_learn/sdml.py
… new check_no_attributes_set_in_init"

This new function was introduced through PR scikit-learn/scikit-learn#9450 in scikit-learn.
It also allows passing tests that would otherwise fail: having abstract attributes as properties used to throw an error, but the new test function handles this property inheritance well.
…scikit-learn#9450

Indeed, in that PR the function was modified to support Python 2. This should solve the CI error.
- Make MahalanobisMixin inherit from BaseMetricLearner to give a concrete implementation of score_pairs
- Use score_pairs to compute more easily predict
- Add docstring
- TST: for every algorithm:
   - test that using score_pairs pairwise returns a Euclidean distance matrix
   - test that score_pairs works for 3D arrays of several pairs as well as 2D arrays of one pair (and in the latter case returns only a scalar)
   - test that score_pairs always returns a finite output
- add the function and docstring
- use it for score_pairs
- TST :
   - should be finite
   - have right output dimension
   - embedding should be linear
   - should work on a toy example
- remove unusual s to test functions
- remove redundant parenthesis
- replace embed by transform, and always pass the input X when calling the function
- mutualize _transformer_from_metric not to be overwritten in MMC
- improve test_mahalanobis_mixin.test_score_pairs_pairwise according to scikit-learn-contrib#96 (comment)
- improve test_mahalanobis_mixin.check_is_distance_matrix
- correct typos and nitpicks
William de Vazelhes added 6 commits August 23, 2018 12:02
The previous error message would have said "[...], shape=(shape_of_the_2D_array_extracted_from_3D)",
but it is clearer to print the shape of the actual initial 3D array (the tuples given as input)
…xt in custom functions but to give name to check_array
@wdevazelhes
Member Author

Hi,
I think the PR should be good to merge now!
I addressed most of the comments, except for the following:

  • [MRG] Add preprocessor option #117 (review): preprocessor can be a class, which is indeed a callable but no longer a function (fn); plus, it can also be an X (array-like). But maybe in tensorflow they are not necessarily functions either? @terrytang how would you give an input which is just a numpy array in tf.Estimators? With input_fn=X, or by creating an input function from X with something like input_fn=make_an_input_function(X)?

  • [MRG] Add preprocessor option #117 (comment): I assume I fixed it by adding a comment on why X is returned twice, but I'm not sure that's what you referred to @bellet? (Note that I recently refactored build_pairs, so the comment may also be outdated)

  • [MRG] Add preprocessor option #117 (comment): I went for a detailed error message for now, but maybe in the future we could refactor with enums?

@wdevazelhes wdevazelhes changed the title [WIP] Add preprocessor option [MRG] Add preprocessor option Dec 6, 2018
@terrytangyuan
Member

@wdevazelhes Basically, input_fn is a function that returns a dictionary of features and targets. It is exposed to users so that they can transform their batches of data, avoid feeding the entire dataset at once, and speed up (asynchronous) training with a (distributed) data pipeline. It gets called to provide batches of data to each fit step, so that more flexible data types can be fed in, e.g. iterators or streaming input. For input that is not a function, there are utility methods such as numpy_input_fn and pandas_input_fn that can be used to build an input_fn, e.g.

import numpy as np
import tensorflow as tf  # TF 1.x API

age = np.arange(4) * 1.0
height = np.arange(32, 36)
x = {'age': age, 'height': height}
y = np.arange(-32, -28)

input_fn = tf.estimator.inputs.numpy_input_fn(
    x, y, batch_size=2, shuffle=False, num_epochs=1)

Basically numpy_input_fn facilitates the process of creating a function/callable from numpy arrays.
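The pattern numpy_input_fn implements (turning in-memory arrays into a callable that yields batches) can be sketched in plain numpy, leaving TensorFlow aside; make_numpy_input_fn below is a hypothetical name for illustration only:

```python
import numpy as np

def make_numpy_input_fn(x, y, batch_size=2):
    """Hypothetical sketch of the numpy_input_fn pattern:
    wrap in-memory arrays in a callable that yields batches."""
    def input_fn():
        for start in range(0, len(y), batch_size):
            sl = slice(start, start + batch_size)
            # slice every feature column and the targets the same way
            features = {name: arr[sl] for name, arr in x.items()}
            yield features, y[sl]
    return input_fn

age = np.arange(4) * 1.0
height = np.arange(32, 36)
x = {'age': age, 'height': height}
y = np.arange(-32, -28)

input_fn = make_numpy_input_fn(x, y, batch_size=2)
batches = list(input_fn())  # two (features, targets) batches of size 2
```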

Regarding "I went for a detailed error message for now, but maybe in the future we could refactor with enums ?" - sure, sounds good.

Member

@bellet bellet left a comment

LGTM!

"""
self.num_dims = num_dims
self.A0 = A0
self.epsilon = epsilon
self.alpha = alpha
self.max_iter = max_iter
super(MLKR, self).__init__(preprocessor)

def _process_inputs(self, X, y):
Member

Looks like you forgot to remove this function after putting things in fit

Member Author

That's right, thanks :p Done

@bellet
Member

bellet commented Dec 11, 2018

I would say that this is ready to merge, unless @perimosocordiae has comments?

Contributor

@perimosocordiae perimosocordiae left a comment

I have a few very minor comments, but overall LGTM as well.

This is a massive feature, great job @wdevazelhes !

Equivalent of `check_array` in scikit-learn.
def check_input(input_data, y=None, preprocessor=None,
type_of_inputs='classic', tuple_size=None, accept_sparse=False,
dtype="numeric", order=None,
Contributor

Super nitpick: use single quotes for dtype='numeric' to keep consistency with the rest of the file.

Member Author

I agree, thanks
Done

args_for_sk_checks['ensure_min_features'],
context))
# normally we don't need to check_tuple_size too because tuple_size
# should'nt be able to be modified by any preprocessor
Contributor

typo: shouldn't

Member Author

Thanks, done

# normally we don't need to check_tuple_size too because tuple_size
# should'nt be able to be modified by any preprocessor
if input_data.ndim != 3: # we have to ensure this because check_array
# above does not
Contributor

This is a weird comment line-break. I'd prefer moving the whole comment either above or below the if input_data.ndim != 3: line.

Member Author

I agree, done

def preprocess_points(points, preprocessor):
"""form points if there is a preprocessor else keep them as such (assumes
that check_points has already been called)"""
print("Preprocessing points...")
Contributor

Remove this print call.

Member Author

I agree it is not very useful (and not well formatted)

ensure_2d=False, allow_nd=True)
quadruplets = check_tuples(quadruplets)
quadruplets = self._prepare_inputs(quadruplets,
type_of_inputs='tuples')
Contributor

It seems that _prepare_inputs isn't user-facing. As long as users won't commonly need to pass an enum as an argument in the public API, I'm +1 for using an enum.

In any case, we should definitely raise an error if an unexpected value is passed.

@wdevazelhes
Member Author

Thanks @perimosocordiae ! I addressed your comments.

Regarding _prepare_inputs: indeed, it is not user-facing for now, because its arguments are not copied from check_input (there are *args and **kwargs instead, and consequently no docstring), and it is private. check_input is for now the more user-facing function. But in the meantime, _prepare_inputs would be the main method to use for developers/users who want to create their own custom estimator. So I'm wondering if we shouldn't replace _prepare_inputs with a public prepare_inputs method where the arguments of check_input and their docstring would be copied? And therefore keep using a string as a flag too ('classic'/'tuples')?

@bellet
Member

bellet commented Dec 12, 2018

This may be a stupid question but why do we need 2 functions?

@wdevazelhes
Member Author

wdevazelhes commented Dec 13, 2018

I guess maybe we don't necessarily need 2 functions indeed. Originally I thought of check_input as being similar to check_array and check_X_y, which are functions that can be used (and tested) without instantiating any estimator. _prepare_inputs, on the contrary, is a BaseMetricLearner method that at the same time does self.check_preprocessor and calls check_input filled with all the parameters that can be inferred from knowing we are inside some BaseMetricLearner (by the way, for now we still have to specify type_of_inputs, but a simple refactoring distinguishing SupervisedMetricLearner and WeaklySupervisedMetricLearner would allow us to fill in this argument too).
But after all, maybe we don't really need check_input on its own (I am not sure users would really want to use it as such, and for the tests we can always create some fake metric learners)?

Having only the method _prepare_inputs would indeed simplify the code (no intermediate layer of arguments to pass; also, knowing that we are necessarily inside a BaseMetricLearner would simplify the error messages (and their testing), since for now we can pass a string or None into check_estimator(estimator=...); and probably other simplifications).

I guess the question is what is best: the (maybe not so useful) ability to use check_input on its own without any estimator, or the simplification of the code?

In any case this is quite a contained part of the code, so it could be refactored at any time.
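The two-layer design discussed in this thread (a standalone checker plus an estimator method that delegates to it) can be sketched as follows (heavily simplified; this is not the actual metric-learn code, just an illustration of the structure):

```python
import numpy as np

def check_input(input_data, type_of_inputs='classic', preprocessor=None):
    """Standalone checker: usable (and testable) without any estimator,
    analogous in spirit to scikit-learn's check_array."""
    if type_of_inputs not in ('classic', 'tuples'):
        raise ValueError("Unknown type_of_inputs: %r" % (type_of_inputs,))
    input_data = np.asarray(input_data)
    if preprocessor is not None:
        # resolve identifiers into actual points
        input_data = preprocessor(input_data)
    return input_data

class BaseMetricLearner:
    def __init__(self, preprocessor=None):
        self.preprocessor = preprocessor

    def _prepare_inputs(self, input_data, type_of_inputs='classic'):
        # fills in the estimator-specific arguments, then delegates
        return check_input(input_data,
                           type_of_inputs=type_of_inputs,
                           preprocessor=self.preprocessor)

learner = BaseMetricLearner()
checked = learner._prepare_inputs([[1., 2.], [3., 4.]])
```

Dropping the standalone layer would fold the checker's body into the method; keeping both trades a little duplication for independent testability.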

@bellet
Member

bellet commented Dec 13, 2018

OK, maybe it makes sense to have two functions then. check_input may indeed be used by users while _prepare_input is for developers so should probably remain private?

@perimosocordiae perimosocordiae merged commit 010b34a into scikit-learn-contrib:new_api_design Dec 14, 2018
@perimosocordiae
Contributor

In it goes!

@perimosocordiae
Contributor

@wdevazelhes this merged into the new_api_design branch. Can you make a PR to merge that branch back into master?

@bellet
Member

bellet commented Dec 14, 2018

Shouldn't we write the docs first (#133)? This branch introduces quite a few compatibility breaks.

@perimosocordiae
Contributor

Ah, I didn't notice that gh-133 was off of the new_api_design branch as well. Typically I prefer to do all PRs against master to avoid lots of conflict resolution down the road.

Re: compatibility breaks, users building from master should be ready for breaking changes from time to time. We'll need to make sure the docs are sufficient before cutting the next release, though.

@wdevazelhes
Member Author

OK, maybe it makes sense to have two functions then. check_input may indeed be used by users while _prepare_input is for developers so should probably remain private?
--

Yes, that's true; let's keep it this way for now and see how it goes.

@wdevazelhes
Member Author

Ah, I didn't notice that gh-133 was off of the new_api_design branch as well. Typically I prefer to do all PRs against master to avoid lots of conflict resolution down the road.
Re: compatibility breaks, users building from master should be ready for breaking changes from time to time. We'll need to make sure the docs are sufficient before cutting the next release, though.

There's also the fact that the API has changed quite a bit, so users would really need some documentation on how to use the preprocessor, etc.
But maybe we can try to merge the doc soon, even if it is not totally complete with all the examples? This way we would benefit from users building from source who would try out the new API and file bugs, etc.

@perimosocordiae
Contributor

Yes, I'd prefer to get master up to date sooner rather than later. Docs are important, and we definitely need good examples before we cut a release, but I want to make sure we're stable on master for a reasonable amount of time before releasing a new version.

@wdevazelhes
Member Author

wdevazelhes commented Dec 19, 2018

I just pushed a minimal version of the documentation in PR #133 (see comment #133 (comment)); I think this could be merged into branch new_api before the end of the week maybe, so that we could then merge new_api into master.
