-
Notifications
You must be signed in to change notification settings - Fork 228
[MRG] Add preprocessor option #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Add preprocessor option #117
Conversation
- Make them inherit from Mahalanobis Mixin, and implement the metric_ property - Improve metric_ property by checking if it exists and raising the appropriate warning if not - Make tests work, by replacing metric() with metric_
# Conflicts: # metric_learn/base_metric.py # metric_learn/covariance.py # metric_learn/itml.py # metric_learn/lfda.py # metric_learn/lmnn.py # metric_learn/lsml.py # metric_learn/mlkr.py # metric_learn/mmc.py # metric_learn/nca.py # metric_learn/rca.py # metric_learn/sdml.py
… new check_no_attributes_set_in_init" This new function was introduced through PR scikit-learn/scikit-learn#9450 in scikit-learn. It also allows to pass tests that would otherwise not pass: indeed having abstract attributes as properties threw an error. But the new test functions handles well this property inheritance.
…scikit-learn#9450 Indeed, in the PR this function is modified to support python 2. This should solve the CI error.
…() in MahalanobisMixin
… computed from the transformer
- Make MahalanobisMixin inherit from BaseMetricLearner to give a concrete implementation of score_pairs - Use score_pairs to compute more easily predict - Add docstring - TST: for every algorithm: - test that using score_pairs pairwise returns an euclidean distance matrix - test that score_pairs works for 3D arrays of several pairs as well as 2D arrays of one pair (and there returns only a scalar) - test that score_pairs always returns a finite output
- add the function and docstring - use it for score_pairs - TST : - should be finite - have right output dimension - embedding should be linear - should work on a toy example
- remove unusual s to test functions - remove redundant parenthesis
- replace embed by transform and add always the input X in calling the function - mutualize _transformer_from_metric not to be overwritten in MMC - improve test_mahalanobis_mixin.test_score_pairs_pairwise according to scikit-learn-contrib#96 (comment) - improve test_mahalanobis_mixin.check_is_distance_matrix - correct typos and nitpicks
The previous error message would have said "[...], shape=(shape_of_the_2D_array_extracted_from_3D)" But it is clearer to print the shape of the actual 3D initial array (tuples in input)
…xt in custom functions but to give name to check_array
…nts should follow expression'
…est_same_with_or_without_preprocessor_classic
Hi,
|
@wdevazelhes Basically
Basically Regarding "I went for a detailed error message for now, but maybe in the future we could refactor with enums ?" - sure, sounds good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
metric_learn/mlkr.py
Outdated
""" | ||
self.num_dims = num_dims | ||
self.A0 = A0 | ||
self.epsilon = epsilon | ||
self.alpha = alpha | ||
self.max_iter = max_iter | ||
super(MLKR, self).__init__(preprocessor) | ||
|
||
def _process_inputs(self, X, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you forgot to remove this function after putting things in fit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, thanks :p Done
I would say that this is ready to merge, unless @perimosocordiae has comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few very minor comments, but overall LGTM as well.
This is a massive feature, great job @wdevazelhes !
metric_learn/_util.py
Outdated
Equivalent of `check_array` in scikit-learn. | ||
def check_input(input_data, y=None, preprocessor=None, | ||
type_of_inputs='classic', tuple_size=None, accept_sparse=False, | ||
dtype="numeric", order=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nitpick: use single quotes for dtype='numeric'
to keep consistency with the rest of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, thanks
Done
metric_learn/_util.py
Outdated
args_for_sk_checks['ensure_min_features'], | ||
context)) | ||
# normally we don't need to check_tuple_size too because tuple_size | ||
# should'nt be able to be modified by any preprocessor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: shouldn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done
metric_learn/_util.py
Outdated
# normally we don't need to check_tuple_size too because tuple_size | ||
# should'nt be able to be modified by any preprocessor | ||
if input_data.ndim != 3: # we have to ensure this because check_array | ||
# above does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a weird comment line-break. I'd prefer moving the whole comment either above or below the if input_data.ndim != 3:
line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, done
metric_learn/_util.py
Outdated
def preprocess_points(points, preprocessor): | ||
"""form points if there is a preprocessor else keep them as such (assumes | ||
that check_points has already been called)""" | ||
print("Preprocessing points...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this print call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is not very useful (and not well formatted)
ensure_2d=False, allow_nd=True) | ||
quadruplets = check_tuples(quadruplets) | ||
quadruplets = self._prepare_inputs(quadruplets, | ||
type_of_inputs='tuples') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that _prepare_inputs
isn't user-facing. So long as users won't need to (commonly) pass an enum as an argument in the public API, I'm +1 for using an enum.
In any case, we should definitely raise an error if an unexpected value is passed.
Thanks @perimosocordiae ! I addressed your comments. Regarding |
This may be a stupid question but why do we need 2 functions? |
I guess maybe we don't necessarily need 2 functions indeed. Originally I thought of Having only the method I guess the questions is: what is best: the "feature" (maybe not so useful) to be able to use In any case this is quite a concentrated part of the code so it could be refactored at any time |
OK, maybe it makes sense to have two functions then. |
In it goes! |
@wdevazelhes this merged into the new_api_design branch. Can you make a PR to merge that branch back into master? |
Shouldn't we write the doc first (#133)? This branch introduces quite a few compatibility breaks |
Ah, I didn't notice that gh-133 was off of the new_api_design branch as well. Typically I prefer to do all PRs against master to avoid lots of conflict resolution down the road. Re: compatibility breaks, users building from master should be ready for breaking changes from time to time. We'll need to make sure the docs are sufficient before cutting the next release, though. |
Yes that's true, let's keep this for now and see later how it goes maybe |
There's also the fact that the API is quite changed so users would really need some doc informations on how to use the preprocessor etc |
Yes, I'd prefer to get master up to date sooner than later. Docs are important, and we definitely need good examples before we cut a release, but I want to make sure we're stable on master for a reasonable amount of time before releasing a new version. |
I just pushed a minimal version of the documentation in PR #133 (see comment #133 (comment)), I think this could be merged into branch |
This PR adds an argument
preprocessor
to weakly supervised algorithms initialization: an option that allows them to accept two types of input: either 3D arrays of tuples of points as before, or 2D arrays of indices/identifiers of points. In the latter case, the preprocessor given as input to the Metric Learner will allow to retreive points from identifiers. The preprocessor is basically a callable that is called on a list of identifiers (or just on one identifier ? we'll have to decide), and that returns the points associated to this identifiers.If instead of a callable the user provides an array X, it will automatically create the "indexing" preprocessor (i.e. a preprocessor such that preprocessor(id) returns X[id]). We could also imagine other shortcuts like this (for instance the user could provide a string of a root folder containing images and then it would create a preprocessor such that preprocessor("img2.png") would return the vector associated to the image located at "rootfolder/img2.png")
Note: This PR branched from MahalanobisMixin PR, so as soon as MahalanobisMixin is merged the diff should become more readable.
TODO:
check_input
is as specified-
[ ] Add documentation-> probably in another PRcheck_input
functioncheck_input
function and its tests to be cleaner