[MRG + 1] Allow already formed tuples as an input. #92
Conversation
… Weakly Supervised Algorithms.
metric_learn/sdml.py
Outdated
# set up prior M
if self.use_cov:
  X = np.unique(pairs.reshape(-1, pairs.shape[2]), axis=0)
This is causing test failures, because the `axis` argument for `np.unique` was added in version 1.13.0, but we test with an earlier version.
Yes, indeed, I will change this to something compatible with version 1.12.1.
I'll go for this solution: https://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array/22941699#22941699
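For reference, the numpy-1.12-compatible replacement suggested in that Stack Overflow thread looks roughly like the sketch below (illustrative only, not necessarily the exact code committed in this PR; `unique_rows` is a hypothetical helper name):

```python
import numpy as np

def unique_rows(a):
    # np.unique only gained the axis= argument in numpy 1.13, so instead we
    # view each contiguous row as a single opaque (void) scalar, deduplicate
    # those scalars, and index back into the original rows.
    a = np.ascontiguousarray(a)
    void_view = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(void_view, return_index=True)
    return a[idx]

# e.g. recover the distinct points appearing in an array of formed pairs
pairs = np.array([[[0., 1.], [2., 3.]],
                  [[0., 1.], [4., 5.]]])
X = unique_rows(pairs.reshape(-1, pairs.shape[2]))  # -> 3 distinct points
```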
I also updated the docstring. Note also that for quadruplets learning, for now we need the quadruplets to be ordered so that we know which samples are more similar to others. We could also imagine letting the ordering between the first two and the last two pairs be arbitrary (most similar pair before the least similar, or the contrary), and then using a label to specify the ordering. This would be coherent with the supervision for pairs (where there is also a label of constraints), but is redundant. This can be decided later.
This looks good except for minor comments. I guess we have tests in place to confirm that these modifications did not change the output of the algorithms.
Whether the order of instances in quadruplets/triplets should be fixed or given by labels is indeed an open question. Maybe we can stay like this for now and make the final decision when implementing the score/predict functions. Having labels might be more natural in that context.
@@ -51,52 +51,65 @@ def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
     self.A0 = A0
     self.verbose = verbose
 
-  def _process_inputs(self, X, constraints, bounds):
-    self.X_ = X = check_array(X)
+  def _process_pairs(self, pairs, y, bounds):
I think it would make sense for this function `_process_pairs` to be shared across the class of pair metric learners. For instance, ruling out pairs made of identical points is useful for all algorithms.
Yes I agree, I added it to the small features TODO list at the end of the main issue: #91 (comment)
metric_learn/itml.py
Outdated
pos_no_ident = vector_norm(pos_pairs[:, 0, :] - pos_pairs[:, 1, :]) > 1e-9
pos_pairs = pos_pairs[pos_no_ident]
neg_no_ident = vector_norm(neg_pairs[:, 0, :] - neg_pairs[:, 1, :]) > 1e-9
neg_pairs = neg_pairs[neg_no_ident]
Maybe showing a warning to the user when such a pair is found and discarded would be useful. In particular, if a negative pair is made of two identical points, there is probably a problem with the way the user generated the pairs, or with the dataset.
Yes I agree. I added it to the TODO.
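For illustration, the warning suggested above could look roughly like this (a hypothetical sketch; the function name and exact message are not from the PR):

```python
import warnings
import numpy as np

def drop_identical_pairs(pairs, kind, tol=1e-9):
    # Keep only pairs whose two points differ by more than tol, and warn
    # about how many were discarded, as suggested in the review.
    distinct = np.linalg.norm(pairs[:, 0, :] - pairs[:, 1, :], axis=1) > tol
    n_dropped = int(np.sum(~distinct))
    if n_dropped:
        warnings.warn("Found and discarded %d %s pair(s) made of two identical "
                      "points; check how the pairs (or the dataset) were "
                      "generated." % (n_dropped, kind))
    return pairs[distinct]

# e.g. neg_pairs = drop_identical_pairs(neg_pairs, kind="negative")
```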
metric_learn/sdml.py
Outdated
pos_neg = c.positive_negative_pairs(num_constraints,
                                    random_state=random_state)
pairs, y = wrap_pairs(X, pos_neg)
y = 2 * y - 1
Shouldn't the pair labels be 0/1 and not -1/1?
That said, this raises the question of whether pair labels should be 0/1 or -1/1. If -1/1 is always more convenient for the algorithm implementations, we could switch to that, or potentially allow both and convert to -1/1 in the process-pairs helper function.
Yes, I agree.
Done (see commits 374a851 and b4bdec4). (Note that for now I did not try to simplify the code of the algorithms using these labels; I just added it to the new API issue's main message here: #91 (comment).)
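As a side note on the 0/1 vs -1/1 question, a process-pairs-style helper could accept both conventions and normalize to -1/1 along these lines (a sketch under that assumption; `normalize_pair_labels` is not an actual function of this PR):

```python
import numpy as np

def normalize_pair_labels(y):
    # Accept either 0/1 or -1/1 pair labels and always return -1/1.
    y = np.asarray(y)
    labels = set(np.unique(y))
    if labels <= {0, 1}:       # 0/1 convention: map 0 -> -1, 1 -> 1
        return 2 * y - 1
    if labels <= {-1, 1}:      # already -1/1
        return y
    raise ValueError("Pair labels should be 0/1 or -1/1, got %s" % sorted(labels))
```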
metric_learn/mmc.py
Outdated
pairs: array-like, shape=(n_constraints, 2, n_features)
    Array of pairs. Each row corresponds to two points.
y: array-like, of shape (n_constraints,)
    Labels of constraints. Should be 0 for dissimilar pair, 1 for similar.
Missing changes for the -1/1 labels here.
Maybe make sure to check that no change has been missed, including within the algorithms and tests.
Thanks, indeed I forgot the docstrings :p Tests should be passing (they do on my computer).
Thanks for the review. I am merging this PR; this will make it easier to review the next PR. If some changes remain to be made, they can be done later on, as we are working on a branch separate from master.
This PR changes the API for the first and easiest case of the new API to implement (see issue #91): the case where there is no preprocessor and tuples are already formed and given to the algorithm as such. The main question is for algorithms that need a covariance matrix to initialize the metric (LSML), or that compute bounds (ITML). For now I just reconstruct a dataset of points from the pairs using np.unique, in order to have the same results and pass the tests, but this is probably not the best thing to do. For SDML, I could express the term involved in the loss only with formed pairs instead of datapoints; here is a snippet that tests it: https://gist.github.com/wdevazelhes/3c349e13976613d15ebc46178c942474
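To make the workaround described above concrete, rebuilding a set of points from formed pairs in order to compute a covariance-based prior could look like the sketch below (illustrative only; note that `np.unique(..., axis=0)` needs numpy >= 1.13, hence the compatibility discussion earlier in this thread):

```python
import numpy as np

def points_from_pairs(pairs):
    # pairs has shape (n_pairs, 2, n_features); flatten the two points of each
    # pair into rows and keep only the distinct ones.
    flat = pairs.reshape(-1, pairs.shape[2])
    return np.unique(flat, axis=0)  # requires numpy >= 1.13

pairs = np.random.RandomState(42).randn(10, 2, 4)
X = points_from_pairs(pairs)
prior = np.cov(X, rowvar=False)  # covariance prior over the recovered points
```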
Left TODO: