[MRG + 1] Allow already formed tuples as an input. #92

Conversation

@wdevazelhes (Member) commented on May 17, 2018:

This PR changes the API for the first and easiest case of the new API to implement (see issue #91): the case where there is no preprocessor and the tuples are already formed and given to the algorithm as such. The main question concerns algorithms that need a covariance matrix to initialize the metric (LSML) or that compute bounds (ITML). For now I simply reconstruct a dataset of points from the pairs using np.unique, in order to get the same results and pass the tests, but this is probably not the best thing to do. For SDML, I could express the term involved in the loss using only the formed pairs instead of the data points; here is a snippet that tests it: https://gist.github.com/wdevazelhes/3c349e13976613d15ebc46178c942474
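
For illustration, a minimal sketch of the np.unique workaround described above (variable names are mine, not the PR's actual code); it recovers a plain set of points from the pairs so that a covariance prior can be computed:

import numpy as np

pairs = np.random.randn(10, 2, 3)                # (n_constraints, 2, n_features)
# Flatten the pairs into a (2 * n_constraints, n_features) array of points
# and deduplicate rows to recover a dataset of distinct points.
X = np.unique(pairs.reshape(-1, pairs.shape[2]), axis=0)  # needs NumPy >= 1.13
prior = np.atleast_2d(np.cov(X, rowvar=False))   # covariance of the recovered points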

Left TODO:

  • Update the docstrings of the methods whose signature changed

@wdevazelhes wdevazelhes changed the title Allow already formed tuples as an input. [WIP] Allow already formed tuples as an input. May 17, 2018

@wdevazelhes wdevazelhes changed the title [WIP] Allow already formed tuples as an input. [MRG] Allow already formed tuples as an input. May 18, 2018
@wdevazelhes wdevazelhes changed the title [MRG] Allow already formed tuples as an input. [WIP] Allow already formed tuples as an input. May 18, 2018
# set up prior M
if self.use_cov:
X = np.unique(pairs.reshape(-1, pairs.shape[2]), axis=0)
Contributor commented:

This is causing test failures: the axis argument of np.unique was added in NumPy 1.13.0, but we test with an earlier version.

@wdevazelhes (Member Author) replied:

Yes, indeed, I will change this to something compatible with NumPy 1.12.1.
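
One workaround compatible with older NumPy is the usual row-as-void-scalar trick; a sketch (unique_rows is a hypothetical helper, not this PR's code):

import numpy as np

def unique_rows(a):
    # np.unique gained the axis argument only in NumPy 1.13; for earlier
    # versions, view each C-contiguous row as one opaque void scalar so
    # np.unique deduplicates whole rows at once.
    a = np.ascontiguousarray(a)
    row_view = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(row_view, return_index=True)
    return a[np.sort(idx)]

X = unique_rows(pairs.reshape(-1, pairs.shape[2]))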


@wdevazelhes (Member Author) commented on May 22, 2018:

I also updated the docstrings. Note also that for quadruplet learning, we currently need the quadruplets to be ordered so that we know which samples are more similar to which. We could also allow either ordering between the first two and the last two points (most similar pair before the least similar one, or the contrary) and use a label to specify the ordering. This would be consistent with the supervision for pairs (which also uses a label per constraint), but it is redundant. This can be decided later.
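
To make the ordering convention concrete, a small illustrative example (assuming the first two points of each quadruplet form the pair that should be the more similar one; this is not code from the PR):

import numpy as np

quadruplets = np.random.randn(5, 4, 2)   # (n_constraints, 4, n_features)
closer_pairs = quadruplets[:, :2, :]     # assumed to be the more similar pairs
farther_pairs = quadruplets[:, 2:, :]    # assumed to be the less similar pairs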

@wdevazelhes wdevazelhes changed the title [WIP] Allow already formed tuples as an input. [MRG] Allow already formed tuples as an input. May 22, 2018
@bellet (Member) left a comment:

This looks good except for minor comments. I guess we have tests in place to confirm that these modifications do not change the output of the algorithms.

Whether the order of instances in quadruplets/triplets should be fixed or given by labels is indeed an open question. Maybe we can stay like this for now and make the final decision when implementing the score/predict functions; having labels might be more natural in that context.

@@ -51,52 +51,65 @@ def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
self.A0 = A0
self.verbose = verbose

def _process_inputs(self, X, constraints, bounds):
self.X_ = X = check_array(X)
def _process_pairs(self, pairs, y, bounds):
Member commented:

I think it would make sense for this _process_pairs function to be shared across the class of pair metric learners. For instance, ruling out pairs made of identical points is useful for all algorithms.

@wdevazelhes (Member Author) replied:

Yes, I agree; I added it to the small-features TODO list at the end of the main issue: #91 (comment).
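
A hypothetical sketch of what such sharing could look like (a base class for pair learners hosting the common validation; none of these names come from the PR):

import numpy as np
from sklearn.utils import check_array

class _PairsMixin(object):
    # Hypothetical mixin gathering logic common to all pair metric learners.
    def _process_pairs(self, pairs, y):
        pairs = check_array(pairs, allow_nd=True)
        if pairs.ndim != 3 or pairs.shape[1] != 2:
            raise ValueError("Expected an array of shape (n_constraints, 2, n_features).")
        # Filtering out pairs made of identical points (useful for every
        # algorithm, as discussed above) would also live here.
        return pairs, np.asarray(y)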

pos_no_ident = vector_norm(pos_pairs[:, 0, :] - pos_pairs[:, 1, :]) > 1e-9
pos_pairs = pos_pairs[pos_no_ident]
neg_no_ident = vector_norm(neg_pairs[:, 0, :] - neg_pairs[:, 1, :]) > 1e-9
neg_pairs = neg_pairs[neg_no_ident]
Member commented:

Maybe showing a warning to the user when such a pair is found and discarded would be useful. In particular, if a negative pair is made of two identical points, there is probably a problem with the way the user generated the pairs, or with the dataset.

@wdevazelhes (Member Author) replied:

Yes, I agree. I added it to the TODO.
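
A possible implementation of the suggested warning (illustrative only; the helper name and threshold are assumptions):

import warnings
import numpy as np

def drop_identical_pairs(pairs, y, tol=1e-9):
    # Discard pairs whose two points are numerically identical and warn,
    # since such pairs (especially negative ones) usually indicate a problem
    # in how the pairs or the dataset were generated.
    identical = np.linalg.norm(pairs[:, 0, :] - pairs[:, 1, :], axis=1) <= tol
    if identical.any():
        warnings.warn("Found and discarded %d pairs of identical points; "
                      "check how the pairs were generated." % identical.sum())
    return pairs[~identical], y[~identical]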

pos_neg = c.positive_negative_pairs(num_constraints,
random_state=random_state)
pairs, y = wrap_pairs(X, pos_neg)
y = 2 * y - 1
Member commented:

Shouldn't the pair labels be 0/1 rather than -1/1?

That said, this raises the question of whether pair labels should be 0/1 or -1/1. If -1/1 is always more convenient for the algorithm implementations, we could switch to that, and potentially allow both and convert to -1/1 in the pair-processing helper function.

@wdevazelhes (Member Author) replied on May 24, 2018:

Yes, I agree.
Done (see commits 374a851 and b4bdec4). Note that for now I did not try to simplify the code of the algorithms using these labels; I just added that to the main message of the new API issue: #91 (comment).
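
A sketch of the kind of conversion that could live in the pair-processing helper (hypothetical function; accepts 0/1 or -1/1 labels and returns -1/1):

import numpy as np

def check_pair_labels(y):
    y = np.asarray(y)
    labels = set(np.unique(y))
    if labels <= {0, 1}:
        y = 2 * y - 1            # convert 0/1 labels to -1/1
    elif not labels <= {-1, 1}:
        raise ValueError("Pair labels must be in {0, 1} or {-1, 1}.")
    return y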

pairs: array-like, shape=(n_constraints, 2, n_features)
Array of pairs. Each row corresponds to two points.
y: array-like, of shape (n_constraints,)
Labels of constraints. Should be 0 for dissimilar pair, 1 for similar.
Member commented:

Missing changes for the -1/1 labels here.
Maybe make sure no change has been missed, including within the algorithms and the tests.

@wdevazelhes (Member Author) replied on May 24, 2018:

Thanks, indeed I forgot the docstrings :p Tests should be passing (they do on my machine).

@wdevazelhes (Member Author) commented:
Thanks for the review. I am merging this PR; this will make the next PR easier to review. If some changes remain to be made, they can be addressed later, since we are working on a branch separate from master.

@wdevazelhes wdevazelhes changed the title [MRG] Allow already formed tuples as an input. [MRG + ] Allow already formed tuples as an input. May 25, 2018
@wdevazelhes wdevazelhes changed the title [MRG + ] Allow already formed tuples as an input. [MRG + 1] Allow already formed tuples as an input. May 25, 2018
@wdevazelhes wdevazelhes merged commit 13f1535 into scikit-learn-contrib:new_api_design May 25, 2018
@wdevazelhes wdevazelhes deleted the new_api_fresh_start branch August 22, 2018 06:50