
SCML : Sparse Compositional Metric Learning #278


Merged (71 commits) on Jun 17, 2020

Conversation

@grudloff (Contributor) commented Feb 14, 2020

Sparse Compositional Metric Learning (SCML) allows scalable learning of global, multi-task and multiple local Mahalanobis metrics for multi-class data under a unified framework based on sparse combinations of rank-one basis metrics. For this initial merge, only the global setting will be implemented.

The algorithm learns from triplets, so it is necessary to add a base class for triplet-based algorithms. For the sake of clarity, this will be added in a separate, concurrent PR.

This implementation closely follows the MATLAB implementation of Y. Shi, A. Bellet and F. Sha, "Sparse Compositional Metric Learning", AAAI Conference on Artificial Intelligence (AAAI), 2014. SCML paper

Theory - Global setting

The Mahalanobis matrix M is constructed as a nonnegative weighted sum of rank-one PSD matrices:

    M = Σᵢ wᵢ bᵢ bᵢᵀ,   with wᵢ ≥ 0

The bases are intended to be locally discriminative. In the original paper and in this implementation they are constructed by applying LDA to several local regions. Other options for constructing the bases will be added later.

The constraints are a set C of triplets (xᵢ, xⱼ, xₖ), enforced through the minimization problem:

    min_{w ≥ 0}  Σ_{(xᵢ,xⱼ,xₖ) ∈ C} [1 + d_M(xᵢ, xⱼ) − d_M(xᵢ, xₖ)]₊  +  β ‖w‖₁
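A minimal NumPy sketch of the two pieces above (random bases and weights for illustration only; the actual implementation builds the bases via LDA on local regions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_basis = 5, 20

# rank-one basis metrics b_i b_i^T (random here; the PR builds them with LDA)
B = rng.standard_normal((n_basis, n_features))
# learned weights, constrained to be nonnegative
w = rng.uniform(size=n_basis)

# M = sum_i w_i b_i b_i^T is PSD by construction
M = (B.T * w) @ B

def triplet_hinge(M, xa, xp, xn):
    """Hinge loss [1 + d_M(xa, xp) - d_M(xa, xn)]_+ for one triplet."""
    d_pos = (xa - xp) @ M @ (xa - xp)
    d_neg = (xa - xn) @ M @ (xa - xn)
    return max(0.0, 1.0 + d_pos - d_neg)
```

Since every eigenvalue of such an M is nonnegative, no projection onto the PSD cone is needed after an update.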

Advantages

  • No need to project onto the PSD cone, since M is constructed as a nonnegative sum of PSD matrices.
  • Only n_basis parameters need to be learned and stored, not counting the bases themselves.
  • Uses stochastic composite optimization, so large datasets can be handled by this algorithm. An efficient implementation of regularized dual averaging (RDA) is used for the optimization procedure.
  • Faster to train.
  • Good performance on standard datasets (see the paper for details), especially in the multi-task and local settings. A benchmark against the other algorithms in the package will be added to this PR later.
  • First triplet-based algorithm to be implemented in this package.
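As a rough sketch of the optimizer mentioned above: ℓ1-regularized dual averaging admits a closed-form, sparsity-inducing update computed from the running average gradient. The function below follows the standard ℓ1-RDA formula with an added nonnegativity constraint on the weights; parameter names and default values are illustrative, not the PR's.

```python
import numpy as np

def rda_step(avg_grad, t, lam=0.01, gamma=1.0):
    """Closed-form l1-RDA update at iteration t (illustrative parameters).

    Coordinates whose running average gradient is above -lam are set
    exactly to zero, which is what makes the learned combination sparse.
    """
    return np.maximum(0.0, -(np.sqrt(t) / gamma) * (avg_grad + lam))
```

For example, `rda_step(np.array([-0.5, 0.2]), t=4, lam=0.1)` keeps only the first coordinate active and zeroes out the second.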

Tests on Vehicles dataset
The results for the vehicles dataset found in the MATLAB implementation's GitHub repository are used to validate the consistency and correctness of the current implementation.
For each test, the algorithm was run 100 times; the mean and std of the resulting train and test accuracy, as well as the running time, are shown below each test link.
For the batch versions, the number of iterations is reduced so that each method performs the same number of gradient computations.

Test Vanilla

Time. Mean:  5.7989197754859925 std:  0.05547399180223441
Train - Accuracy. Mean:  79.3214990138067 std:  0.7612083562308343
Test - Accuracy. Mean:  77.71176470588235 std:  2.056124280977084

Test Batch

Time. Mean:  6.090397052764892 std:  0.048224950065408285
Train - Accuracy. Mean:  81.41222879684418 std:  0.9372967012148159
Test - Accuracy. Mean:  78.0 std:  1.6946894459868151

Test Adagrad

Time. Mean:  6.090397052764892 std:  0.048224950065408285
Train - Accuracy. Mean:  81.41222879684418 std:  0.9372967012148159
Test - Accuracy. Mean:  78.0 std:  1.6946894459868151

Test Batch+Adagrad

Time. Mean:  1.7653330016136168 std:  0.015989019692575546
Train - Accuracy. Mean:  81.31558185404339 std:  0.9613659886786882
Test - Accuracy. Mean:  78.03529411764706 std:  1.641208095836975

We can see that the results are almost the same, with even a slight variance improvement in test accuracy. Also, the use of mini-batches yields almost a 4-fold speedup.

Furthermore, the adagrad version converges faster than the "vanilla" version, as can be seen from the train accuracy achieved with 1/20 the number of iterations. But this comes with an apparent trade-off, as the test accuracy of the vanilla version is slightly better; this suggests it may be a good idea to allow both options.
Test Vanilla - 1/20 iterations

Time. Mean:  1.2541892004013062 std:  0.02197305793469499
Train - Accuracy. Mean:  79.0 std:  0.8888682213828464
Test - Accuracy. Mean:  77.37647058823529 std:  1.5683284028668087

Test Adagrad - 1/20 iterations

Time. Mean:  1.3263736057281494 std:  0.020214300581072035
Train - Accuracy. Mean:  80.65483234714003 std:  1.1301895624652614
Test - Accuracy. Mean:  77.0764705882353 std:  2.0688077237684213

TODO:

  • Code comments and docstrings
  • Documentation
  • Basic Tests
  • Specific Tests
  • Benchmark

@perimosocordiae (Contributor)

I suspect this will need a rebase now that the triplets PR is merged.

@grudloff (Contributor, Author) commented Mar 5, 2020

I edited the PR and added a test of this branch's current implementation on the Vehicles dataset from the MATLAB implementation repository. I will be adding tests on more datasets soon!

@grudloff (Contributor, Author)

I added a simple basis construction for the unsupervised version in the meantime. The implementation is now consistent with the other algorithm implementations, and with the new basis construction it is possible to check both the plain and supervised versions. The next step is to add specific tests and some benchmarks on other datasets!

I think the PR is now in shape for some feedback!

Comment on lines +91 to +93
X, y = make_classification(n_samples=100, n_classes=3, n_features=60,
                           n_informative=60, n_redundant=0, n_repeated=0,
                           random_state=42)
@grudloff (Contributor, Author) commented Mar 25, 2020

While making this test, I tried n_informative = 45 and n_redundant = 15, but got an empty components matrix as a result: the objective function never went below 1, the value it takes at iter == 0, so the weights kept their initialization value.

Looking at the change in the weights, it seems to be learning fine; the objective values are just bigger (it reaches a minimum of around 2), whereas in general values below 1 are reached very quickly.

I haven't looked into it deeply yet, but it seems like a situation that should be addressed.

EDIT: Never mind this comment; I noticed the issue is that the output-iteration procedure should not be entered on the first iteration. This is addressed in the next commit.

Member:

Testing class separation is nice (although it is not good here, probably due to the dimension compared to the number of data points), but I mostly meant making a test that checks that the number of bases obtained is as expected (check that the returned n_basis equals the expected one, and that the shape of the returned basis is also what we expect), parameterized over a few values of n_samples, n_features and n_classes.

(of course you should keep the two tests with toy data where you also know exactly what the basis should be, which is nice)

@grudloff (Contributor, Author) commented Mar 26, 2020

I actually intended this test only to cover the n_features > 50 case!

PS: It can actually do a lot better; it just has very bad hyperparameters!

Member:

OK, then you could keep it and just extend it to also check the number of bases and the basis shape, and parametrize it to cover a few values of the parameters mentioned above.

@grudloff (Contributor, Author):

I was already on it! Just added it in the last commit.

Member:

Perfect!
For n_samples, maybe remove 1000 to avoid slowing down the test suite too much?

Comment on lines 69 to 70
max_iter = int(self.max_iter/self.batch_size)
output_iter = int(self.output_iter/self.batch_size)
Member:

I think this is not consistent with the common semantics of "iteration": one iteration is one update of the parameters, regardless of the mini-batch size used to estimate the gradient.
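Under that semantics, a sketch of the loop (illustrative names and a dummy gradient) performs one weight update per iteration, whatever the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
max_iter, batch_size, n_basis = 10, 4, 3
w = np.zeros(n_basis)

n_updates = 0
for it in range(max_iter):  # not range(max_iter // batch_size)
    # mini-batch gradient estimate: average over batch_size sample gradients
    grad = rng.standard_normal((batch_size, n_basis)).mean(axis=0)
    w -= 0.1 * grad  # one parameter update == one iteration
    n_updates += 1
```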

@bellet (Member) commented Mar 26, 2020

Things are looking pretty good at this point. I think the priority now should be a benchmark that shows the benefit compared to existing algorithms in the package and allows us to decide on decent default parameter values.

@bellet (Member) commented Mar 27, 2020

Some additional small things to fix:

  • The algorithm crashes when output_iter is larger than max_iter, because best_w is not defined.
  • The algorithm crashes when some arguments are given with the wrong type, e.g. a float for max_iter. It would be a good idea to check that all arguments have the desired type and throw an error otherwise, along with tests.
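A minimal sketch of the suggested argument checking (the helper name and messages are illustrative, not the package's API):

```python
def check_positive_int(name, value):
    """Raise if `value` is not a positive integer (illustrative helper)."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("%s must be an integer, got %s"
                        % (name, type(value).__name__))
    if value < 1:
        raise ValueError("%s must be >= 1, got %d" % (name, value))
    return value
```

For example, `check_positive_int('max_iter', 100000)` passes, while `check_positive_int('max_iter', 1e5)` raises a TypeError because `1e5` is a float.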


# weight vector
w = np.zeros((1, n_basis))
# avarage obj gradient wrt weights
Member:

average

@terrytangyuan (Member) left a comment

+1 to merge what we have now but will wait for @perimosocordiae to decide/merge

@perimosocordiae (Contributor) left a comment

There are a few parts that need more testing, and some minor style nits here and there, but I think this is fine to merge as-is. We can clean up the remaining TODOs in future PRs.

@perimosocordiae perimosocordiae merged commit 43a60c9 into scikit-learn-contrib:master Jun 17, 2020
@perimosocordiae (Contributor)

Merged. Thanks for the huge contribution, @grudloff !

@grudloff (Contributor, Author)

This is great news, thanks a lot! I had to leave this hanging, but I definitely had in mind giving it a last push to completion once I had a bit of time. Glad you decided to merge as-is.

@bellet (Member) commented Jun 18, 2020

Congrats @grudloff!

@wdevazelhes (Member)

Congrats @grudloff !
