[WIP] Metric Learning :: NCA #4789

Closed
wants to merge 31 commits

Conversation

artsobolev
Contributor

GSoC 2015 project, Iteration 1: Neighborhood Component Analysis

For details refer to my GSoC blog, especially a post on NCA.


def nca_vectorized_oracle(X, y, params):
    dX = X[:, None, :] - X[None, :, :]  # n_samples x n_samples x n_comp
    outer = dX[:, :, None, :] * dX[:, :, :, None]  # n_samples x n_samples x n_comp x n_comp
Contributor

please use np.newaxis instead of None. What does n_comp stand for? Could you use n_features instead?

Contributor

Ah, it's n_components. OK. Please write it out.

@eickenberg
Contributor

Some very superficial comments, mostly pertaining to form while waiting for the bench results.

I'm not sure I like the name "oracle". It may also be preferable to have _precompute functions for certain aspects, but that is up for discussion.

There are indeed some heavy tensors along the way in the fully vectorized version, so later on they will need to be memory profiled in addition to the speed profiling of certain operations.

@artsobolev
Contributor Author

@eickenberg I thought the name "oracle" is pretty much standard in optimization theory (in this case we have a first-order oracle that returns the function's value and gradient at a given point).

@eickenberg
Contributor

Well, oracle rings 'black box' to me. While it is true that you are not exploiting any specific structure of the problem here, we still know the function. Keep it for the moment if you like, but bear in mind that the term is very technical and should probably in future iterations be replaced by something like cost_and_gradient for better legibility.

logp -= sp.misc.logsumexp(logp)
assert logp.shape == (n_samples, )

p = np.exp(logp) # n_samples
Contributor Author

This bothers me. Sometimes I get various numerical warnings (underflows). Should I keep the probabilities in the log domain and use the log-sum-exp trick for sums?
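
For illustration, a minimal sketch (my own, not the PR's code) of keeping the whole softmax in the log domain; Lx is assumed to be the already-transformed data of shape (n_samples, n_components):

import numpy as np
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy versions

def log_neighbor_probabilities(Lx):
    d = Lx[:, np.newaxis, :] - Lx[np.newaxis, :, :]  # pairwise differences
    neg_sq_dist = -np.einsum('ijk,ijk->ij', d, d)    # -||L x_i - L x_j||^2
    np.fill_diagonal(neg_sq_dist, -np.inf)           # p_ii is defined as 0
    # row-wise normalization done entirely in the log domain
    return neg_sq_dist - logsumexp(neg_sq_dist, axis=1, keepdims=True)

Downstream sums such as sum_j p_ij could then also be taken with logsumexp over the relevant indices, exponentiating only once at the end.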

Contributor Author

On second thought: the outer products of the Xs can be almost arbitrary.

Barmaley.exe added 5 commits May 31, 2015 19:50
Fixed a bug in fully-vectorized optimizer
Major refactoring
Compared vectorized and semivectorized NCA-assisted 1NN with 1NN with Euclidean distance on UCI Wine dataset

def nca_vectorized_oracle(X, y, n_components, loss, threshold=0.0):
    dx = X[:, np.newaxis, :] - X[np.newaxis, :, :]  # n_samples x n_samples x n_features
    outer = dx[:, :, np.newaxis, :] * dx[:, :, :, np.newaxis]  # n_samples x n_samples x n_features x n_features
Contributor

Could you check whether this full expansion is necessary? Looking at the formulas for the gradient, it seems possible to bring L inside all members of the sum. Assuming that L.shape[0] <= L.shape[1] (is there a use case for the opposite?), this has the potential to decrease the memory and CPU cycles spent on this outer expansion.
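
To make this concrete, a rough sketch (my own illustration, with W as an assumed matrix of pairwise weights from the gradient formula) of accumulating the gradient with L brought inside, so that only an (n_samples, n_samples, n_components) tensor is ever formed instead of the full outer expansion:

import numpy as np

def nca_grad_with_L_inside(X, W, L):
    # X: (n_samples, n_features), W: (n_samples, n_samples) pairwise weights,
    # L: (n_components, n_features) with n_components <= n_features.
    dx = X[:, np.newaxis, :] - X[np.newaxis, :, :]  # (n, n, n_features)
    Ldx = dx @ L.T                                  # (n, n, n_components)
    # sum_{i,k} W_ik (L x_ik) x_ik^T  ->  (n_components, n_features)
    return 2 * np.einsum('ij,ijk,ijl->kl', W, Ldx, dx)

With n_components << n_features this replaces the n_samples^2 * n_features^2 cost of the outer tensor by roughly n_samples^2 * n_components * n_features.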

Contributor

Another thing to check in addition to this is to unravel the x_ik to what they actually are: x_ik[:, np.newaxis] * x_ik = (x_i - x_k)[:, np.newaxis] * (x_i - x_k) = x_i[:, np.newaxis] * x_i + ... which yields 4 terms, all of which you can already find in your dx. This should bring down memory consumption.
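
As a (hypothetical) illustration of that expansion: writing out the four terms and grouping them lets the whole weighted sum of outer products collapse into a single quadratic form, with W again an assumed matrix of pairwise weights:

import numpy as np

def weighted_outer_sum(X, W):
    # sum_{i,k} W_ik (x_i - x_k)(x_i - x_k)^T without materializing the
    # (n, n, n_features, n_features) tensor: the four expanded terms sum to
    # X^T (D_r + D_c - W - W^T) X, with D_r = diag(W @ 1) and D_c = diag(W^T @ 1).
    S = np.diag(W.sum(axis=1)) + np.diag(W.sum(axis=0)) - W - W.T
    return X.T @ S @ X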

Contributor Author

Yes, L is always either square or rectangular with n_components < n_features. The opposite doesn't make sense, since L's rank is bounded by n_features.

What's the benefit of propagating L inside? In the case of the vectorized optimizer we would lose the ability to cache outer products. It might be useful in the semivectorized optimizer, but the effect seems useless: we'd be computing L (x - y) (x - y)^T products, which don't seem to be of any help in the case of square L (though I can compute them efficiently, since I already have L(x - y) from the probabilities).

Contributor

In my understanding, L is not necessarily square. The reformulation does nothing much for square L, but it replaces one n_features by n_components in the computational complexity, so if n_components << n_features, this should result in a speed-up.

Contributor

(And yes, caching those outer products would be out the window, but we don't know yet how useful the caching actually is, since that tensor is incredibly redundant due to its low rank. This is the point I would like to attack by evaluating selectively, sequentially.)

@eickenberg
Contributor

Could you also make a comparison of gradient descent techniques? Preferably also on several datasets and for several numbers of samples/features, so we can get a feeling for the practical complexity of the methods.

@amueller
Member

amueller commented Jun 2, 2015

ping @bmcfee ;)

@eickenberg
Contributor

cc @kastnerkyle :)

@kastnerkyle
Member

I am catching up with this now and will have some questions/comments soon. Any progress toward stochastic NCA?

@artsobolev
Contributor Author

Sorry for the silence, I was busy with my uni.

So I merged the semivectorized oracle with the stochastic one. The resulting cost function can also use a neighbors heuristic that cuts off points that are too distant.
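
For illustration, a rough sketch (mine, not necessarily how this PR implements it) of such a neighbors heuristic, keeping only the n_neighbors closest points of each sample when forming the soft-neighbor probabilities:

import numpy as np

def neighbor_mask(X, n_neighbors):
    # Boolean mask marking, for every sample, its n_neighbors closest points;
    # contributions of all other (too distant) points are dropped.
    d2 = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]  # nearest points per row
    mask = np.zeros_like(d2, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask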

I also experimented with L "propagated" inside the gradient calculation. I measured execution time for 10 initializations on 2 datasets obtained by make_classification:

[timing plot for X.shape = (300, 100)]

[timing plot for X.shape = (500, 20)]

Overall, despite line 130, the propagated version seems at least nearly as fast as the non-propagated one, or even faster in the case of low-rank L.

@kastnerkyle
Member

This is looking pretty good so far!

In the bottom-left plot (X.shape = (500, 20), m = 5), what is up with the negative time? Have we run something like line_profiler and memory_profiler to see where the bottlenecks are in this code generally? I am surprised that the times are so close, but maybe I have the wrong intuition.

Over the next weeks my idea would be to document some of the intermediate functions more (since I have little knowledge of this field, it will be a great help for me :D), compare different evaluation strategies (do we evaluate against all other samples for the gradient, or only among the samples in the batch - @eickenberg brought this up), and evaluate on some big-ish datasets that are reasonable tests for this.

Does this sound like an OK approach to everyone?

@artsobolev
Contributor Author

Oh, that's weird. I didn't notice the negative values on that plot. I use time to measure execution time, and occasionally it returns weird results. The updated plot looks like this:

[updated timing plot]

Yes, I ran the %lprun profiler in my dev IPython notebook; most of the time seems to be spent (plus precompute_distances=False) in vectorized operations like calculating outer products, logsumexp, etc.
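
For reference, a hedged sketch of how that kind of per-line profiling can be run in a notebook with line_profiler and memory_profiler (the call arguments here are placeholders, not the actual benchmark):

%load_ext line_profiler
%load_ext memory_profiler
%lprun -f nca_vectorized_oracle nca_vectorized_oracle(X, y, n_components, loss)
# %mprun needs the profiled function to be defined in a file (module), not in the notebook
%mprun -f nca_vectorized_oracle nca_vectorized_oracle(X, y, n_components, loss)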

I do test it on bigger (yet still created by means of make_classification) datasets: so far it gives bad results on big datasets (like 10,000 × 1000): the accuracy of 1-nearest neighbors is higher without NCA. As far as I observe, the loss on all initializations is close to optimal.

@jnothman
Member

Could we get a statement on what the status of this is?

  • Should it be labeled [MRG] yet? If not, what blockers remain?
  • Is @Barmaley-exe going to bring it to merge post-GSoC?

@artsobolev
Contributor Author

@jnothman

Overall, it kinda works, but I don't think it's merge-ready. It lacks documentation, tests and examples.

@jnothman
Member

Fair enough. Which of those things do you intend to do in the near future?

@terrytangyuan

My friend and I actually just published this metric-learn package to PyPI. Anyone interested in helping me integrate other methods besides NCA into scikit-learn? Any suggestions would be appreciated.

@amueller
Member

That looks quite interesting. How does the NCA there compare against this one?

@terrytangyuan

I haven't looked into the detailed implementation of NCA here yet. But besides the wide range of algorithms the metric-learn package has, it is definitely more mature and stable, with tests and examples.

@maniteja123
Contributor

Hi everyone, sorry for disturbing, but it was suggested on the mailing list that this addition is of importance. I would be really grateful for some feedback on this. I understand there is quite some math behind this algorithm, but I will try my best to understand it if this feature is needed. Thanks a lot!

@bhargavvader

bhargavvader commented Sep 8, 2016

@Barmaley-exe , are you planning on taking this up? Mind if I have a crack at it?
I can fork this and work on top of what you've already done.

Or, as @terrytangyuan suggested, we could alternatively figure out a way to migrate the methods or ideas from the metric-learn package here. Any suggestions, @amueller?

@artsobolev
Contributor Author

You're welcome to contribute.

@GaelVaroquaux
Member

Closing as overridden by #10058
