[MRG] FIX LMNN gradient and cost function #201
Conversation
You must be careful when comparing with scikit-learn/scikit-learn#8602, because that version uses a squared hinge loss, which obviously changes the cost function and its gradient.
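(For concreteness, using standard definitions rather than anything quoted from this thread: writing $[v]_+ = \max(0, v)$ for the positive part of a margin violation $v$,

$$\text{hinge: } [v]_+,\quad \frac{d}{dv}[v]_+ = \mathbb{1}[v > 0]; \qquad \text{squared hinge: } [v]_+^2,\quad \frac{d}{dv}[v]_+^2 = 2\,[v]_+,$$

so both the objective values and the gradients differ between the two formulations.)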
But I agree that there seems to be a problem with both the gradient and the cost: L should indeed appear in the gradient, as the objective is quadratic in L, and the cost should indeed sum margin violations, not count them.
I realized that in fact there was just one matrix product with L missing; I added it, and also added a toy example for the loss function.
metric_learn/lmnn.py (outdated diff):

```diff
@@ -191,10 +191,12 @@ def _loss_grad(self, X, L, dfG, impostors, it, k, reg, target_neighbors, df,
   # do the gradient update
   assert not np.isnan(df).any()
   G = dfG * reg + df * (1 - reg)
+  grad = 2 * L.dot(G)
```
I put the multiplication by 2 here so that we can test the real gradient using `check_grad`.
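A minimal sketch of what such a check looks like with `scipy.optimize.check_grad` (the quadratic toy objective below is illustrative, not the actual metric-learn test):

```python
import numpy as np
from scipy.optimize import check_grad

M = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # fixed symmetric matrix standing in for G

def fun(flat_L):
    L = flat_L.reshape(2, 2)
    return np.trace(L.dot(M).dot(L.T))  # objective quadratic in L

def grad(flat_L):
    L = flat_L.reshape(2, 2)
    return (2 * L.dot(M)).ravel()  # d/dL tr(L M L^T) = 2 L M for symmetric M

rng = np.random.RandomState(42)
print(check_grad(fun, grad, rng.randn(4)))  # error should be tiny (~1e-6)
```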
```diff
@@ -191,10 +191,12 @@ def _loss_grad(self, X, L, dfG, impostors, it, k, reg, target_neighbors, df,
   # do the gradient update
   assert not np.isnan(df).any()
   G = dfG * reg + df * (1 - reg)
+  grad = 2 * L.dot(G)
   # compute the objective function
   objective = total_active * (1 - reg)
```
In fact this was right: it's the term coming from the "+1" that is in the hinge loss (cf. the question from #46).
I guess it is ready to merge now.
It's been a while since I've read the LMNN paper or looked at the shogun implementation, so I'm not sure if this is a correct change or not. I suspect that there are a few different LMNN formulations floating around the web at this point, with slightly different characteristics. For example, I think the scikit-learn PR's implementation is using a modification of the original loss to make it more amenable for optimization using scipy.optimize.
Empirically, how does this PR affect the behavior of LMNN on iris / make_blobs / sandwich / etc.?
test/metric_learn_test.py (outdated diff):

```python
# want to have the same result when applying the function twice
return lmnn._loss_grad(X, L.reshape(-1, X.shape[1]), dfG, impostors,
                       1, k, reg, target_neighbors, df.copy(),
                       list(a1), list(a2))[0].ravel()
```
This test seems particularly brittle, as it replicates so much LMNN functionality. Ideally we could structure the LMNN code to make this simpler, but at the least we should avoid copy-paste here:

```python
def loss_grad(flat_L):
    return lmnn._loss_grad(X, flat_L.reshape(...), ...)

fun = lambda x: loss_grad(x)[1]
grad = lambda x: loss_grad(x)[0].ravel()
```
Yes, you're right, let's go for this clearer expression.
done
(note: I didn't use the lambda expressions since flake8 returned an error, but maybe it's something we want to ignore and still go for the lambda?)
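(For reference, a def-based form that flake8 accepts — its E731 rule discourages assigning a lambda to a name — assuming the same `loss_grad` helper as in the suggestion above:

```python
def fun(x):
    return loss_grad(x)[1]

def grad(x):
    return loss_grad(x)[0].ravel()
```
)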
```python
(np.array([[0], [1], [2], [3]]),
 [1, 0, 0, 1], 26.)])
def test_toy_ex_lmnn(X, y, loss):
  """Test that the loss gives the right result on a toy example"""
```
Where did you get these values from?
I computed them by hand, according to the paper (note that it's only the loss function, and that I set L to the identity, so the distances are easy to get from this 1D example).
Maybe I can add a small detail of the computation as a comment?
But I agree I haven't been through every computation of the algorithm: I just checked more or less that the loss is correct at initialization on the toy examples; I didn't check the part on recomputing which constraints are active after each iteration, etc.
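(For what it's worth, here is how such a hand computation could go for the 1D example above — my own reconstruction, assuming k = 1, L = identity, and an equal weight of 1/2 on the pull and push terms:

```python
# X = [0, 1, 2, 3], y = [1, 0, 0, 1], k = 1, L = identity (assumed setup).
# Target neighbors (nearest same-class point) and squared distances:
#   0 -> 3 (d2 = 9), 1 -> 2 (d2 = 1), 2 -> 1 (d2 = 1), 3 -> 0 (d2 = 9)
pull = 9 + 1 + 1 + 9                  # = 20
# Push term: hinges max(0, 1 + d2(i, target) - d2(i, impostor)):
#   i=0: (1+9-1) + (1+9-4)            i=1: (1+1-1) + max(0, 1+1-4)
#   i=2: (1+1-1) + max(0, 1+1-4)      i=3: (1+9-1) + (1+9-4)
push = (9 + 6) + 1 + 1 + (9 + 6)      # = 32
loss = 0.5 * pull + 0.5 * push        # = 26, matching the test value
```
)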
For iris, we don't see a big difference:

```python
from sklearn.datasets import load_iris
from metric_learn import LMNN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
red_dim = make_pipeline(LMNN(), TSNE())
X_e = red_dim.fit_transform(X, y)
plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()
```
On this example (inspired by #180, i.e. a dataset with two informative features and large noise on the other features), the difference is pretty clear:

```python
from sklearn.utils import shuffle
from sklearn.datasets import make_classification
from metric_learn import LMNN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=42, n_redundant=0, shuffle=False)
X[:, 2:] *= 10
X, y = shuffle(X, y)
red_dim = make_pipeline(LMNN(), TSNE())
X_e = red_dim.fit_transform(X, y)
plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()
```
Sorry, my last commit was for the wrong branch, I'll revert it.
This reverts commit 562f33b.
This makes me think that we should fix a random seed in the sandwich example, so that it is stable when run several times?
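(Something as simple as this at the top of the example script would do — a sketch; the exact placement in the sandwich example is assumed:

```python
import numpy as np

np.random.seed(42)  # fix the global RNG so repeated runs give the same figure
```
)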
What about the problem in the objective function?
You mean the one with the count being added? I think it's maybe normal in the end (see comment #201 (comment)). In fact, this count should be the sum of all the "+ 1" (margin) terms of the hinge losses that are not 0 in the total sum. (But I didn't check yet whether …
```diff
-objective += G.flatten().dot(L.T.dot(L).flatten())
-return G, objective, total_active, df, a1, a2
+objective += G.flatten().dot(L.flatten())
+return 2 * G, objective, total_active, df, a1, a2
```
I guess the factor 2 comes from differentiating `L.T.dot(L)`. It would be clearer to put it directly in `G` then.
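(The matrix-calculus step behind that factor is the standard identity, assuming the inner matrix $M$ — here the `dfG * reg + df * (1 - reg)` combination — is symmetric:

$$\frac{\partial}{\partial L}\operatorname{tr}\!\left(L M L^{\top}\right) = L\left(M + M^{\top}\right) = 2\,L M .$$
)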
The hinge loss is equal to max(0, 1 - a) (in the case of LMNN, a is the distance to the impostor minus the distance to the target neighbor).
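(For reference, the full objective from the LMNN paper (Weinberger & Saul), in the notation I remember rather than as quoted anywhere in this thread:

$$\varepsilon(L) = (1-\mu)\sum_{j \rightsquigarrow i} \lVert L(x_i - x_j)\rVert^2 + \mu \sum_{j \rightsquigarrow i}\sum_{l} (1 - y_{il})\,\bigl[\,1 + \lVert L(x_i - x_j)\rVert^2 - \lVert L(x_i - x_l)\rVert^2\,\bigr]_+ ,$$

where $j \rightsquigarrow i$ means $x_j$ is a target neighbor of $x_i$, $y_{il} = 1$ iff $y_i = y_l$, and $[z]_+ = \max(z, 0)$.)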
If I'm not mistaken, you should compute something like

```python
violation1 = np.maximum(0, g1 - g0)  # elementwise max, hence np.maximum
violation2 = np.maximum(0, g2 - g0)
total_active += (violation1 > 0).sum() + (violation2 > 0).sum()
total_violation += violation1.sum() + violation2.sum()
```

and replace `objective = total_active * (1 - reg)` by `objective = total_violation * (1 - reg)` (you can keep using `total_active` as an interesting quantity to monitor, though).
You can run some executions showing how the objective function evolves, to make sure it goes down at each iteration when a small enough step size is used, and compare with the version in master (the number of active constraints should tend to go down, but not at each iteration).
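(A minimal sketch of that kind of check — all names below are placeholders, not metric-learn's API:

```python
import numpy as np

def objective_and_grad(L):
    # placeholder for the real LMNN loss/gradient computation
    return np.sum(L ** 2), 2 * L

L = np.eye(4)
step_size = 1e-3
objectives = []
for it in range(100):
    obj, grad = objective_and_grad(L)
    objectives.append(obj)
    L = L - step_size * grad  # plain gradient step
# with a small enough step, the objective should be non-increasing
assert all(b <= a + 1e-12 for a, b in zip(objectives, objectives[1:]))
```
)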
I was not thinking that the "+1" alone was the value of the hinge loss; I was thinking it was the "+1" part of the hinge loss (sum(1 - a) = sum(1) - sum(a) = total_active - sum(a)). I had the impression that the "df" part later (some outer products), with the operations involving L, accounted for the sum of the "a"s (distance to the impostor minus the distance to the target neighbor), so that in total there was something like:
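(my reconstruction of the decomposition being described, with $a$ the margin difference of an active triplet:)

$$\sum_{\text{active}} (1 - a) \;=\; \underbrace{N_{\text{active}}}_{\texttt{total\_active}} \;-\; \sum_{\text{active}} a,$$

i.e. the full sum of the active hinge losses.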
Yes, good idea, I'll do that.
Oops, in fact I introduced this bug in PR #101: I forgot to multiply by L when applying the update... See comments https://github.com/metric-learn/metric-learn/pull/101/files#r288573192 and https://github.com/metric-learn/metric-learn/pull/101/files#r288573007
OK, fair enough: the objective may be correct, but the code is very hard to parse. Let's fix the gradient issue for now and open an issue to maybe refactor the code later, to make it easier to maintain.
So in the end this PR just fixes the bug that I introduced, and adds some tests, so if everybody agrees, I think we are good to merge? @perimosocordiae ?
Merged!
Fixes #46

There seem to be problems in LMNN that we might want to fix for the next release (note: to find out what to do, we can also compare to scikit-learn/scikit-learn#8602, cf. https://github.com/scikit-learn/scikit-learn/pull/8602/files#diff2914dd42441ed8b594e69b2fe04d23f5R762). This blocks PR [MRG] Uniformize num_dims and add it for LMNN #193:

https://github.com/metric-learn/metric-learn/blob/d4badc8adc0a8ce6689dd9a95e89181efaf5ee24/metric_learn/lmnn.py#L196
https://github.com/metric-learn/metric-learn/blob/d4badc8adc0a8ce6689dd9a95e89181efaf5ee24/metric_learn/lmnn.py#L198

I created a test that should pass when this is fixed (like the one in scikit-learn's PR).
I'll look into it, but I'm not yet that familiar with LMNN, so you might know how to fix it.