
[MRG] FIX LMNN gradient and cost function #201


Merged

Conversation

wdevazelhes (Member)

Fixes #46
There seem to be problems in LMNN that we might want to fix before the next release:

(Note: to find out what to do we can also compare to scikit-learn/scikit-learn#8602 )

https://github.com/metric-learn/metric-learn/blob/d4badc8adc0a8ce6689dd9a95e89181efaf5ee24/metric_learn/lmnn.py#L196

https://github.com/metric-learn/metric-learn/blob/d4badc8adc0a8ce6689dd9a95e89181efaf5ee24/metric_learn/lmnn.py#L198

I created a test that should pass when this is fixed (like the one in scikit-learn's PR)

I'll look into it, but I'm not yet that familiar with LMNN, so you might know how to fix it.

@wdevazelhes wdevazelhes added this to the v0.5.0 milestone May 13, 2019
@wdevazelhes wdevazelhes changed the title [WIP] FIX LMNN gradient [WIP] FIX LMNN gradient and cost function May 13, 2019
@bellet (Member)

bellet commented May 14, 2019

You must be careful when comparing with scikit-learn/scikit-learn#8602, because that version uses a squared hinge loss, which changes both the cost function and its gradient.

@bellet (Member) left a comment:

But I agree that there seems to be a problem with both the gradient and the cost: L should indeed appear in the gradient, since the objective is quadratic in L, and the cost should indeed sum the margin violations, not count them.
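
For reference, this is the objective from the original LMNN paper (Weinberger & Saul), in the linear-map parametrization, with \mu the push/pull trade-off, j \rightsquigarrow i ranging over target neighbors, and y_{il} = 1 iff x_i and x_l share a label:

\varepsilon(L) = (1 - \mu) \sum_{j \rightsquigarrow i} \|L(x_i - x_j)\|^2 + \mu \sum_{j \rightsquigarrow i} \sum_l (1 - y_{il}) \big[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \big]_+

Both sums are quadratic in L, hence the factor of L in the gradient; and the hinge terms [\cdot]_+ contribute their violation values to the cost, not just their count.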

@wdevazelhes (Member, Author)

I realized that in fact there was just one matrix product with L missing; I added it, and also added a toy example for the loss function.

@@ -191,10 +191,12 @@ def _loss_grad(self, X, L, dfG, impostors, it, k, reg, target_neighbors, df,
# do the gradient update
assert not np.isnan(df).any()
G = dfG * reg + df * (1 - reg)

+ grad = 2 * L.dot(G)
@wdevazelhes (Author):

I put the multiplication by 2 here so that we can test the real gradient using check_grad.
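
For context, a minimal, generic example of scipy.optimize.check_grad (not the actual test): it compares an analytic gradient against a finite-difference estimate and returns the norm of their difference, which should be tiny when the gradient is exact.

import numpy as np
from scipy.optimize import check_grad

def fun(w):
    return 0.5 * w.dot(w)

def grad(w):
    return w

print(check_grad(fun, grad, np.array([1.0, -2.0, 3.0])))  # ~1e-7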

@@ -191,10 +191,12 @@ def _loss_grad(self, X, L, dfG, impostors, it, k, reg, target_neighbors, df,
# do the gradient update
assert not np.isnan(df).any()
G = dfG * reg + df * (1 - reg)

+ grad = 2 * L.dot(G)
# compute the objective function
objective = total_active * (1 - reg)
@wdevazelhes (Author):

In fact this line was right: it's the term coming from the "+1" that is in the hinge loss (cf. the question in #46).

@wdevazelhes (Member, Author)

I guess it is ready to merge now

@wdevazelhes wdevazelhes changed the title [WIP] FIX LMNN gradient and cost function [MRG] FIX LMNN gradient and cost function May 21, 2019
@perimosocordiae (Contributor) left a comment:

It's been a while since I've read the LMNN paper or looked at the shogun implementation, so I'm not sure whether this change is correct. I suspect there are a few different LMNN formulations floating around the web at this point, with slightly different characteristics. For example, I think the scikit-learn PR's implementation uses a modification of the original loss to make it more amenable to optimization with scipy.optimize.

Empirically, how does this PR affect the behavior of LMNN on iris / make_blobs / sandwich / etc.?

# want to have the same result when applying the function twice
return lmnn._loss_grad(X, L.reshape(-1, X.shape[1]), dfG, impostors,
1, k, reg, target_neighbors, df.copy(),
list(a1), list(a2))[0].ravel()
@perimosocordiae (Contributor):

This test seems particularly brittle, as it replicates so much LMNN functionality. Ideally we could structure the LMNN code to make this simpler, but at the least we should avoid copy-paste here:

def loss_grad(flat_L):
  return lmnn._loss_grad(X, flat_L.reshape(...), ...)

fun = lambda x: loss_grad(x)[1]
grad = lambda x: loss_grad(x)[0].ravel()

@wdevazelhes (Author):

Yes, you're right, let's go for this clearer expression.
Done.
(Note: I didn't use the lambda expressions because flake8 flagged them, but maybe that's something we want to ignore and still go for the lambdas?)
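
For the record, a def-based variant of the suggestion above that keeps the same structure while avoiding flake8's complaint about assigning lambdas (E731); loss_grad is the wrapper sketched in the review comment:

def fun(flat_L):
    return loss_grad(flat_L)[1]

def grad(flat_L):
    return loss_grad(flat_L)[0].ravel()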

(np.array([[0], [1], [2], [3]]),
[1, 0, 0, 1], 26.)])
def test_toy_ex_lmnn(X, y, loss):
"""Test that the loss give the right result on a toy example"""
@perimosocordiae (Contributor):

Where did you get these values from?

@wdevazelhes (Author):

I computed them by hand, following the paper (note that it's only the loss function, and that I set L to the identity, so the distances are easy to get from this 1D example).
Maybe I can add a small detail of the computation as a comment?
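
For illustration, a sketch of the by-hand computation for the toy example above (X = [[0], [1], [2], [3]], y = [1, 0, 0, 1]), assuming k=1 target neighbors, L = identity, and the default regularization 0.5 weighting the pull and push terms equally — a hypothetical reconstruction, not the test code:

# Points 0, 1, 2, 3 on a line; classes y = [1, 0, 0, 1], so the target
# neighbors (nearest same-class point) are 0 <-> 3 and 1 <-> 2.
pull = 9 + 9 + 1 + 1  # squared distances to target neighbors = 20

def hinge(v):
    return max(0.0, v)

# Push term: hinge(1 + d(i, target)^2 - d(i, impostor)^2) over all impostors.
push = (hinge(1 + 9 - 1) + hinge(1 + 9 - 4)     # i=0: impostors 1, 2
        + hinge(1 + 1 - 1) + hinge(1 + 1 - 4)   # i=1: impostors 0, 3
        + hinge(1 + 1 - 4) + hinge(1 + 1 - 1)   # i=2: impostors 0, 3
        + hinge(1 + 9 - 4) + hinge(1 + 9 - 1))  # i=3: impostors 1, 2
# push = 9 + 6 + 1 + 0 + 0 + 1 + 6 + 9 = 32

loss = 0.5 * pull + 0.5 * push  # = 10 + 16 = 26.0
print(loss)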

@wdevazelhes (Member, Author)

wdevazelhes commented May 21, 2019

> Empirically, how does this PR affect the behavior of LMNN on iris / make_blobs / sandwich / etc.?

This is the difference between master and this PR for LMNN:

master: [screenshot: sandwich example]

this PR: [screenshot: sandwich example]

So I'd say a significant improvement; I'll check the other datasets.

@wdevazelhes (Member, Author)

wdevazelhes commented May 21, 2019

But I agree I haven't been through every computation of the algorithm: I more or less just checked that the loss is correct at initialization on the toy examples, but I didn't check the part that recomputes which constraints are active after each iteration, etc.
However, if we assume the loss is correct, then from the code it has the form:
loss = constant + tr(A L.T L), with A = dfG * reg + df * (1 - reg) constant in L,
so the gradient is necessarily grad = L (A + A.T) = 2 * L A, since A is symmetric (following equation (3) of these course notes: http://cs229.stanford.edu/notes/cs229-notes1.pdf).
So I guess all that remains is to check whether the matrix A is the right one (i.e. that it contains the right outer-product differences)?
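
A quick numerical sanity check of that identity (an illustrative sketch; here A is a random symmetric matrix standing in for the LMNN sum of outer-product differences):

import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.RandomState(0)
d = 4
A = rng.randn(d, d)
A = A + A.T  # symmetric, like the LMNN matrix

def fun(flat_L):
    L = flat_L.reshape(d, d)
    return np.trace(A.dot(L.T).dot(L))  # tr(A L.T L)

flat_L0 = rng.randn(d * d)
analytic = (2 * flat_L0.reshape(d, d).dot(A)).ravel()  # 2 * L A
numeric = approx_fprime(flat_L0, fun, 1e-6)
print(np.abs(analytic - numeric).max())  # ~1e-5 or smaller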

@wdevazelhes (Member, Author)

For iris, we don't see a big difference:

from sklearn.datasets import load_iris
from metric_learn import LMNN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

red_dim = make_pipeline(LMNN(), TSNE())
X_e = red_dim.fit_transform(X, y)

plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()

With master: [screenshot: iris t-SNE embedding]

With this PR: [screenshot: iris t-SNE embedding]

@wdevazelhes (Member, Author)

On this example (inspired by #180, i.e. a dataset with two informative features and large noise on the other features), the difference is pretty clear:

from sklearn.utils import shuffle
from sklearn.datasets import make_classification
from metric_learn import LMNN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=42, n_redundant=0, shuffle=False)
X[:, 2:] *= 10
X, y = shuffle(X, y)

red_dim = make_pipeline(LMNN(), TSNE())
X_e = red_dim.fit_transform(X, y)

plt.scatter(X_e[:, 0], X_e[:, 1], c=y)
plt.show()

With this PR: [screenshot: t-SNE embedding]

With master: [screenshot: t-SNE embedding]

@wdevazelhes (Member, Author)

Sorry, my last commit was for the wrong branch; I'll revert it.

William de Vazelhes added 2 commits May 22, 2019 10:40
@bellet (Member)

bellet commented May 24, 2019

> > Empirically, how does this PR affect the behavior of LMNN on iris / make_blobs / sandwich / etc.?
>
> This is the difference between master and this PR for LMNN: [screenshots of the sandwich example above]
>
> So I'd say a significant improvement; I'll check the other datasets.

This makes me think that we should fix a random seed in the sandwich example so that it is stable across runs.
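
For instance, a hypothetical one-liner at the top of the example script:

import numpy as np
np.random.seed(42)  # fix the global seed so the demo is reproducible across runs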

@bellet (Member)

bellet commented May 24, 2019

What about the problem in the objective function?

@wdevazelhes (Member, Author)

> What about the problem in the objective function?

You mean the one with the count being added? I think it may actually be normal (see comment #201 (comment)). In fact this count should be the sum of all the "+1" terms (the margin) from the hinge losses that are not 0 in the total sum. (But I haven't checked yet whether total_active is indeed this number of active hinge losses.)

- objective += G.flatten().dot(L.T.dot(L).flatten())
- return G, objective, total_active, df, a1, a2
+ objective += G.flatten().dot(L.flatten())
+ return 2 * G, objective, total_active, df, a1, a2
@bellet (Member):

I guess the factor 2 comes from differentiating L.T.dot(L). It would be clearer to put it directly in G then

@bellet (Member)

bellet commented May 24, 2019

> > What about the problem in the objective function?
>
> You mean the one with the count being added? I think it may actually be normal (see comment #201 (comment)). In fact this count should be the sum of all the "+1" terms (the margin) from the hinge losses that are not 0 in the total sum. (But I haven't checked yet whether total_active is indeed this number of active hinge losses.)

The hinge loss is equal to max(0, 1 - a) (in the case of LMNN, a is the distance to the impostor minus the distance to the target neighbor).
It is equal to 0 when the margin is satisfied (a >= 1); if not, it is equal to the margin violation (1 - a), not "+1"...

@bellet (Member) left a comment:

If I'm not mistaken, you should compute something like

violation1 = np.maximum(0, g1 - g0)
violation2 = np.maximum(0, g2 - g0)
total_active += (violation1 > 0).sum() + (violation2 > 0).sum()
total_violation += violation1.sum() + violation2.sum()

and replace objective = total_active * (1 - reg) with objective = total_violation * (1 - reg).

(You can keep using total_active as an interesting quantity to monitor, though.)

You can run some executions showing how the objective function evolves, to make sure it goes down at each iteration when a small enough step size is used, and compare with the version in master (the number of active constraints should tend to go down, but not necessarily at each iteration).
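
As a generic illustration of the check meant here (a toy quadratic objective, not the LMNN code): with a small enough step size, each gradient step should decrease the objective.

import numpy as np

def objective(w):
    return 0.5 * w.dot(w)

def gradient(w):
    return w

w = np.ones(3)
prev = np.inf
for it in range(20):
    obj = objective(w)
    assert obj <= prev, "objective increased: step size too large?"
    prev = obj
    w -= 0.1 * gradient(w)  # small enough step for this toy problem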

@wdevazelhes (Member, Author)

> The hinge loss is equal to max(0, 1 - a) (in the case of LMNN, a is the distance to the impostor minus the distance to the target neighbor).
> It is equal to 0 when the margin is satisfied (a >= 1); if not, it is equal to the margin violation (1 - a), not "+1"...

I was not thinking that the "+1" alone was the value of the hinge loss; I was thinking it was the "+1" part of the hinge loss (sum(1 - a) = sum(1) - sum(a) = total_active - sum(a)). I had the impression that the "df" part later (some outer products), through its operations with L, accounted for the sum of the "a"s (distance to the impostor minus distance to the target neighbor), so that in total we get something like: push_loss = total_active - sum of the differences between the distance to the impostor and the distance to the target neighbor.
But I definitely need to check that this is indeed the case; I haven't dug into the "df" computation part.

> You can run some executions showing how the objective function evolves, to make sure it goes down at each iteration when a small enough step size is used, and compare with the version in master (the number of active constraints should tend to go down, but not necessarily at each iteration).

Yes, good idea, I'll do that.

@wdevazelhes (Member, Author)

wdevazelhes commented May 29, 2019

Oops, in fact I introduced this bug myself in PR #101: I forgot to multiply by L when applying the update...

See comments https://github.com/metric-learn/metric-learn/pull/101/files#r288573192 and https://github.com/metric-learn/metric-learn/pull/101/files#r288573007

@bellet (Member)

bellet commented May 29, 2019

Ok, fair enough: the objective may be correct, but the code is very hard to parse. Let's fix the gradient issue for now and open an issue to maybe refactor the code later, to make it easier to maintain.

@wdevazelhes (Member, Author)

So in the end this PR just fixes the bug that I introduced, and adds some tests. So if everybody agrees, I think we are good to merge? @perimosocordiae?

@perimosocordiae perimosocordiae merged commit 187b59e into scikit-learn-contrib:master May 29, 2019
@perimosocordiae (Contributor)

Merged!

@wdevazelhes wdevazelhes mentioned this pull request Jun 14, 2019