[MRG] FIX: fix MLKR cost and gradient #111
Conversation
```python
denom = np.maximum(K.sum(axis=0), EPS)
yhat = K.dot(y) / denom
ydiff = yhat - y
cost = (ydiff**2).sum()

# also compute the gradient
np.fill_diagonal(K, 1)
```
Why was the diagonal of `K` set to 1 here?
metric_learn/mlkr.py
```diff
@@ -98,14 +98,14 @@ def _loss(flatA, X, y, dX):
   A = flatA.reshape((-1, X.shape[1]))
   dist = pdist(X, metric='mahalanobis', VI=A.T.dot(A))
   K = squareform(np.exp(-dist**2))
   np.fill_diagonal(K, 0)
```
I think we need to set `K`'s diagonal to 0 here to prevent `K_ii` from being taken into account in the sum called `denom`, and in the dot product with `y` (for the cost function). And although the diagonal is also involved in the gradient, the gradient should still be correct: whatever the diagonal of `K` is, it gets canceled out by the zeros in `dX`.
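To make the argument concrete, here is a minimal sketch of the cost computation assembled from the snippets above (the value of `EPS` and the function name are assumptions, not the exact library code): each prediction `yhat_i` should be a leave-one-out kernel regression estimate, so `K_ii` must not contribute to `denom` or to `K.dot(y)`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

EPS = np.finfo(float).eps  # assumed value of the EPS constant used in the diff


def mlkr_cost_sketch(A, X, y):
    # Mahalanobis distances under the metric A.T.dot(A), as in the diff above
    dist = pdist(X, metric='mahalanobis', VI=A.T.dot(A))
    K = squareform(np.exp(-dist ** 2))
    # squareform already leaves the diagonal at 0; filling it explicitly makes
    # the leave-one-out intent clear: K_ii must not enter denom or K.dot(y)
    np.fill_diagonal(K, 0)
    denom = np.maximum(K.sum(axis=0), EPS)
    yhat = K.dot(y) / denom
    return ((yhat - y) ** 2).sum()
```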
I just pushed a version that uses [...]

```python
from sklearn.datasets import make_regression
from metric_learn import MLKR
import numpy as np

np.random.seed(0)
X, y = make_regression()
mlkr = MLKR()
mlkr.fit(X, y)
```

Before commit 1143e59:

```
Optimization terminated successfully.
Current function value: 1753134.478289
Iterations: 0
Function evaluations: 1
Gradient evaluations: 1
```

After commit 1143e59:

```
Warning: Maximum number of iterations has been exceeded.
Current function value: 1215299.266080
Iterations: 1000
Function evaluations: 1792
Gradient evaluations: 1792
```
I just added a commit that makes [...]

Code:

```python
from sklearn.datasets import make_regression
from metric_learn import MLKR
import numpy as np

np.random.seed(0)
X, y = make_regression()
mlkr = MLKR()
mlkr.fit(X, y)
```

Result (with `disp=1` set inside the minimization):

```
Warning: Maximum number of iterations has been exceeded.
Current function value: 1248855.833869
Iterations: 1000
Function evaluations: 1791
Gradient evaluations: 1791
```
Can you explain a bit more the last change, and why it introduces some numerical differences with the previous case? Did you track down the extent of these differences? Maybe it would be useful to compare both versions by showing the objective value at each iteration, and for several random seeds, to check that the behavior is consistent. Maybe also use a standard small regression dataset (e.g. `boston`).
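One possible way to do the per-iteration comparison suggested here, as a sketch only: it assumes the two loss variants are available as `_loss_optimized` / `_loss_non_optimized` (as in the gist shared later in this thread), that each accepts a flattened `A` and returns a `(cost, gradient)` pair, and it swaps in `L-BFGS-B` for convenience.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_boston
from losses import _loss_optimized, _loss_non_optimized  # gist module (assumed API)

X, y = load_boston(return_X_y=True)
rng = np.random.RandomState(0)
A0 = rng.randn(X.shape[1], X.shape[1]).ravel()

history = {}
for name, loss in [('optimized', _loss_optimized),
                   ('non_optimized', _loss_non_optimized)]:
    values = []
    minimize(lambda a: loss(a, X, y)[0], A0,
             jac=lambda a: loss(a, X, y)[1].ravel(),
             method='L-BFGS-B',
             # record the objective at the end of each iteration
             callback=lambda a: values.append(loss(a, X, y)[0]))
    history[name] = values

for name, values in history.items():
    print(name, values[:10])
```

Repeating this over a few seeds would show whether the two curves diverge consistently or only for particular initializations.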
The change amounts to expanding some squared differences, in a similar way to [...]. Besides, similarly to [...]. Finally, we transform the [...]. In any case I agree, I'll run a quick benchmark to see the extent of these numerical errors.
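For readers following along, the kind of expansion referred to here is presumably the standard identity below (my reconstruction, since the references in this comment were stripped from the rendering): the pairwise weighted sum of squared-difference outer products can be rewritten in terms of row/column sums of `W`.

```latex
\sum_{i,j} W_{ij}\,(x_i - x_j)(x_i - x_j)^\top
  = X^\top \bigl(\operatorname{diag}(W\mathbf{1}) + \operatorname{diag}(W^\top\mathbf{1}) - W - W^\top\bigr) X
```

Here `X` stacks the samples as rows and `1` is the all-ones vector. This is why the expanded form of the gradient can be written with `W.sum(axis=0)`, `W.sum(axis=1)` and `W + W.T` instead of looping over pairs, as in the expression discussed further down in this thread.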
Yes, what I meant to ask is whether you verified that, for several matrices A, the cost and gradient you find are the same (up to some small difference) with/without your optimization. Regarding the last point: you could use an `if` condition to choose between the two ways of computing the gradient, depending on the relative values of [...].
I didn't run an intensive benchmark yet, but I ran the snippet above again with L-BFGS-B instead of CG (where printing the cost function and gradient value at each iteration is easier, see #105 (comment)). We see that around iteration 20, the gradient seems to diverge between the two implementations. I agree the make_regression dataset is probably not a good idea; I'll try the same with a more standard dataset like boston (even [...]).

With expanded form (new form): [output not captured in this rendering]

With factorized form (old form): [output not captured in this rendering]
No, but indeed that makes sense; I'll run a quick snippet to see this.
Why not use `np.linalg.multi_dot`? (It is quite recent though: it was introduced in numpy 1.13.)
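For context, a tiny illustration of what `np.linalg.multi_dot` does (not code from this PR): it evaluates a chain of matrix products while choosing the cheapest association order.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(500, 10)
B = rng.randn(10, 10)
C = rng.randn(10, 500)

# Same result as A.dot(B).dot(C), but multi_dot picks the multiplication
# order that minimizes the total number of scalar operations.
out = np.linalg.multi_dot([A, B, C])
assert np.allclose(out, A.dot(B).dot(C))
```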
Yes, I think comparing for the same matrix A makes more sense (in your comparison above, the small differences accumulate along the iterations). Maybe you can run the new code and, along the iterations, also compute the objective/gradient at the current point to compare. Or simply generate a bunch of random PSD matrices A and compare. Indeed [...]
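A possible way to draw the random PSD matrices mentioned here (a sketch, not code from the PR; note that in MLKR the metric `A.T.dot(A)` is PSD by construction for any transformation `A`):

```python
import numpy as np

rng = np.random.RandomState(0)
d = 13  # e.g. the number of features in the boston dataset

B = rng.randn(d, d)
A = B.dot(B.T)  # symmetric positive semi-definite by construction
assert np.all(np.linalg.eigvalsh(A) >= -1e-10)
```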
Here is a gist to compare gradients/loss between the optimized and non-optimized versions: https://gist.github.com/wdevazelhes/8c9f5fdc53ed6e7bbe8d8f958351db85 I tried with [...]
Here is also another snippet, this time with the `boston` dataset:

```python
from metric_learn import MLKR
from sklearn.utils import check_random_state
import numpy as np
from losses import _loss_non_optimized, _loss_optimized
from collections import defaultdict
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

for seed in range(5):
    rng = check_random_state(seed)
    A = rng.randn(X.shape[1], X.shape[1])
    print('gradient differences:')
    print(np.linalg.norm(_loss_optimized(A, X, y)[1]
                         - _loss_non_optimized(A, X, y)[1]))
    print('loss differences:')
    print(_loss_optimized(A, X, y)[0]
          - _loss_non_optimized(A, X, y)[0])
```

Results:

```
gradient differences:
0.0002306125088543176
loss differences:
0.0
gradient differences:
0.00022796999902744875
loss differences:
0.0
gradient differences:
0.0001696791546658249
loss differences:
0.0
gradient differences:
0.0004592041477002949
loss differences:
0.0
gradient differences:
0.00042782581177926474
loss differences:
0.0
```
I tried the above snippet putting back the supposedly null `W.sum(axis=1)` in the sum, so the gradient expression becomes:

```python
grad = (4 * (X_emb_t * (W.sum(axis=0) + W.sum(axis=1)) - X_emb_t.dot(W + W.T)).dot(X))
```

And this time the difference is really lower (see below), so I guess the `W.sum(axis=1)` term was maybe compensating some errors? I will look into it further, but I thought it was supposed to be 0...

```
gradient differences:
2.021742238365209e-09
loss differences:
0.0
gradient differences:
6.28459815257491e-10
loss differences:
0.0
gradient differences:
6.62783153933958e-10
loss differences:
0.0
gradient differences:
1.4157154963156227e-09
loss differences:
0.0
gradient differences:
2.5947199373104065e-08
loss differences:
0.0
```
You are looking at the absolute difference between the two gradients, right? This can be misleading, as it ignores the norm of the gradient. You should divide the difference by the norm of the first gradient to get a relative error.
About [...]
True, here is the new result with relative differences:

```python
from metric_learn import MLKR
from sklearn.utils import check_random_state
import numpy as np
from losses import _loss_non_optimized, _loss_optimized
from collections import defaultdict
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

for seed in range(5):
    rng = check_random_state(seed)
    A = rng.randn(X.shape[1], X.shape[1])
    print('gradient differences:')
    print(np.linalg.norm((_loss_optimized(A, X, y)[1]
                          - _loss_non_optimized(A, X, y)[1])
                         / np.linalg.norm(_loss_optimized(A, X, y)[1])))
    print('loss differences:')
    print(_loss_optimized(A, X, y)[0]
          - _loss_non_optimized(A, X, y)[0])
```
You're right. I'll try to compare the two computations of the gradient with the finite-difference approximation of it (considered as the "true" gradient); maybe it can give more insight.
Here is a snippet that prints the relative difference of the gradient compared with the finite-difference approximation, for the optimized and non-optimized versions:

```python
from sklearn.datasets import load_boston
from scipy.optimize import check_grad
import numpy as np
from losses import _loss_optimized, _loss_non_optimized
from collections import defaultdict
from sklearn.utils import check_random_state

X, y = load_boston(return_X_y=True)


def finite_diff(loss, seed):
    rng = check_random_state(seed)
    M = rng.randn(rng.randint(1, X.shape[1] + 1), X.shape[1])

    def fun(M):
        return loss(M, X, y)[0]

    def grad(M):
        return loss(M, X, y)[1].ravel()

    rel_diff = check_grad(fun, grad, M.ravel()) / np.linalg.norm(grad(M))
    return rel_diff


if __name__ == '__main__':
    differences = defaultdict(list)
    for seed in range(3):
        for loss in [_loss_optimized, _loss_non_optimized]:
            differences[loss.__name__].append(finite_diff(loss, seed))
    means = dict()
    variances = dict()
    for loss in [_loss_optimized, _loss_non_optimized]:
        means[loss.__name__] = np.mean(differences[loss.__name__])
        variances[loss.__name__] = np.var(differences[loss.__name__])
    print(differences)
    print('means: {}'.format(means))
    print('variances: {}'.format(variances))
```

Results: [output not captured in this rendering]
These results support the fact that your optimized method computes a correct gradient (the variation with respect to the previous version is negligible). I think we can merge.

Looks good, especially with the new test case. Merged.
Running `scipy.optimize.check_grad` on the MLKR loss function returned an error. This PR fixes the gradient and adds a non-regression test. I also modified the cost function (filling the diagonal of `K` with zeros); see comments in the code below.
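For reference, a rough sketch of the kind of check described here. The import path, the `dX` argument, and the return value of `_loss` are inferred from the diff shown earlier in this thread (`_loss(flatA, X, y, dX)`), so treat all of them as assumptions that may not match every version of the library.

```python
import numpy as np
from scipy.optimize import check_grad
from sklearn.datasets import load_boston
from metric_learn.mlkr import _loss  # assumed import path for the private loss

X, y = load_boston(return_X_y=True)
# dX is assumed to be the flattened array of pairwise row differences
dX = (X[None] - X[:, None]).reshape((-1, X.shape[1]))

rng = np.random.RandomState(0)
A = rng.randn(X.shape[1], X.shape[1])

# check_grad compares the analytic gradient with a finite-difference estimate;
# _loss is assumed to return a (cost, gradient) pair. A large value here is
# the kind of error that motivated this fix.
err = check_grad(lambda a: _loss(a, X, y, dX)[0],
                 lambda a: np.ravel(_loss(a, X, y, dX)[1]),
                 A.ravel())
print('check_grad error:', err)
```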