NMF regularization should not depend on n_samples #20484

Closed
jeremiedbb opened this issue Jul 7, 2021 · 9 comments · Fixed by #20512

jeremiedbb (Member) commented Jul 7, 2021

The objective function of NMF as described in the doc is (assuming l2 reg for simplicity)
0.5 ||X - W.H||² + alpha ||W||² + alpha ||H||²

Suppose I generate some datasets Xi as Wi @ H, based on the same set of components H but with different n_samples. I would expect that, for the same alpha, training an NMF model on these datasets with the same parameters would find similar components H.

However, in the objective function, the terms 0.5 ||X - W.H||² + alpha ||W||² depend on n_samples but alpha ||H||² doesn't. Thus, as n_samples increases, the regularization on H has less and less impact and we start overfitting H.
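
A minimal numpy sketch of this imbalance (hypothetical shapes, noise level and alpha, not the benchmark code behind the plots below): the data-fit and W terms grow with n_samples while the H term stays fixed.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_components, alpha = 30, 5, 0.1
H = rng.rand(n_components, n_features)  # shared components

for n_samples in (100, 1_000, 10_000):
    W = rng.rand(n_samples, n_components)
    X = W @ H + 0.01 * rng.rand(n_samples, n_features)  # noisy W @ H
    data_fit = 0.5 * np.linalg.norm(X - W @ H) ** 2  # grows with n_samples
    reg_W = alpha * np.linalg.norm(W) ** 2           # grows with n_samples
    reg_H = alpha * np.linalg.norm(H) ** 2           # does not
    print(f"n_samples={n_samples}: data_fit={data_fit:.2f} "
          f"reg_W={reg_W:.2f} reg_H={reg_H:.2f}")
```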

Here is a plot of the objective function across iterations for different values of n_samples (datasets built as described above). The objective function is normalized per sample so the values are comparable. As n_samples grows we reach lower and lower values of the objective function.
[Figure: normalized objective function vs. iterations, for different n_samples, current behavior]

I also recorded the norm of the fitted components (H) in each case: 734.8, 2161.0, 3362.2, 3575.3. The norms increase as n_samples grows because the regularization on H has less and less impact (the norm of the original H is 1337.5). Conversely, when n_samples decreases, the norm of H shrinks because the regularization on H becomes too strong relative to the regularization on W.

I propose rescaling the regularization on H to alpha * n_samples / n_features * ||H||², so that the regularization terms on H and W have the same order of magnitude.
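
As a quick sanity check (a hypothetical sketch, l2 case only): for random non-negative factors, the n_samples / n_features factor puts the two penalty terms on the same order of magnitude.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_components = 30, 5
for n_samples in (100, 1_000, 10_000):
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    reg_W = np.linalg.norm(W) ** 2                             # ~ n_samples * n_components / 3
    reg_H = (n_samples / n_features) * np.linalg.norm(H) ** 2  # same order once rescaled
    print(f"n_samples={n_samples}: reg_W={reg_W:.1f} rescaled reg_H={reg_H:.1f}")
```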

Here is a plot of the objective function for the same datasets with the proposed change. The optimization now reaches similar objective values regardless of the number of samples.
[Figure: normalized objective function vs. iterations, for different n_samples, proposed scaling]

The recorded norms of the fitted components are now 1067.6, 1043.7, 1051.3, 1049.8: we retrieve similar components regardless of n_samples.

ogrisel (Member) commented Jul 7, 2021

Could you please put labels on the axes of the plots? ;)

ogrisel (Member) commented Jul 7, 2021

It would be interesting to see the decomposition of the objective value as a stacked plot with the sum of 3 terms:

  • the contribution of the data-fit term
  • the contribution of the reg on the loadings
  • the contribution of the reg on the components

We could also add, as another line, the data-fit term computed on a held-out validation set.

This would make it possible to tune the optimal alpha for the 2 strategies for different values of n_samples and see whether:

  • the optimal alpha is stable enough when n_samples changes
  • the new strategy makes it possible to balance the 2 regularization terms in a sensible way, avoiding overfitting on both of them while not causing significant underfitting by penalizing one of the 2 matrices too much.
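
A rough sketch of such an experiment (hypothetical dataset sizes and alpha grid; it uses NMF's alpha parameter as it existed at the time of this discussion, and the implementation's exact constants may differ from the simplified objective above):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
n_components, n_features = 5, 30
H_true = rng.rand(n_components, n_features)
X_val = rng.rand(200, n_components) @ H_true  # held-out validation set

for n_samples in (100, 1_000):
    X = rng.rand(n_samples, n_components) @ H_true
    for alpha in (1e-4, 1e-2, 1e0):
        model = NMF(n_components=n_components, alpha=alpha,
                    max_iter=500, random_state=0)
        W = model.fit_transform(X)
        H = model.components_
        # decomposition of the training objective into its 3 terms
        data_fit = 0.5 * np.linalg.norm(X - W @ H) ** 2
        reg_W = alpha * np.linalg.norm(W) ** 2
        reg_H = alpha * np.linalg.norm(H) ** 2
        # data-fit term on the held-out set (W refit on X_val, H frozen)
        val_fit = 0.5 * np.linalg.norm(X_val - model.transform(X_val) @ H) ** 2
        print(n_samples, alpha, data_fit, reg_W, reg_H, val_fit)
```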

GaelVaroquaux (Member) commented

Right. Indeed, the optimization formula at the top of your post should be written as a sum of per-sample terms over the vectors x, which naturally makes an n_samples factor appear when the sum is expanded. NMF is sometimes formulated this way in the literature, and I agree that it is a cleaner way of doing things.

The n_features factor is, I believe, less conventional. It makes sense, and I remember having the same thoughts when doing research on applications of dictionary learning years ago, but I am a bit more uneasy about this part.

Note that the problem is quite general: penalties in scikit-learn are implemented using the textbook formulation (though for NMF, the textbook formulation should probably include the n_samples factor). I believe that dictionary learning suffers from this same problem. The Lasso should probably be implemented with lambda multiplied by log(p)/n, and the ridge lambda should be a multiple of the leading eigenvalue of the covariance matrix, or of trace(Sigma)/n_features as a cheap proxy.

All these suffer from the same problem: changes in data shape will impact the optimal penalty coefficients. Arguably, the worst problem is the dependency on n_samples, since the optimal penalty then varies with the amount of data for the same problem.

I don't know what our strategy should be in scikit-learn. We could slowly but surely move to these scalings; they would probably benefit end users. However, I can see students or people in the academic world being surprised.

I think that we should strive to be consistent in scikit-learn. I also expect the deprecation cycle to be brutal.

TomDLT (Member) commented Jul 8, 2021

See also #5296

jeremiedbb (Member, Author) commented

@ogrisel I edited the plots to display the contributions of the 3 terms of the objective function. Everything is normalized by n_samples to make values comparable.

> We could also add, as another line, the data-fit term computed on a held-out validation set.
>
> This would make it possible to tune the optimal alpha for the 2 strategies for different values of n_samples and see whether: […]

Here's a plot of the data fit computed on a validation set for models fitted on the 4 datasets with different n_samples, for several values of alpha, for the 2 strategies.
Current behavior:
[Figure: validation-set data fit vs. alpha for different n_samples, current behavior]

New scaling of the regularization:
[Figure: validation-set data fit vs. alpha for different n_samples, proposed scaling]

With the current behavior, the data fit on the validation set begins to grow at different values of alpha depending on n_samples, while with the proposed scaling it does not.

jeremiedbb (Member, Author) commented

@GaelVaroquaux I agree the n_features scaling seems less conventional, but it's necessary to keep a balance between the H and W regularizations. One solution could be to split the regularization into alpha_H and alpha_W and let the user deal with it, but that would not be very nice of us :)

Another motivation to fix it is MiniBatchNMF, where the issue is even more problematic to me because we fit on mini-batches with potentially different sizes (partial_fit), which means that with the current strategy the regularization does not have the same impact on each batch. Since MiniBatchNMF is being implemented, we can easily use the proposed strategy there, but I think it would be good to have a consistent strategy for the 2 estimators.

jeremiedbb (Member, Author) commented Jul 9, 2021

@TomDLT, glad we came to the same conclusion :) And actually your proposition is better, i.e. alpha * n_features * ||W||² + alpha * n_samples * ||H||². With my proposition the regularization does not keep the same balance w.r.t. the data fit when n_features varies, while yours does. This is illustrated by the 2 figures below, which show the data fit on a validation set (normalized by n_features * n_samples to make values comparable) versus alpha for different n_features.

alpha * ||W||² + alpha * n_samples / n_features * ||H||²
[Figure: validation data fit vs. alpha for different n_features]

alpha * n_features * ||W||² + alpha * n_samples * ||H||²
[Figure: validation data fit vs. alpha for different n_features]
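
In code, the two candidate scalings differ only in the per-term weights (a hypothetical helper, l2 case only). Note that the second is just n_features times the first, so it grows with n_features exactly like the data-fit term 0.5 * ||X - W @ H||² does, which is why it keeps the balance when n_features varies.

```python
import numpy as np

def penalty_v1(W, H, alpha):
    """First proposal: alpha * ||W||^2 + alpha * n_samples / n_features * ||H||^2."""
    n_samples, n_features = W.shape[0], H.shape[1]
    return alpha * (np.linalg.norm(W) ** 2
                    + n_samples / n_features * np.linalg.norm(H) ** 2)

def penalty_v2(W, H, alpha):
    """TomDLT's variant: alpha * n_features * ||W||^2 + alpha * n_samples * ||H||^2.

    Equal to n_features * penalty_v1, so it scales with n_features
    like the data-fit term does.
    """
    n_samples, n_features = W.shape[0], H.shape[1]
    return alpha * (n_features * np.linalg.norm(W) ** 2
                    + n_samples * np.linalg.norm(H) ** 2)
```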

jeremiedbb (Member, Author) commented Jul 9, 2021

Based on this discussion and on the discussion in #5296, here's what I propose:

  • Deprecate the alpha and regularization parameters.
  • Add 2 new parameters, alpha_W and alpha_H, to replace the 2 deprecated ones.
    They default to 0 to match the default of the deprecated ones.
    If they are non-zero and alpha is also non-zero, we use them instead of alpha.
    This way we can regularize only W or only H by setting one or the other to 0, instead of having a string parameter for that.
  • We solve 0.5 ||X - W.H||² + alpha_W * n_features * ||W||² + alpha_H * n_samples * ||H||², and change the docstrings accordingly.
  • If alpha_W and alpha_H are left at 0 and alpha is non-zero, we solve the unscaled problem as before.

I think the deprecation cycle is not too painful since, by default, there's no regularization, which means we can add these new parameters without changing the default behavior of the estimator.
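
A sketch of the parameter resolution this implies at fit time (hypothetical helper name and structure, not the actual implementation):

```python
def _resolve_l2_weights(alpha, alpha_W, alpha_H, n_samples, n_features):
    """Return the effective l2 weights applied to ||W||^2 and ||H||^2.

    Hypothetical sketch of the proposal: if the new parameters are set,
    use the scaled formulation; otherwise fall back to the deprecated,
    unscaled alpha.
    """
    if alpha_W != 0.0 or alpha_H != 0.0:
        # new, scaled behavior: both terms match the data-fit term's growth
        return alpha_W * n_features, alpha_H * n_samples
    # old, unscaled behavior (deprecated path)
    return alpha, alpha
```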

ogrisel (Member) commented Jul 9, 2021

We could also have NMF(alpha_W=1e-2, alpha_H="same") to make it possible to tune only alpha_W in a grid search, with the proposed scaling keeping the 2 regularization terms balanced whatever n_samples.

This would keep the convenience of tuning a single parameter, which should be enough most of the time, while still allowing alpha_W and alpha_H to be tuned independently for added flexibility.
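
A sketch of what usage could look like (incidentally, this is essentially the interface scikit-learn later adopted for NMF):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(0).rand(100, 30)  # any non-negative data

# one knob to tune: alpha_H="same" reuses the alpha_W value for the
# H penalty, with both terms scaled to stay balanced for any n_samples
model = NMF(n_components=5, alpha_W=1e-2, alpha_H="same", max_iter=500)
W = model.fit_transform(X)
H = model.components_
```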
