NMF regularization should not depend on n_samples #20484

Closed
jeremiedbb opened this issue Jul 7, 2021 · 9 comments · Fixed by #20512

jeremiedbb (Member) commented Jul 7, 2021

The objective function of NMF as described in the doc is (assuming l2 reg for simplicity)
0.5 ||X - W.H||² + alpha ||W||² + alpha ||H||²

Suppose I generate some datasets Xi as Wi @ H, based on the same set of components H but with different n_samples. I would expect that, for the same alpha, training an NMF model on these datasets with the same parameters would find similar components H.

However, in the objective function, the terms 0.5 ||X - W.H||² + alpha ||W||² depend on n_samples but alpha ||H||² doesn't. Thus, as n_samples increases, the regularization on H has less and less impact and we start overfitting H.
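
A minimal numpy sketch of this imbalance (hypothetical shapes, noise level and alpha, not the benchmark code behind the plots below): the data-fit and W terms grow with n_samples while the H term stays fixed.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_components, alpha = 30, 5, 0.1
H = rng.rand(n_components, n_features)  # shared components

for n_samples in (100, 1_000, 10_000):
    W = rng.rand(n_samples, n_components)
    X = W @ H + 0.01 * rng.rand(n_samples, n_features)  # noisy W @ H
    data_fit = 0.5 * np.linalg.norm(X - W @ H) ** 2  # grows with n_samples
    reg_W = alpha * np.linalg.norm(W) ** 2           # grows with n_samples
    reg_H = alpha * np.linalg.norm(H) ** 2           # does not
    print(f"n_samples={n_samples}: data_fit={data_fit:.2f} "
          f"reg_W={reg_W:.2f} reg_H={reg_H:.2f}")
```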

Here is a plot of the objective function across iterations for different values of n_samples (datasets built as described above). The objective function is normalized per sample so the values are comparable. As n_samples grows we reach lower and lower values of the objective function.
[Figure: normalized objective function vs. iterations, for different n_samples, current behavior]

I also recorded the norm of the fitted components (H) in each case: 734.8, 2161.0, 3362.2, 3575.3. The norms increase as n_samples grows because the regularization on H has less and less impact (the norm of the original H is 1337.5). Conversely, when n_samples decreases, the norm of H shrinks because the regularization on H becomes too strong relative to the regularization on W.

I propose rescaling the regularization on H to alpha * n_samples / n_features * ||H||², so that the regularization terms on H and W have the same order of magnitude.
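
As a quick sanity check (a hypothetical sketch, l2 case only): for random non-negative factors, the n_samples / n_features factor puts the two penalty terms on the same order of magnitude.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_components = 30, 5
for n_samples in (100, 1_000, 10_000):
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    reg_W = np.linalg.norm(W) ** 2                             # ~ n_samples * n_components / 3
    reg_H = (n_samples / n_features) * np.linalg.norm(H) ** 2  # same order once rescaled
    print(f"n_samples={n_samples}: reg_W={reg_W:.1f} rescaled reg_H={reg_H:.1f}")
```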

Here is a plot of the objective function for the same datasets with the proposed change. The optimization now reaches similar objective values regardless of the number of samples.
[Figure: normalized objective function vs. iterations, for different n_samples, proposed scaling]

The recorded norms of the fitted components are now 1067.6, 1043.7, 1051.3, 1049.8: we retrieve similar components regardless of n_samples.

ogrisel (Member) commented Jul 7, 2021

Could you please put labels on the axes of the plots? ;)

ogrisel (Member) commented Jul 7, 2021

It would be interesting to see the decomposition of the objective value as a stacked plot with the sum of 3 terms:

  • the contribution of the data-fit term
  • the contribution of the reg on the loadings
  • the contribution of the reg on the components

We could also add, as another line, the data-fit term computed on a held-out validation set.

This would make it possible to tune the optimal alpha for the 2 strategies for different values of n_samples and see whether:

  • the optimal alpha is stable enough when n_samples changes
  • the new strategy makes it possible to balance the 2 regularization terms in a sensible way, avoiding overfitting on both of them while not causing significant underfitting by penalizing one of the 2 matrices too much.
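
A rough sketch of such an experiment (hypothetical dataset sizes and alpha grid; it uses NMF's alpha parameter as it existed at the time of this discussion, and the implementation's exact constants may differ from the simplified objective above):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
n_components, n_features = 5, 30
H_true = rng.rand(n_components, n_features)
X_val = rng.rand(200, n_components) @ H_true  # held-out validation set

for n_samples in (100, 1_000):
    X = rng.rand(n_samples, n_components) @ H_true
    for alpha in (1e-4, 1e-2, 1e0):
        model = NMF(n_components=n_components, alpha=alpha,
                    max_iter=500, random_state=0)
        W = model.fit_transform(X)
        H = model.components_
        # decomposition of the training objective into its 3 terms
        data_fit = 0.5 * np.linalg.norm(X - W @ H) ** 2
        reg_W = alpha * np.linalg.norm(W) ** 2
        reg_H = alpha * np.linalg.norm(H) ** 2
        # data-fit term on the held-out set (W refit on X_val, H frozen)
        val_fit = 0.5 * np.linalg.norm(X_val - model.transform(X_val) @ H) ** 2
        print(n_samples, alpha, data_fit, reg_W, reg_H, val_fit)
```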

GaelVaroquaux (Member) commented

Right. Indeed, the optimization formula at the top of your post should be written as a sum of per-sample terms over the vectors x, which naturally makes an n_samples factor appear when the sum is expanded. NMF is sometimes formulated this way in the literature, and I agree that it is a cleaner way of doing things.

The n_features factor is, I believe, less conventional. It makes sense, and I remember having the same thoughts when doing research on applications of dictionary learning years ago, but I am a bit more uneasy about this part.

Note that the problem is quite general: penalties in scikit-learn are implemented using the textbook formulation (though for NMF, the textbook formulation should probably include the n_samples factor). I believe that dictionary learning suffers from this same problem. The Lasso should probably be implemented with lambda multiplied by log(p)/n, and the ridge lambda should be a multiple of the leading eigenvalue of the covariance matrix, or of trace(Sigma)/n_features as a cheap proxy.

All these suffer from the same problem: changes in data shape will impact the optimal penalty coefficients. Arguably, the worst problem is the dependency on n_samples, since the optimal penalty then varies with the amount of data for the same problem.

I don't know what our strategy should be in scikit-learn. We could slowly but surely move to these scalings; they would probably benefit end users. However, I can see students or people in the academic world being surprised.

I think that we should strive to be consistent in scikit-learn. I also expect the deprecation cycle to be brutal.

TomDLT (Member) commented Jul 8, 2021

See also #5296

jeremiedbb (Member, Author) commented

@ogrisel I edited the plots to display the contributions of the 3 terms of the objective function. Everything is normalized by n_samples to make values comparable.

> We could also add, as another line, the data-fit term computed on a held-out validation set.
>
> This would make it possible to tune the optimal alpha for the 2 strategies for different values of n_samples and see whether: […]

Here's a plot of the data fit computed on a validation set for models fitted on the 4 datasets with different n_samples, for several values of alpha, for the 2 strategies.
Current behavior:
[Figure: validation-set data fit vs. alpha for different n_samples, current behavior]

New scaling of the regularization:
[Figure: validation-set data fit vs. alpha for different n_samples, proposed scaling]

With the current behavior, the data fit on the validation set begins to grow at different values of alpha depending on n_samples, while with the proposed scaling it does not.

jeremiedbb (Member, Author) commented

@GaelVaroquaux I agree the n_features scaling seems less conventional, but it's necessary to keep a balance between the H and W regularizations. One solution could be to split the regularization into alpha_H and alpha_W and let the user deal with it, but that would not be very nice of us :)

Another motivation to fix it is MiniBatchNMF, where the issue is even more problematic to me because we fit on mini-batches with potentially different sizes (partial_fit), which means that with the current strategy the regularization does not have the same impact on each batch. Since MiniBatchNMF is being implemented, we can easily use the proposed strategy there, but I think it would be good to have a consistent strategy for the 2 estimators.

jeremiedbb (Member, Author) commented Jul 9, 2021

@TomDLT, glad we came to the same conclusion :) And actually your proposition is better, i.e. alpha * n_features * ||W||² + alpha * n_samples * ||H||². With my proposition the regularization does not keep the same balance w.r.t. the data fit when n_features varies, while yours does. This is illustrated by the 2 figures below, which show the data fit on a validation set (normalized by n_features * n_samples to make values comparable) versus alpha for different n_features.

alpha * ||W||² + alpha * n_samples / n_features * ||H||²
[Figure: validation data fit vs. alpha for different n_features]

alpha * n_features * ||W||² + alpha * n_samples * ||H||²
[Figure: validation data fit vs. alpha for different n_features]
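
In code, the two candidate scalings differ only in the per-term weights (a hypothetical helper, l2 case only). Note that the second is just n_features times the first, so it grows with n_features exactly like the data-fit term 0.5 * ||X - W @ H||² does, which is why it keeps the balance when n_features varies.

```python
import numpy as np

def penalty_v1(W, H, alpha):
    """First proposal: alpha * ||W||^2 + alpha * n_samples / n_features * ||H||^2."""
    n_samples, n_features = W.shape[0], H.shape[1]
    return alpha * (np.linalg.norm(W) ** 2
                    + n_samples / n_features * np.linalg.norm(H) ** 2)

def penalty_v2(W, H, alpha):
    """TomDLT's variant: alpha * n_features * ||W||^2 + alpha * n_samples * ||H||^2.

    Equal to n_features * penalty_v1, so it scales with n_features
    like the data-fit term does.
    """
    n_samples, n_features = W.shape[0], H.shape[1]
    return alpha * (n_features * np.linalg.norm(W) ** 2
                    + n_samples * np.linalg.norm(H) ** 2)
```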

jeremiedbb (Member, Author) commented Jul 9, 2021

Based on this discussion and on the discussion in #5296, here's what I propose:

  • Deprecate the alpha and regularization parameters.
  • Add 2 new parameters, alpha_W and alpha_H, to replace the 2 deprecated ones.
    They default to 0 to match the default of the deprecated ones.
    If they are non-zero and alpha is also non-zero, we use them instead of alpha.
    This way we can regularize only W or only H by setting one or the other to 0, instead of having a string parameter for that.
  • We solve 0.5 ||X - W.H||² + alpha_W * n_features * ||W||² + alpha_H * n_samples * ||H||², and change the docstrings accordingly.
  • If alpha_W and alpha_H are left at 0 and alpha is non-zero, we solve the unscaled problem as before.

I think the deprecation cycle is not too painful since, by default, there's no regularization, which means we can add these new parameters without changing the default behavior of the estimator.
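
A sketch of the parameter resolution this implies at fit time (hypothetical helper name and structure, not the actual implementation):

```python
def _resolve_l2_weights(alpha, alpha_W, alpha_H, n_samples, n_features):
    """Return the effective l2 weights applied to ||W||^2 and ||H||^2.

    Hypothetical sketch of the proposal: if the new parameters are set,
    use the scaled formulation; otherwise fall back to the deprecated,
    unscaled alpha.
    """
    if alpha_W != 0.0 or alpha_H != 0.0:
        # new, scaled behavior: both terms match the data-fit term's growth
        return alpha_W * n_features, alpha_H * n_samples
    # old, unscaled behavior (deprecated path)
    return alpha, alpha
```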

ogrisel (Member) commented Jul 9, 2021

We could also have NMF(alpha_W=1e-2, alpha_H="same") to make it possible to tune only alpha_W in a grid search, with the proposed scaling keeping the 2 regularization terms balanced whatever n_samples.

This would keep the convenience of tuning a single parameter, which should be enough most of the time, while still allowing alpha_W and alpha_H to be tuned independently for added flexibility.
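
A sketch of what usage could look like (incidentally, this is essentially the interface scikit-learn later adopted for NMF):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(0).rand(100, 30)  # any non-negative data

# one knob to tune: alpha_H="same" reuses the alpha_W value for the
# H penalty, with both terms scaled to stay balanced for any n_samples
model = NMF(n_components=5, alpha_W=1e-2, alpha_H="same", max_iter=500)
W = model.fit_transform(X)
H = model.components_
```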
