NMF regularization should not depend on n_samples #20484
Could you please put labels on the axes of the plots? ;)
It would be interesting to see the decomposition of the objective value as a stacked plot with the sum of the 3 terms (the data-fit term and the two regularization terms on W and H), and we could also add, as another line, the data-fit term computed on a held-out validation set. This would make it possible to tune the optimal alpha for the 2 strategies for different values of n_samples.
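For illustration only, here is a minimal sketch (not the code behind the plots in this thread) of how the 3 terms and a held-out data-fit term could be computed from a fitted model. It assumes the single `alpha` parameter of the NMF API under discussion and the simplified L2 objective written in the issue description; the data and sizes are random placeholders.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder non-negative data; X_val plays the role of a held-out set.
rng = np.random.RandomState(0)
X_train, X_val = rng.rand(200, 30), rng.rand(100, 30)

alpha = 0.1
model = NMF(n_components=5, alpha=alpha, l1_ratio=0.0, max_iter=500,
            random_state=0)
W = model.fit_transform(X_train)
H = model.components_

# The 3 terms of the simplified L2 objective from the issue description.
data_fit = 0.5 * np.linalg.norm(X_train - W @ H, "fro") ** 2
penalty_W = alpha * np.linalg.norm(W, "fro") ** 2
penalty_H = alpha * np.linalg.norm(H, "fro") ** 2

# Data-fit term on the held-out set: project X_val onto the learned
# components, then measure the reconstruction error.
W_val = model.transform(X_val)
val_data_fit = 0.5 * np.linalg.norm(X_val - W_val @ H, "fro") ** 2
print(data_fit, penalty_W, penalty_H, val_data_fit)
```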
Right. Indeed, the optimization formula at the top of your post should be written as a sum of terms over the sample vectors "x", which naturally leads to an "n_samples" factor appearing when the sum is expanded. NMF is sometimes formulated this way in the literature, and I agree that it is a cleaner way of doing things.

The n_features factor is, I believe, less conventional. I agree it makes sense; I remember having the same thoughts when doing research on applications of dictionary learning years ago. I am a bit more uneasy about this part, though it makes sense. Note that the problem is quite general: penalties in scikit-learn are implemented using the textbook formulation (though for NMF, the textbook formulation should probably include the factor n_samples). I believe that dictionary learning suffers from the same problem. Lasso should probably be implemented with lambda multiplied by log(p)/n, and the ridge lambda should be scaled by the leading eigenvalue of the covariance matrix, or by trace(Sigma)/n_features as a cheap proxy. All of these suffer from the same problem: changes in data shape will impact the optimal penalty coefficients. Arguably, the worst problem is the dependency on n_samples, since it depends on the amount of data for the same problem.

I don't know what our strategy should be in scikit-learn. We could slowly but surely move to these scalings; they would probably benefit end users. However, I can see students or people in the academic world being surprised, and I think that we should strive to be consistent in scikit-learn. I also foresee a brutal deprecation cycle.
See also #5296
@ogrisel I edited the plots to display the contributions of the 3 terms of the objective function. Everything is normalized by n_samples to be able to compare.
Here's a plot of the data fit computed on a validation set for models fitted on the 4 datasets with different n_samples, for several values of alpha, for the 2 strategies. With the current behavior, the data fit on the validation set begins to grow at different values of alpha depending on n_samples, while it does not with the proposed scaling.
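As a rough illustration of that experiment (a sketch, not the exact code used for these plots), the per-sample validation data fit can be traced over a grid of alpha values; repeating this for training sets with different n_samples gives the comparison described above. The single-alpha API under discussion is assumed, and all sizes and values are placeholders.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
H_true = rng.rand(5, 30)              # shared components
X_train = rng.rand(500, 5) @ H_true   # placeholder training set
X_val = rng.rand(200, 5) @ H_true     # held-out validation set

for alpha in [1e-3, 1e-2, 1e-1, 1.0]:
    model = NMF(n_components=5, alpha=alpha, l1_ratio=0.0,
                max_iter=1000, random_state=0)
    model.fit(X_train)
    W_val = model.transform(X_val)
    val_fit = 0.5 * np.linalg.norm(X_val - W_val @ model.components_, "fro") ** 2
    print(alpha, val_fit / X_val.shape[0])   # normalized per validation sample
```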
@GaelVaroquaux I agree the n_features scaling seems less conventional, but it's necessary to keep a balance between the H and W regularizations. One solution could be to split the regularization into two separate parameters, one for W and one for H. Another motivation to fix this is MiniBatchNMF, where the issue is even more problematic to me because we fit on mini-batches with potentially different sizes (partial_fit), which means that with the current strategy the regularization does not have the same impact for each batch. Since MiniBatchNMF is still being implemented, we can easily use the proposed strategy there, but I think it would be good to have a consistent strategy for the 2 estimators.
@TomDLT, glad we came to the same conclusion :) And actually your proposition is better.
Based on this discussion and on the discussion in #5296, here's what I propose:
I think the deprecation cycle is not too painful since by default there's no regularization, which means that we can easily add these new parameters without changing the default behavior of the estimator.
We could also have the regularization on H default to the same value as the regularization on W. This would make it possible to keep the convenience of tuning a single parameter, which should probably be enough most of the time, but also allow for tuning them separately.
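To make the idea concrete, here is a purely hypothetical sketch of such a parametrization; the names `alpha_W`, `alpha_H` and the `"same"` default are illustrative, not an existing API in this discussion.

```python
# Hypothetical parametrization, for illustration only: two regularization
# parameters, with the H one defaulting to the value used for W.
class SplitRegularization:
    def __init__(self, alpha_W=0.0, alpha_H="same"):
        self.alpha_W = alpha_W
        self.alpha_H = alpha_H

    def resolved(self):
        # "same" keeps the single-parameter convenience: H gets the same
        # regularization strength as W unless set explicitly.
        alpha_H = self.alpha_W if self.alpha_H == "same" else self.alpha_H
        return self.alpha_W, alpha_H

# Tuning a single parameter...
print(SplitRegularization(alpha_W=0.1).resolved())                # (0.1, 0.1)
# ...or tuning the two regularizations separately.
print(SplitRegularization(alpha_W=0.1, alpha_H=0.5).resolved())   # (0.1, 0.5)
```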
The objective function of NMF as described in the doc is (assuming l2 reg for simplicity)
0.5 ||X - W.H||² + alpha ||W||² + alpha ||H||²
Suppose I generate some datasets Xi as Wi @ H, based on the same set of components H but with different n_samples. I would expect that, for the same alpha, training an NMF model on these datasets with the same parameters would find similar components H.
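As an illustration only, here is a minimal sketch of this setup, assuming the single alpha parameter of the NMF API under discussion; the sizes, alpha value, and random data are placeholders, not the ones used for the plots below.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
n_components, n_features, alpha = 5, 30, 0.1
H_true = rng.rand(n_components, n_features)     # shared components H

for n_samples in [100, 500, 2000, 10000]:       # placeholder sizes
    W_true = rng.rand(n_samples, n_components)
    X = W_true @ H_true                          # X_i = W_i @ H
    model = NMF(n_components=n_components, alpha=alpha, l1_ratio=0.0,
                max_iter=500, random_state=0)
    W = model.fit_transform(X)
    H = model.components_
    # Objective from the formula above, normalized per sample, and the
    # norm of the fitted components (both discussed below).
    obj = (0.5 * np.linalg.norm(X - W @ H, "fro") ** 2
           + alpha * np.linalg.norm(W, "fro") ** 2
           + alpha * np.linalg.norm(H, "fro") ** 2)
    print(n_samples, obj / n_samples, np.linalg.norm(H, "fro"))
```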
However, in the objective function, the terms 0.5 ||X - W.H||² + alpha ||W||² depend on n_samples but alpha ||H||² doesn't. Thus when n_samples increases, the regularization on H has less and less impact and we start overfitting it.

Here is a plot of the objective function across iterations for different values of n_samples (datasets built as described above). The objective function is normalized per sample to be able to compare values. When n_samples grows we can reach lower and lower values of the objective function.

I also recorded the norm of the fitted components H in each case: 734.8, 2161.0, 3362.2, 3575.3. They increase when n_samples grows because the regularization on H has less and less impact (the norm of the original H is 1337.5). On the other hand, when n_samples decreases, the norm of H becomes smaller and smaller because the regularization on H becomes too strong w.r.t. the regularization on W.

I propose to rescale the regularization on H to alpha * n_samples / n_features * ||H||², such that the regularizations on H and W have the same order of magnitude.

Here is a plot of the objective function for the same datasets with the proposed change. It's now able to reach similar values of the objective function regardless of the number of samples.

The recorded norms of the fitted components are now: 1067.6, 1043.7, 1051.3, 1049.8. We are now able to retrieve similar components regardless of n_samples.
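For reference, a small sketch of the proposed objective in NumPy: only the penalty on H changes compared to the formula at the top, everything else is kept as-is.

```python
import numpy as np

def proposed_objective(X, W, H, alpha):
    """L2 NMF objective with the proposed rescaling of the H penalty."""
    n_samples, n_features = X.shape
    data_fit = 0.5 * np.linalg.norm(X - W @ H, "fro") ** 2
    penalty_W = alpha * np.linalg.norm(W, "fro") ** 2
    # The H penalty is multiplied by n_samples / n_features so that it
    # scales with the data shape like the data-fit and W terms.
    penalty_H = alpha * (n_samples / n_features) * np.linalg.norm(H, "fro") ** 2
    return data_fit + penalty_W + penalty_H
```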