[MRG] Add scaling to alpha regularization parameter in NMF #5296
Conversation
sklearn/decomposition/nmf.py (outdated diff)
```python
alpha_W = 0.
# The priors for W and H are scaled differently.
if regularization in ('both', 'components'):
    alpha_H = float(alpha) * n_samples
```
Here is the proper scaling of alpha. All other modifications in this PR are just a clean way to pass the same regularization to every solver.
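For readers skimming the diff, here is a minimal, self-contained sketch of what the scaled regularization could look like; the helper name `compute_scaled_alphas` and the exact handling of the `regularization` options are my own illustration, not necessarily the code in this PR.

```python
def compute_scaled_alphas(alpha, regularization, n_samples, n_features):
    """Return (alpha_W, alpha_H) scaled as proposed in this PR.

    The penalty on W is scaled by n_features and the penalty on H by
    n_samples, so that both factors are penalized comparably even when
    X is not square.
    """
    alpha_W = 0.
    alpha_H = 0.
    if regularization in ('both', 'transformation'):
        alpha_W = float(alpha) * n_features
    if regularization in ('both', 'components'):
        alpha_H = float(alpha) * n_samples
    return alpha_W, alpha_H
```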
(Force-pushed from c832609 to 8060b44.)
What is plotted? Is it a norm of the matrices? If so, which one?
While I agree that with your data-dependent scaling the transfer effect seems to be smaller than without it, I am worried that this kind of scaling could be non-standard in the literature. Could you please review popular implementations of NMF to check whether they do this too? Have you seen this kind of scaling in the literature?
The plots show the mean of all elements in W and H, i.e. the L1 norm normalized by the number of elements. It all came from @vene's comment (#4852 (comment)), and to avoid blocking that PR, I opened this separate PR to discuss it further.
As #4852 is merged now, what is the status of this PR?
Status: Controversial
New idea: another, slightly more standard method could be to normalize H (aka the dictionary) to unit norm at each iteration, yet I am not sure of the consequences on regularization.
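A rough sketch of what that could look like, assuming each row of H (each dictionary atom) is rescaled to unit L2 norm after an update while W absorbs the inverse scale so that the product `W @ H` is unchanged; this is only an illustration of the idea, not code from this PR.

```python
import numpy as np

def normalize_dictionary(W, H, eps=1e-12):
    # Rescale each row of H to unit L2 norm and push the inverse scale
    # into the corresponding column of W, leaving W @ H unchanged.
    norms = np.linalg.norm(H, axis=1)
    norms = np.maximum(norms, eps)  # avoid division by zero for dead atoms
    H = H / norms[:, np.newaxis]
    W = W * norms[np.newaxis, :]
    return W, H
```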
Coming back to this PR after all those years: it's true that the proposed normalization of the regularizer makes a lot of sense from the look of the experiment. Could you try the same experiments on some natural datasets (e.g. TF-IDF vectors or scientific signal data) to see whether the norms of the matrices shrink similarly? However, we would need a new option (in the public API only) to select between the 2 regularizer definitions, along with a FutureWarning.
I agree this looks very convincing, especially if, as Olivier asks, we see the pattern in real data too. This seems like more than a usability issue, since we don't provide the API to search for separate alphas for W and H, which would be needed to construct an equivalent problem, right? We could have a …
Closed by #20512
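For reference, if I read #20512 correctly, recent scikit-learn releases expose this as separate `alpha_W` and `alpha_H` parameters, with the W penalty scaled internally by `n_features` and the H penalty by `n_samples`, which matches the scaling proposed here. A hedged usage sketch:

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.RandomState(0).randn(100, 30))  # toy nonnegative data

# alpha_W penalizes W (scaled internally by n_features);
# alpha_H penalizes H (scaled internally by n_samples).
# alpha_H='same' reuses the value of alpha_W.
model = NMF(n_components=10, init='nndsvda', alpha_W=0.01, alpha_H='same',
            l1_ratio=0.5, max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_
```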
(I separated this modification from #4852 (comment) for further discussion about it.)

In NMF, I scaled the regularization parameter `alpha` with `n_samples` and `n_features`.

Indeed, without scaling, two problems appear:

- If X is not square (`n_samples != n_features`), the constraint is unbalanced: one of H or W goes to zero, and the other one, which is less penalized, increases to compensate.
- Even for a square X, the effect of a given `alpha` changes with the size of the data (the relevant scale grows with `sqrt(n_features * n_samples)`), which makes the appropriate value of `alpha` depend on the size of the data.

**Test to prove the point**: I tested several sizes for the input `X`, and plotted how the coefficients in W and H collapse with respect to the `alpha` parameter.

Without scaling alpha:


With proper scaling of alpha:

**Scaling used**: I used `alpha_W = alpha * n_features` and `alpha_H = alpha * n_samples`.
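For concreteness, a minimal sketch of this kind of experiment, written against the current scikit-learn API (which already applies the proposed scaling via `alpha_W`/`alpha_H`); the matrix sizes, alpha grid, and solver settings are placeholder choices, not the ones used for the plots above.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
alphas = np.logspace(-3, 1, 9)

for n_samples, n_features in [(50, 200), (200, 50), (200, 200)]:
    X = np.abs(rng.randn(n_samples, n_features))
    for alpha in alphas:
        model = NMF(n_components=5, init='nndsvda', l1_ratio=1.0,
                    alpha_W=alpha, alpha_H='same', max_iter=300,
                    random_state=0)
        W = model.fit_transform(X)
        H = model.components_
        # Mean absolute value of the entries, i.e. the L1 norm divided by
        # the number of elements (what the plots above report).
        print(n_samples, n_features, alpha,
              np.abs(W).mean(), np.abs(H).mean())
```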
**Conclusion**: the effect of the `alpha` parameter is much more consistent if we scale it.

As L1 and L2 regularization in NMF is brand new (#4852), this would not really break any code (before 0.17 at least).
But do we want to add this?
Is it consistent with other estimators in scikit-learn?
What do you think?
@vene @mblondel