
[MRG] Add scaling to alpha regularization parameter in NMF #5296


Closed
TomDLT wants to merge 2 commits

Conversation

@TomDLT
Member

TomDLT commented Sep 22, 2015

(I separated this modification from #4852 (comment) for further discussion.)
In NMF, I scale the regularization parameter alpha here with n_samples and n_features.


Indeed, without scaling, two problems appear:

  • The regularization penalizes the sum of the coefficients in W and H. If the number of elements in W and H is not the same (i.e. n_samples != n_features), the penalty is unbalanced: the more penalized of W and H goes to zero, and the other one, which is less penalized, grows to compensate.
  • The value of alpha that makes the coefficients of W and H collapse is proportional to sqrt(n_features * n_samples), so the appropriate scale of alpha depends on the size of the data (see the objective written out below).
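For reference, a sketch of the penalized objective as I understand the current implementation (written for the pure L1 case, l1_ratio = 1, to keep it short; the norms are entrywise):

    \min_{W, H \ge 0} \; \frac{1}{2} \| X - W H \|_F^2 + \alpha \left( \| W \|_1 + \| H \|_1 \right), \qquad W \in \mathbb{R}_+^{n_\text{samples} \times k}, \; H \in \mathbb{R}_+^{k \times n_\text{features}}

Since W has n_samples * k entries and H has k * n_features entries, the two penalty terms carry a different total weight as soon as n_samples != n_features, which is the imbalance described above.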

Test to prove the point: I tested several sizes for the input X, and plotted how the coefficients in W and H collapse with respect to the alpha parameter.

without scaling alpha:
[plot: noregul]
with properly scaled alpha:
[plot: regul]

Scaling used: I used alpha_W = alpha * n_features and alpha_H = alpha * n_samples.

Conclusion: the effect of the alpha parameter is much more consistent if we scale it.


As L1 and L2 regularization in NMF is brand new (#4852), this would not really break any code (before 0.17 at least).
But do we want to add this?
Is it consistent with other estimators in scikit-learn?
What do you think?

@vene @mblondel

alpha_W = 0.
# The priors for W and H are scaled differently.
if regularization in ('both', 'components'):
    alpha_H = float(alpha) * n_samples
@TomDLT
Member Author

Here is the proper scaling of alpha.
All other modifications in this PR are just a clean way to pass the same regularization to every solver.
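For context, a minimal sketch of what the symmetric scaling could look like, assuming the usual 'both'/'components'/'transformation' values of the regularization parameter (only the 'components' branch is visible in the excerpt above):

alpha_W = 0.
alpha_H = 0.
# Scale each penalty by the opposite dimension so that W and H carry a comparable total weight.
if regularization in ('both', 'components'):
    alpha_H = float(alpha) * n_samples    # H has n_components * n_features entries
if regularization in ('both', 'transformation'):
    alpha_W = float(alpha) * n_features   # W has n_samples * n_components entries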

TomDLT changed the title from "[WIP] Add scaling to alpha regularization parameter in NMF" to "[MRG] Add scaling to alpha regularization parameter in NMF" on Sep 28, 2015
@ogrisel
Member

ogrisel commented Oct 5, 2016

What is plotted? Is it a norm of the matrices? If so, which one?

@ogrisel
Member

ogrisel commented Oct 5, 2016

While I agree that with your change to add data-dependent scaling the transfer effect seems to be smaller than without it, I am worried that this kind of scaling could be non-standard in the literature. Could you please review popular implementations of NMF to check whether they do this too?

Have you seen this kind of scaling in the literature?

@TomDLT
Member Author

TomDLT commented Oct 5, 2016

The plots show the mean of all elements in W and H, i.e. the L1 norm normalized by the number of elements.
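(In code, for the nonnegative W returned by NMF, that is simply:

mean_w = W.sum() / W.size   # entrywise L1 norm divided by the number of entries, i.e. W.mean()

and likewise for H.)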

It all came from @vene's comment (#4852 (comment)), and to avoid blocking that PR, I opened this separate PR to discuss it further.
I have never seen any such scaling in the literature, yet as @vene puts it (#4852 (comment)), "it's a usability detail, [...] But usability is important". I guess this is the difference between a paper's implementation and a general-purpose package's implementation. At the same time, I understand the need to avoid non-standard behaviors.

@tguillemot
Contributor

As #4852 is now merged, what's the status of this PR?

@TomDLT
Member Author

TomDLT commented Dec 16, 2016

Status: Controversial
We have not settled yet on what we prefer:

  • standard (current) behavior: no scaling of alpha; the regularization depends on the size of the data and is not balanced between W and H.
  • non-standard (proposed) behavior: scaling of alpha; the regularization does not depend on the size of the data and is balanced between W and H.

New idea: another, slightly more standard, method could be to normalize H (aka the dictionary) to unit norm at each iteration, yet I am not sure of the consequences for the regularization.
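A rough sketch of that idea (the helper name is hypothetical; whether this interacts well with the penalty terms is exactly the open question):

import numpy as np

def renormalize_dictionary(W, H, eps=1e-12):
    # Rescale each component (row of H) to unit L2 norm and push the scale into W,
    # so that the reconstruction W @ H is unchanged.
    norms = np.linalg.norm(H, axis=1) + eps
    H = H / norms[:, np.newaxis]
    W = W * norms[np.newaxis, :]
    return W, H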

@ogrisel
Member

ogrisel commented Oct 14, 2020

Coming back to this PR after all those years. It's true that the proposed normalization of the regularizer makes a lot of sense from the look of the experiment. Could you try to run the same experiments on some natural datasets (e.g. TF-IDF vectors or scientific signal data) to see if the norms of the matrices shrink quickly near alpha=1, as is the case in your synthetic experiments?

However, we would need a new option to select between the two regularizer definitions (in the public API only), along with a FutureWarning that explains that the default behavior will change.

@vene
Member

vene commented Oct 20, 2020

I agree this looks very convincing, especially if, as Olivier asks, we see the pattern in real data too. This seems like more than a usability issue, since we don't provide an API to search for separate alphas for W and H, which would be needed to construct an equivalent problem, right?

We could have a scale_alpha=False|True arg and change its default value.
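A minimal sketch of the transition pattern suggested in the last two comments, assuming a hypothetical scale_alpha parameter (names are illustrative, not part of this PR's current diff):

import warnings

def _resolve_scale_alpha(scale_alpha):
    # 'warn' stands in for the unspecified default until the behavior actually changes.
    if scale_alpha == 'warn':
        warnings.warn(
            "The default of scale_alpha will change from False to True in a future "
            "release; set scale_alpha explicitly to silence this warning.",
            FutureWarning,
        )
        return False  # keep the current, unscaled behavior for now
    return scale_alpha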

@TomDLT
Member Author

TomDLT commented Oct 20, 2020

So I tried with two datasets:

  • Olivetti faces, (400, 4096), dense. I get the same results as with randomly generated data (not shown here).
  • 20 newsgroups, (11314, 39115), sparse. I get pretty different results (see below). The new scaling definitely improves the situation, but not quite as neatly as in the dense case. We probably need to add some heuristic that takes the sparsity level into account (one possible form is sketched after the plots).

without scaling (master branch):
[plot: 20news_master]
with scaling (this branch):
[plot: 20news_branch]
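Purely as an illustration of what such a sparsity-aware heuristic could look like (speculative, not part of this PR): scale by the average number of non-zero entries per row/column instead of by the full dimensions.

import scipy.sparse as sp

def scaled_alphas(X, alpha):
    # Hypothetical variant of the proposed scaling for sparse inputs.
    if sp.issparse(X):
        avg_nnz_per_sample = X.getnnz(axis=1).mean()    # effective n_features per sample
        avg_nnz_per_feature = X.getnnz(axis=0).mean()   # effective n_samples per feature
        return alpha * avg_nnz_per_sample, alpha * avg_nnz_per_feature   # (alpha_W, alpha_H)
    n_samples, n_features = X.shape
    return alpha * n_features, alpha * n_samples        # dense case: the scaling proposed here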

import itertools
import warnings

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF
from sklearn.exceptions import ConvergenceWarning

warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', ConvergenceWarning)


def load_20news():
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'))
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    X = vectorizer.fit_transform(dataset.data)
    return X


def load_faces():
    from sklearn.datasets import fetch_olivetti_faces
    X = fetch_olivetti_faces(shuffle=True).data
    return X


def load_random(n_samples, n_features):
    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(n_samples, n_features))
    return X


dataset = "random"
if dataset == "20news":
    X = load_20news()
    n_samples_list = [100, 1000, 10000]
    n_features_list = [100, 1000, 10000]
elif dataset == "faces":
    X = load_faces()
    n_samples_list = [4, 40, 400]
    n_features_list = [4, 40, 400]
elif dataset == "random":
    X = load_random(1000, 1000)
    n_samples_list = [10, 100, 1000]
    n_features_list = [10, 100, 1000]

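# Sweep the regularization strength over 10 decades.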
alphas = np.logspace(-7, 3, num=100)

estimator = NMF(n_components=10, solver='cd', init='random',
                tol=1e-4, max_iter=10, random_state=0, alpha=0., l1_ratio=0.,
                verbose=0, shuffle=True)

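# One subplot per (n_samples, n_features) sub-matrix of X: plot the mean coefficient
# of W and H against alpha.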
fig, axes = plt.subplots(3, 3, figsize=(20, 15))
for (n_samples,
     n_features), ax in zip(itertools.product(n_samples_list, n_features_list),
                            axes.ravel()):
    print('%d - %d' % (n_samples, n_features))

    mean_w = np.zeros(len(alphas))
    mean_h = np.zeros(len(alphas))
    for ii, alpha in enumerate(alphas):
        # alpha *= np.sqrt(n_samples * n_features)
        estimator = estimator.set_params(alpha=alpha)
        W = estimator.fit_transform(X[:n_samples, :n_features])
        H = estimator.components_
        mean_w[ii] = W.mean()
        mean_h[ii] = H.mean()

    ax.set_xscale('log')
    ax.plot(alphas, mean_w, label='W')
    ax.plot(alphas, mean_h, label='H')
    ax.legend(loc=0)
    ax.set_title('X.shape = (%d, %d)' % (n_samples, n_features))

fig.suptitle('Mean coefficient, w.r.t alpha L2 regularisation', size=16)
fig.savefig(f'{dataset}_master.png')
plt.show()

@TomDLT
Member Author

TomDLT commented Jul 15, 2021

Closed by #20512

TomDLT closed this Jul 15, 2021