
[MRG+1] refactor NMF and add CD solver #4852


Merged
merged 1 commit into scikit-learn:master on Sep 22, 2015

Conversation

@TomDLT
Member

TomDLT commented Jun 11, 2015

This PR is a first part of what is discussed in #4811.
It includes:

  • refactor ProjectedGradientNMF into NMF, with a 'proj-grad' solver
  • include random initialization in _initialize_nmf()
  • add ElasticNet-like regularization (with parameters alpha and l1_ratio; see the sketch after this list)
  • add a Coordinate Descent solver (from @mblondel), with Cython for the critical part
  • update some tests and bench_plot_nmf (which looks terrible btw)
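
For context, here is a minimal sketch of how an ElasticNet-like penalty with alpha and l1_ratio can be added to the squared Frobenius NMF objective. This is illustrative only, not the PR code, and the exact weighting of each term (e.g. the 1/2 factors) is an assumption:

```python
import numpy as np

def nmf_objective(X, W, H, alpha=0.0, l1_ratio=0.0):
    # Illustrative sketch, not the PR implementation: alpha scales the whole
    # penalty, l1_ratio splits it between an L1 part (sparsity) and an L2
    # part, as in ElasticNet.
    l1_reg = alpha * l1_ratio
    l2_reg = alpha * (1.0 - l1_ratio)
    loss = 0.5 * np.sum((X - np.dot(W, H)) ** 2)          # squared Frobenius loss
    penalty = (l1_reg * (np.abs(W).sum() + np.abs(H).sum())
               + 0.5 * l2_reg * ((W ** 2).sum() + (H ** 2).sum()))
    return loss + penalty
```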

In some future PR, I will:

  • include MM (cf. comment) in order to support more loss functions (I-divergence and IS-divergence)
  • investigate NMF initialization further
  • ...

Please tell me what you think :)

@amueller
Member

I think it's awesome! And Travis is failing ;)

@TomDLT
Member Author

TomDLT commented Jun 11, 2015

Thanks! Only 2/3 are failing :p

@TomDLT
Member Author

TomDLT commented Jun 11, 2015

Benchmark on the full 20news dataset (sparse), no regularization, n_components=10:
[figure: 20news_noregul]
Benchmark on the full Olivetti faces dataset (dense), no regularization, n_components=10:
[figure: faces_noregul]

@amueller
Member

nice. I don't have time to review though.

return _update_cdnmf_fast(Ht, WtW, WtX, alpha, l1_ratio, False)


def fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001,
Member

Use a private function.

Member Author

In the discussion on NMF, you said:

Maybe we can keep the class API simple and expose a function with more options

This function adds the choice to regularize W, H, or both, whereas the class does not have this option and regularizes both. Is that what you meant?
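
For illustration, here is one way such an option could look; the parameter name and values below are hypothetical, not necessarily what this PR exposes:

```python
def regularization_weights(alpha, l1_ratio, regularization='both'):
    # Hypothetical sketch: choose which factor(s) the ElasticNet-like penalty
    # applies to. 'both' is what the class would always use.
    l1_reg = alpha * l1_ratio
    l2_reg = alpha * (1.0 - l1_ratio)
    l1_W = l2_W = l1_H = l2_H = 0.0
    if regularization in ('both', 'W'):
        l1_W, l2_W = l1_reg, l2_reg
    if regularization in ('both', 'H'):
        l1_H, l2_H = l1_reg, l2_reg
    return l1_W, l2_W, l1_H, l2_H
```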

Member

The idea was to do as in ridge.py with ridge_regression and Ridge (i.e. the public function is common to all solvers).

Member Author

(i.e. the public function is common to all solvers).

Ok I'll do that
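
For readers unfamiliar with that pattern, here is a rough sketch of the ridge_regression / Ridge arrangement transposed to NMF. The function and helper names and the stubbed solver bodies are illustrative assumptions, not the code of this PR:

```python
import numpy as np

def _fit_projected_gradient(X, W, H, max_iter=200):
    # solver body omitted in this sketch
    return W, H

def _fit_coordinate_descent(X, W, H, max_iter=200):
    # solver body omitted in this sketch
    return W, H

def non_negative_factorization(X, W, H, solver='cd', max_iter=200):
    """Public function, common to all solvers (analogous to ridge_regression)."""
    if solver == 'cd':
        return _fit_coordinate_descent(X, W, H, max_iter=max_iter)
    if solver == 'pg':
        return _fit_projected_gradient(X, W, H, max_iter=max_iter)
    raise ValueError("Invalid solver: %r" % solver)

class NMF(object):
    """Estimator wrapper (analogous to Ridge) delegating to the public function."""
    def __init__(self, n_components=2, solver='cd', max_iter=200):
        self.n_components = n_components
        self.solver = solver
        self.max_iter = max_iter

    def fit_transform(self, X, random_state=0):
        rng = np.random.RandomState(random_state)
        n_samples, n_features = X.shape
        W = np.abs(rng.randn(n_samples, self.n_components))
        H = np.abs(rng.randn(self.n_components, n_features))
        W, H = non_negative_factorization(X, W, H, solver=self.solver,
                                          max_iter=self.max_iter)
        self.components_ = H
        return W
```

The extra options discussed above (e.g. which factor to regularize) would then live on the function, while the class keeps a small, fixed set of parameters.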

@mblondel
Member

I think we should remove projected gradient by v0.19. Otherwise, with your plan to add more loss functions, this will be way too much code to maintain. @vene, is it fine with you?

@TomDLT
Member Author

TomDLT commented Jun 12, 2015

I think we should remove projected gradient by v0.19

I think that was the plan; I will add a deprecation warning.

@vene
Member

vene commented Jun 12, 2015

Based on the plots above, which aren't surprising, yes, I fully agree that if we merge CD we can deprecate PG.

for when sparsity is not desired)
'random': non-negative random matrices, scale with:
sqrt(X.mean() / n_components)
'uniform': matrices filled with the same value:
Member

I don't see where the code path for 'uniform' currently is. Is 'uniform' supposed to be different from 'random'?
Also, @mblondel's gist used a random subset of the rows/cols of the input matrix as initialization; would this make sense? We could run some benchmarks for this too.

Member Author

'uniform' was initialization with the same value everywhere, but I removed it in this PR.
I will open a specific PR to address the initialization schemes and run some benchmarks.
I will also look at the random subset of rows/cols.
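
As a side note, here is a small sketch of the 'random' scheme from the docstring above (illustrative, not the exact _initialize_nmf code): scaling uniform random factors by sqrt(X.mean() / n_components) puts the entries of W H on the same order of magnitude as the entries of X.

```python
import numpy as np

def random_nmf_init(X, n_components, random_state=0):
    # Illustrative sketch of the 'random' initialization scaling quoted above.
    rng = np.random.RandomState(random_state)
    scale = np.sqrt(X.mean() / n_components)
    W = scale * rng.rand(X.shape[0], n_components)
    H = scale * rng.rand(n_components, X.shape[1])
    # With entries uniform on [0, scale], E[(W H)_ij] = n_components * (scale / 2) ** 2
    # = X.mean() / 4, i.e. the same order of magnitude as the data.
    return W, H
```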

raise ValueError('Array with wrong shape passed to %s. Expected %s, '
'but got %s ' % (whom, shape, np.shape(A)))
check_non_negative(A, whom)
if np.max(A) == 0:
Member

I remember that in one of the papers they say full rows/cols of zeros can cause problems; is that the case in this implementation?

Member Author

With a full col/row of zeros in W or H, WtW or HHt will have a zero on its diagonal, which is a problem since we would divide by zero.
To avoid that, I added a small L2 regularization (1e-15), which avoids the division by zero.

Member

I think this is not the right way to do it. In this case, the optimal solution is probably to not update the coefficient.

Member

i.e., skip the current iteration

Member Author

Ok I changed it

int max_inner, double tol):
cdef double violation = 0.
cdef double pg
cdef int n_length = W.shape[0]
Member

n_features?

Member Author

W is (n_samples, n_components)
H.T is (n_features, n_components)
I changed the name to be clearer:
cdef int n_samples = W.shape[0] # n_features for H update
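
To make this concrete, and to tie in the zero-diagonal discussion above, here is a pure-Python sketch of the coordinate-descent updates. It is illustrative only (names and structure are assumptions, and regularization is omitted), not the Cython code in this PR:

```python
import numpy as np

def _update_rows(W, HHt, XHt):
    """One pass of cyclic coordinate descent over the rows of W for the
    objective 0.5 * ||X - W H||_F^2 (sketch only, no regularization)."""
    n_rows, n_components = W.shape   # n_samples for the W update, n_features for H.T
    for i in range(n_rows):
        for t in range(n_components):
            if HHt[t, t] == 0.0:
                # A full row/col of zeros makes the curvature zero:
                # skip the update instead of dividing by zero.
                continue
            grad = np.dot(W[i], HHt[:, t]) - XHt[i, t]
            W[i, t] = max(0.0, W[i, t] - grad / HHt[t, t])

def one_outer_iteration(X, W, H):
    # W update: the first dimension of the updated matrix is n_samples.
    _update_rows(W, np.dot(H, H.T), np.dot(X, H.T))
    # H update: the same routine is applied to H.T (a writable view), so its
    # first dimension is n_features here -- hence the renamed variable.
    _update_rows(H.T, np.dot(W.T, W), np.dot(X.T, W))
```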

@TomDLT force-pushed the nmf branch 2 times, most recently from f7041cc to ecd0415 on June 22, 2015 at 09:42
@TomDLT
Member Author

TomDLT commented Jun 22, 2015

Benchmark of greedy coordinate descent (GCD) vs coordinate descent (CD):
[figure: 20news, n_components=10 (20news_10)]
[figure: 20news, n_components=30 (20news_30)]
[figure: faces, n_components=10 (faces_10)]
[figure: faces, n_components=30 (faces_30)]

(edit) Conclusion:
For the 20news dataset, GCD and CD do not reach the same minima, but they seem comparable.
For the faces dataset, GCD is faster than CD (n_components=10) or on par with it (n_components=30).

The speed-up is not as large as expected, but GCD does seem a bit better than CD (as expected).
I will look into further speed improvements and benchmark on a larger dataset.

@agramfort
Member

agramfort commented Jun 22, 2015 via email

@vene
Member

vene commented Jun 22, 2015

I notice that the greedy curves are sometimes longer on the X axis, even if they reach the same loss. Is this because the convergence criterion fires later, or because iterations take longer? The paper seemed to advertise the greedy selection as being very cheap computationally.

@TomDLT
Member Author

TomDLT commented Jun 22, 2015

I notice that the greedy curves are sometimes longer on the X axis, even if they reach the same loss. Is this because the convergence criterion fires later, or because iterations take longer?

Iterations take longer. I used the same number of iterations for CD and GCD, with a low tolerance, so as to stop only when the maximum number of iterations is reached.

@mblondel
Member

The improvements are not so impressive overall (~1 s gain). The author released the source code in MATLAB (with the computationally expensive parts written in C); you could have a look to check for potential tricks.

In the lower-right plot of the n_components=10 case (news20), the loss difference looks very bad. I wonder if it's not just a scale problem.

l2_reg = (1 - l1_ratio) * alpha

# L2 regularization corresponds to increasing the diagonal of HHt
HHt = np.dot(Ht.T, Ht) + np.eye(n_components) * l2_reg
Member

I noticed that you used fast_dot in some places. You could potentially use it here too.

Member Author

It was to be sure to have a C-contiguous array, but since I removed the check in Cython, I will change it.
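
As an aside on the diff above: adding l2_reg to the diagonal of HHt is exactly the L2 (ridge) part of the penalty, since W (HHt + l2_reg I) - XHt = (W HHt - XHt) + l2_reg W. A small self-contained check (not the PR code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.abs(rng.randn(20, 30))
W = np.abs(rng.randn(20, 5))
H = np.abs(rng.randn(5, 30))
l2_reg = 0.1

HHt = np.dot(H, H.T)
XHt = np.dot(X, H.T)
grad_plain = np.dot(W, HHt) - XHt                             # gradient without L2
grad_with_diag = np.dot(W, HHt + l2_reg * np.eye(5)) - XHt    # increased diagonal
assert np.allclose(grad_with_diag, grad_plain + l2_reg * W)   # same as adding l2_reg * W
```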

:class:`NMF` can also be initialized with random non-negative matrices, by
passing an integer seed or a ``RandomState`` to :attr:`init`.
:class:`NMF` can also be initialized with correctly scaled random non-negative
matrices by setting :attr:`init="random"`. An integer seed or a ``RandomState`` can also be passed to :attr:`random_state` to control reproducibility.
Member

line too long

@agramfort
Member

That's it for me.

Don't forget the what's new page.

Nice job @TomDLT!

@mblondel
Member

Sorry to be late on this, but could we address the scaling of alpha in another PR? I agree that scaling alpha could be useful, but I think it should probably be an option (default: False) for backward compatibility and consistency with the rest of scikit-learn. Also, the motivation for the way it is currently scaled is not completely obvious to me. My first intuition would have been to divide the loss term by n_samples x n_features. Although @TomDLT has some early experiments, it would be nice to take our time to compare different scalings.
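
Written out, the candidate scaling mentioned above (dividing the loss term by n_samples x n_features) would look roughly like this; the exact constant factors are an assumption:

```latex
\min_{W \ge 0,\, H \ge 0}\;
\frac{\|X - WH\|_F^2}{2\, n_{\mathrm{samples}}\, n_{\mathrm{features}}}
\;+\; \alpha \rho \left(\|W\|_1 + \|H\|_1\right)
\;+\; \frac{\alpha (1 - \rho)}{2} \left(\|W\|_F^2 + \|H\|_F^2\right),
\qquad \rho = \text{l1\_ratio}.
```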

@TomDLT
Member Author

TomDLT commented Sep 21, 2015

Comments addressed, thanks.

I also removed the alpha scaling from this PR, as suggested by @mblondel, and I will open a new PR for further discussion about it.

@agramfort
Member

+1 for merge on my side.

@agramfort changed the title from [MRG] refactor NMF and add CD solver to [MRG+1] refactor NMF and add CD solver on Sep 21, 2015
@mblondel
Member

+1 too

@amueller merged commit db52aac into scikit-learn:master on Sep 22, 2015
@amueller
Member

🍻 Thanks @TomDLT and @mblondel this is awesome!

@agramfort
Member

agramfort commented Sep 22, 2015 via email

@TomDLT
Member Author

TomDLT commented Sep 22, 2015

Cool 🍻 !

@amueller
Member

amueller commented Nov 9, 2015

Sorry if I overlooked it, but did you commit/post your benchmark script using the "real" datasets? The current one only uses synthetic data, right?

@TomDLT
Member Author

TomDLT commented Nov 9, 2015

I benchmarked it with "real" datasets, I think: 20 newsgroups, Olivetti faces, RCV1 and MNIST.
My script is available here, but it is not in scikit-learn.

I don't find benchmarks/bench_plot_nmf.py very useful, by the way.
We could maybe replace it with a shorter version of my script.
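
For reference, here is a hedged sketch of what a short benchmark on a "real" dataset could look like, in the spirit of the script mentioned above (not that script itself). It assumes an NMF that accepts the solver names from this PR; 'pg' was later deprecated and removed, so on recent scikit-learn only 'cd' (and 'mu') are available:

```python
import time

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

X = fetch_olivetti_faces().data                       # dense, non-negative data

for solver in ('pg', 'cd'):                           # adapt to the installed version
    model = NMF(n_components=10, solver=solver, tol=1e-4, max_iter=200,
                random_state=0)
    t0 = time.time()
    W = model.fit_transform(X)
    H = model.components_
    err = np.linalg.norm(X - np.dot(W, H))            # Frobenius reconstruction error
    print("%s: %.2f s, error %.2f" % (solver, time.time() - t0, err))
```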

@amueller
Member

amueller commented Nov 9, 2015

I think that would be great. Would you be interested in doing that? Otherwise, maybe open an issue?

@TomDLT
Member Author

TomDLT commented Nov 9, 2015

I can do that

@TomDLT
Member Author

TomDLT commented Nov 10, 2015

see #5779
