
[WIP] NMF estimator based on the Lee and Seung algorithm #2540


Closed
larsmans wants to merge 3 commits into scikit-learn:master from larsmans:lee-seung-nmf

Conversation

@larsmans (Member)

Based on code by @vene. Needs tests and docs. The benchmark needs to be updated, and usage of the NMF class elsewhere in the scikit must be checked.

Question: does it make sense at all to bootstrap this thing using NNDSVD?

@vene (Member) commented Oct 20, 2013

I think the initialization is not just about speed of convergence, but actually about where the algorithm converges to, as the problem is non-convex. You can check the reconstruction error for different initialization methods.

I will take a look at this exciting PR soon :)
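For illustration, a minimal sketch of that comparison using today's scikit-learn NMF API (the init options and the reconstruction_err_ attribute are assumed from the current estimator, not from the code in this PR):

```python
# Compare reconstruction error for different initializations; since the problem
# is non-convex, different inits can converge to different local optima.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(200, 50))  # non-negative toy data

for init in ("random", "nndsvd", "nndsvda"):
    model = NMF(n_components=10, init=init, random_state=0, max_iter=500)
    model.fit(X)
    print(init, model.reconstruction_err_)
```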


@amueller (Member)

I really thought that one would be slower... we should really test for a regression in the projected gradient one...

@mblondel (Member)

For NMF by the multiplicative method with KL divergence, see #1348

@larsmans (Member, Author)

Looks like that PR can be merged into this one as a loss argument to NMF. I'll see if I can find time to have a shot at it.

@ogrisel (Member) commented Oct 21, 2013

Would be interesting to experiment with a mini-batch SGD variant as well. It might even be faster than the batch gradient method, and furthermore it could accept a partial_fit method to make it possible to train out-of-core topic models.

However I am not sure whether / how the mini-batch size will impact the convergence.

@larsmans (Member, Author)

@ogrisel My thought exactly. Wikipedia has a page about online NMF with a few references; I wanted to read those before hacking up my own online version.

@vene (Member) commented Oct 21, 2013

How about putting this on the pycon.fr sprint list?

@vene (Member) commented Oct 21, 2013

unless of course @larsmans uses his magic and finishes it up first ;)


@larsmans (Member, Author)

When's that?

@ogrisel (Member) commented Oct 21, 2013

> How about putting this on the pycon.fr sprint list?

I would be glad to try to hack on this at pycon.fr even if @larsmans ends up implementing the final version. It's good to have several people in the project with a good practical understanding of the algos.

@ogrisel (Member) commented Oct 21, 2013

> When's that?

Next week, Monday and Tuesday in the beautiful city of Strasbourg: http://www.pycon.fr/2013/

@larsmans (Member, Author)

Yeah, I'd googled it :)

I'd very much appreciate any work on the NMF solvers; they can solve interesting text mining problems if they're fast enough.

@ogrisel (Member) commented Oct 21, 2013

+1. Also, sparse online solvers for NMF look like an interesting building block for recsys. Dense online NMF could be useful for signal processing, like audio classification / decomposition from STFT data.

@mblondel (Member)

The scikit is starting to become a bit inconsistent in the way we create classes.

  • one model, many learning algorithms (e.g., LinearSVC, LogisticRegression, Ridge)
  • one learning algorithm, many models / objectives (e.g., SGDClassifier)
  • one model / objective, one learning algorithm (e.g., Lasso)

I tend to prefer the first (the model parameters are the same, only the way we fit them changes), although from a code organization point of view the second can be more convenient.

I would like to leave the door open to add a coordinate descent solver for NMF. All major solvers (projected gradient, multiplicative method, CD) support both KL and squared losses; it's just a matter of implementing them.

@larsmans (Member, Author)

I've noticed this before, but then the machine learning community is inconsistent and sloppy in distinguishing problems, models and algorithms. IMHO, we should make pragmatic choices with the number of hyperparameters (and valid combinations of them) as a guideline, so as not to bother users with parameters that don't do anything. I've set up this NMF as a separate class without all the extra parameters that PG-NMF has.

@mblondel (Member)

> IMHO, we should make pragmatic choices with the number of hyperparameters (and valid combinations of them) as a guideline

Sounds good.

Would you care to rename NMF to MultiplicativeNMF?

@larsmans (Member, Author)

Good idea. But then what do we do with the current NMF synonym for ProjectedGradientNMF? Deprecate it? Leave it in?

@GaelVaroquaux (Member)

> • one model, many learning algorithms (e.g., LinearSVC, LogisticRegression, Ridge)
>
> I tend to prefer the first

So do I, however,

> (the model parameters are the same, only the way we fit them changes)

that's the limit: sometimes the algorithm heavily conditions the parameters. That's when I would break things up into different objects.

@GaelVaroquaux (Member)

>> IMHO, we should make pragmatic choices with the number of hyperparameters (and valid combinations of them) as a guideline
>
> Sounds good.

I feel the same.

> Would you care to rename NMF to MultiplicativeNMF?

Ouch, breakage again...

@larsmans (Member, Author)

No breakage. This PR actually breaks some code by reusing the NMF identifier for the new estimator. It's currently a synonym for the other estimator, see decomposition/nmf.py, near the bottom.

@mblondel (Member)

> Deprecate it? Leave it in?

I would deprecate it.

@GaelVaroquaux (Member)

> I would deprecate it.

Yes, but... It's really tiring to know that scikit-learn is not reliable in the mid term (6 months to a year) in terms of forward compatibility. We will have to stop doing this. We are breaking people's code for cosmetic reasons. It is fine when we are a young project focusing on getting guidelines right, but it's not fine for a mature project.

@larsmans (Member, Author)

I'll leave it in for this PR but remove all references to it in the documentation. We were never consistent about addressing that class, sometimes calling it NMF, sometimes ProjectedGradientNMF.

@ogrisel (Member) commented Oct 21, 2013

> Yes, but... It's really tiring to know that scikit-learn is not reliable in the mid term (6 months to a year) in terms of forward compatibility. We will have to stop doing this. We are breaking people's code for cosmetic reasons. It is fine when we are a young project focusing on getting guidelines right, but it's not fine for a mature project.

We could also just increase the length of the deprecation cycle to 1 or 2 years instead of 2 releases, to transition smoothly from a young project to a mature one while not freezing API or naming mistakes in stone.

@larsmans (Member, Author)

Docs done, now for the tests.

@larsmans (Member, Author)

Hm. The next problem is that some of the tests are quite unstable. Inserting more tests changes the RNG's state, and some of the PG-NMF tests start failing.

@ogrisel (Member) commented Oct 21, 2013

We should really seed each test explicitly to make them independent of the running order.
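For illustration, a minimal sketch of what that looks like (a hypothetical test, not one from the scikit-learn suite):

```python
# Seed each test with its own RandomState so that adding or reordering tests
# cannot change the data another test sees.
import numpy as np
from sklearn.decomposition import NMF


def test_nmf_reconstruction_is_reasonable():
    rng = np.random.RandomState(42)   # explicit per-test seed
    X = np.abs(rng.randn(30, 10))     # non-negative toy data
    model = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
    W = model.fit_transform(X)
    # The factorization should reduce the Frobenius error well below ||X||.
    assert np.linalg.norm(X - W @ model.components_) < np.linalg.norm(X)
```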

@larsmans (Member, Author)

I've done that for the tests where it's really a problem. I haven't run all of them individually, but the rest seem stable. Tests done; next up: check all examples that use NMF.

@mblondel (Member)

> Yes, but... It's really tiring to know that scikit-learn is not reliable in the mid term (6 months to a year) in terms of forward compatibility. We will have to stop doing this. We are breaking people's code for cosmetic reasons. It is fine when we are a young project focusing on getting guidelines right, but it's not fine for a mature project.

Indeed, but do we want to keep two different names to refer to the same class?

@mblondel (Member)

Sorry, I thought you meant the projection onto the positive semidefinite cone (the set of all positive semidefinite matrices), which is obviously not the same as the non-negative orthant.

@vene (Member) commented Oct 25, 2013

Something that keeps bugging me in NMF is regularization/sparsity constraints. IIRC the literature says that while the unregularized formulation is naturally somewhat sparse for nice datasets, sparsity usually needs to be enforced explicitly.

As far as I know the multiplicative update method cannot handle this, right?

The implementation in PG-NMF is currently very hackish and uses hstacks to adjust the cost function (sorry, it was some of my first numerical code!), and I also think it causes some slowdown and wasted memory.

It would be nice to hear some experience from the field regarding L1 or L2 regularization.

@mblondel (Member)

I think it's always a good idea to use regularization. In addition to making solutions sparser, it also makes the problem better conditioned. The projected gradient update without regularization is

    W = max(W - step_size * gradient, 0)

Now, if you use alpha * ||W||_1 (i.e., the L1 norm) as regularization, the gradient of the regularization term is a matrix with alpha everywhere, so the update becomes

    W = max(W - step_size * gradient - step_size * alpha, 0)

So it's easy to see that it is more likely that weights get truncated to zero.

I cannot comment for the multiplicative method as I don't know the method well.
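For illustration, a sketch of that step in NumPy (not the scikit-learn solver; the data-fit gradient shown is for the halved squared loss 0.5 * ||X - WH||_F^2):

```python
import numpy as np

def pg_step(X, W, H, step_size=1e-3, alpha=0.0):
    """One projected gradient update of W for 0.5 * ||X - WH||_F^2 + alpha * ||W||_1."""
    grad = (W @ H - X) @ H.T      # gradient of the data-fit term w.r.t. W
    grad = grad + alpha           # gradient of alpha * ||W||_1 on the non-negative orthant
    return np.maximum(W - step_size * grad, 0.0)   # project back onto W >= 0
```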

@larsmans (Member, Author)

OK, I could use some help here. I can't seem to get the common tests to pass, since (1) they pick up _BaseNMF even when there's an __all__ in the nmf.py file, and (2) the test that checks whether fit(X).transform(X) == fit_transform(X) fails for MultiplicativeNMF but not ProjectedGradientNMF. I don't understand this: my gut feeling is that it should fail in both cases or in neither, because they're doing the same thing.

I wrote a custom MultiplicativeNMF.transform that uses the multiplicative update algorithm with a fixed H, but this doesn't solve the problem.
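For reference, a minimal sketch of that kind of transform, using the plain Lee–Seung multiplicative rule for the Frobenius loss with the learned components H held fixed (an illustration of the idea, not the code in this PR):

```python
import numpy as np

def transform_fixed_H(X, H, n_iter=200, eps=1e-10):
    """Solve for W with H (the learned components) held fixed."""
    rng = np.random.RandomState(0)
    W = np.abs(rng.randn(X.shape[0], H.shape[0]))
    for _ in range(n_iter):
        # Lee-Seung update: W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W
```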

@coveralls

Coverage Status

Coverage remained the same when pulling 670cd6b on larsmans:lee-seung-nmf into 101f0ee on scikit-learn:master.

@larsmans (Member, Author)

@mblondel The L1-regularized version of the multiplicative algorithm is given by Hoyer 2002.
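For what it's worth, a hedged sketch of the form such an update is usually written in (the L1 penalty simply enters the denominator of the multiplicative rule; check Hoyer's paper for the exact algorithm and the companion W update):

```python
import numpy as np

def update_H_l1(X, W, H, l1_reg=0.1, eps=1e-10):
    """One multiplicative update of H for 0.5 * ||X - WH||_F^2 + l1_reg * sum(H)."""
    # H <- H * (W^T X) / (W^T W H + l1_reg)
    return H * (W.T @ X) / (W.T @ W @ H + l1_reg + eps)
```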

Inline review on this diff hunk:

    W *= update
    maxW = np.max(update)

    if i % 10 == 0:

Review comment (Member):

Isn't the 10 here pretty arbitrary? The max can be slow for large dense matrices, right?

Reply from @larsmans (Member, Author):

Yes, it's arbitrary, and I guess it doesn't skip much since I moved the max outside the check to reuse the update variable (there were updateH and updateW before). I'd have to do some timing to figure out which one is better.

@vene (Member) commented Oct 29, 2013

An example that plots reconstruction error over time for projected gradient and the multiplicative method would be nice. This can be done by setting the number of iterations to 1 and calling fit multiple times with the result of the last call as initialization for W and H. A callback mechanism would be nicer, but we have never agreed on an API.

Unfortunately the current API for NMF doesn't support warm restarts. It would be trivial to add but should we?

@vene (Member) commented Oct 29, 2013

More updates on the text benchmark and the faces benchmark: it turns out PG isn't so bad, but it's not fair to set the same tol for both algorithms, as they measure different things. A higher tol for PG leads to very good solutions and is faster than the multiplicative method.

For a fair comparison we need to somehow adjust the convergence criteria so that they match. I think the PG-NMF convergence criterion is the messed-up one, as you can see that with random init it sometimes "converges" to noise very quickly.

@vene (Member) commented Oct 29, 2013

I forgot to mention: the current snapshot of the benchmarks was run against a local branch where I made the multiplicative NMF check for convergence at every iteration, with the final number of iterations saved under self.n_iter_.

@larsmans (Member, Author)

@vene Do you think I should give up on this PR?

@vene (Member) commented Oct 29, 2013

Definitely not, but I think I found a nasty hidden bug in the projected gradient stopping condition. I'm still digging, and I think we should bench on MovieLens or some different scenario.


@vene (Member) commented Oct 29, 2013

Also from the benchmarks it seems that the multiplicative update is better at the moment for sparse multidimensional data, but we need to figure out how to properly compare the tols.


@mblondel (Member)

> Unfortunately the current API for NMF doesn't support warm restarts. It would be trivial to add but should we?

The convergence plot would also be useful for checking correctness of the implementations. How did you generate the convergence plots in your notebooks?

@vene (Member) commented Oct 30, 2013

The "dumb" way: for max_iter in range(...).

On the other hand this allows you to use the stopping condition, otherwise
with warm restarts I guess you will keep doing that one iteration, or
anyway more changes to the code would be needed.


@mblondel (Member)

I think I had a similar problem in lightning. When you use warm-start, the first iteration is already pretty good. Since the stopping criterion is relative to the first iteration, it therefore becomes too strict. One solution would be to store the first iteration's projected gradient norm in self.init_norm_.
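A tiny sketch of that idea (illustration only; init_norm_ is the hypothetical attribute suggested above, not an existing scikit-learn one):

```python
def should_stop(proj_grad_norm, model, tol=1e-4):
    """Stop when the projected gradient norm is small relative to the very first fit."""
    if not hasattr(model, "init_norm_"):
        # Remember the norm from the first (cold) fit, so warm-started fits
        # don't tighten the criterion by measuring against an already-good start.
        model.init_norm_ = proj_grad_norm
    return proj_grad_norm < tol * model.init_norm_
```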

@mblondel (Member)

The "dumb" way: for max_iter in range(...).

Alright, I guess this is the easiest way to do it for now. We can always add warm-start later (the dumb way doesn't scale to large datasets).
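For illustration, the kind of loop being described, written against today's scikit-learn NMF API (the estimator name and attributes here are assumptions, not this PR's MultiplicativeNMF):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(200, 50))  # non-negative toy data

iters = list(range(10, 310, 30))
errors = []
for n in iters:
    # Refit from scratch each time ("the dumb way"); warm starts would avoid this.
    model = NMF(n_components=10, init="nndsvd", random_state=0, max_iter=n)
    model.fit(X)
    errors.append(model.reconstruction_err_)

plt.plot(iters, errors, marker="o")
plt.xlabel("max_iter")
plt.ylabel("Frobenius reconstruction error")
plt.show()
```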

@vene (Member) commented Nov 28, 2013

There is also this paper about a way to speed up convergence of KL-divergence NMF by using diagonalized Newton updates; the author's (synthetic) benchmarks show it to be better than the coordinate descent version. This paper doesn't pass the scikit-learn criteria for inclusion, but I found it interesting nonetheless.

@amueller (Member)

Close based on #4811?

@ogrisel (Member) commented Aug 27, 2015

Let's close and move the discussion to #4811.
