[WIP] NMF estimator based on the Lee and Seung algorithm #2540
Conversation
I think the initialization thing is not just about speed of … I will take a look at this exciting PR soon :)
I really thought that one would be slower... we should really test for a regression in the projected gradient one.

For NMF by the multiplicative method with KL divergence, see #1348.
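For reference, a minimal NumPy sketch of the classic Lee–Seung multiplicative updates for the squared (Frobenius) loss, the algorithm family this PR targets; the KL-divergence variant discussed in #1348 uses different update rules. This is an illustration only, not the code in this PR, and the function name and iteration count are made up.

```python
import numpy as np

def multiplicative_nmf(X, n_components, n_iter=200, eps=1e-10, random_state=0):
    """Lee-Seung multiplicative updates for min ||X - W H||_F^2 with W, H >= 0."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```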
Looks like that PR can be merged into this one as a …

Would be interesting to experiment with a mini-batch SGD variant as well. It might even be faster than the batch gradient method, and furthermore could accept a … However, I am not sure whether / how the mini-batch size will impact the convergence.

@ogrisel My thought exactly. Wikipedia has a page about online NMF with a few references; I wanted to read those before hacking up my own online version.

How about putting this on the pycon.fr sprint list?
unless of course @larsmans uses his magic and finishes it up first ;)

When's that?

I would be glad to try hacking on this at pycon.fr even if @larsmans ends up implementing the final version. It's good to have several people in the project with a good practical understanding of the algos.

Next week, Monday and Tuesday in the beautiful city of Strasbourg: http://www.pycon.fr/2013/

Yeah, I'd googled it :) I'd very much appreciate any work on the NMF solvers; they can solve interesting text mining problems if they're fast enough.

+1, sparse online solvers for NMF also look like an interesting building block for recsys. Dense online NMF could be useful for signal processing, like audio classification / decomposition from STFT data.
The scikit is starting to become a bit inconsistent in the way we create classes: …

I tend to prefer the first (the model parameters are the same, only the way we fit them changes), although from a code-organization point of view the second can be more convenient. I would like to leave the door open to adding a coordinate descent solver for NMF. All major solvers (projected gradient, multiplicative method, CD) support both KL and squared losses; it's just a matter of implementing them.

I've noticed this before, but then the machine learning community is inconsistent and sloppy in distinguishing problems, models and algorithms. IMHO, we should make pragmatic choices, with the number of hyperparameters (and valid combinations of them) as a guideline, so as not to bother users with parameters that don't do anything. I've set up this …
Sounds good. Would you care to rename …

Good idea. But then what do we do with the current …

So do I; however, that's the limit: sometimes the algorithm strongly constrains the …

I feel the same.
Ouch, breakage again...

No breakage. This PR actually breaks some code by reusing the `NMF` identifier for the new estimator. It's currently a synonym for the other estimator; see …

I would deprecate it.

Yes, but... it's really tiring to know that scikit-learn is not reliable …

I'll leave it in for this PR but remove all references to it in the documentation. We were never consistent about addressing that class, sometimes calling it …

We could also just increase the length of the deprecation cycle to 1 or 2 years instead of 2 releases, to transition smoothly from young project to mature project while not freezing API or naming mistakes in stone.

Docs done, now for the tests.
Hm. Next problem is that some of the tests are quite unstable. By inserting more tests, the RNG's state is changed and some of the PG-NMF tests start failing.

We should really seed each test explicitly to make them independent of the running order.
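A minimal sketch of the kind of per-test seeding being suggested; the test name, shapes and parameters are made up for illustration, and only the pattern of giving each test its own seeded `RandomState` matters:

```python
import numpy as np
from sklearn.decomposition import NMF


def test_nmf_fit_is_deterministic():
    # Each test builds its own seeded RNG, so inserting or reordering
    # other tests cannot change the data this test sees.
    rng = np.random.RandomState(42)
    X = np.abs(rng.randn(30, 10))

    model = NMF(n_components=5, init='nndsvd', random_state=0)
    W = model.fit_transform(X)

    assert W.shape == (30, 5)
    assert np.all(W >= 0)
```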
I've done that to the tests where it's really a problem. I haven't run all of them individually, but the rest seem stable. Tests done, next up: check all examples that use NMF.
Indeed, but do we want to keep two different names to refer to the same class?

Sorry, I thought you meant the projection onto the positive semidefinite cone (the set of all positive semidefinite matrices), which is obviously not the same as the non-negative orthant.
Something that keeps bugging me in NMF is regularization/sparsity constraints. IIRC the literature says that while the unregularized formulation is naturally somewhat sparse for nice datasets, you usually need to enforce sparsity explicitly. As far as I know the multiplicative update method cannot handle this, right? The implementation in PG-NMF is currently very hackish and uses hstacks to adjust the cost function (sorry, it was some of my first numerical code!), and I also think it causes some slowdown and wasted memory. It would be nice to hear some experience from the field with this re: L2 or L1 regularization.
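For readers unfamiliar with the stacking trick mentioned above, here is a hedged sketch of the idea in its L2 form: a ridge penalty on `H` can be folded into a plain non-negative least-squares subproblem by augmenting `X` and `W` with extra rows. This is only an illustration of the technique (using `scipy.optimize.nnls` column by column), not the PR's actual code:

```python
import numpy as np
from scipy.optimize import nnls

def ridge_nnls_columns(X, W, alpha):
    """Solve min_{H >= 0} ||X - W H||_F^2 + alpha * ||H||_F^2, column by column.

    The penalty is absorbed by augmenting the least-squares system:
    ||x - W h||^2 + alpha ||h||^2 == ||[x; 0] - [W; sqrt(alpha) I] h||^2.
    """
    n_components = W.shape[1]
    W_aug = np.vstack([W, np.sqrt(alpha) * np.eye(n_components)])
    pad = np.zeros(n_components)
    H = np.empty((n_components, X.shape[1]))
    for j in range(X.shape[1]):
        x_aug = np.concatenate([X[:, j], pad])
        H[:, j], _ = nnls(W_aug, x_aug)
    return H
```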
I think it's always a good idea to use regularization. In addition to making solutions sparser, it also makes the problem better conditioned. The projected gradient update without regularization is roughly `W <- max(W - eta * grad, 0)`. Now, if you add an L1 penalty, the gradient gains a constant term `alpha`, so the update becomes `W <- max(W - eta * (grad + alpha), 0)`. So it's easy to see that it is more likely that weights get truncated to zero. I cannot comment on the multiplicative method as I don't know it well.
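A small NumPy sketch of the update just described, to make the truncation effect concrete; `eta` and `alpha` are arbitrary illustrative values:

```python
import numpy as np

def projected_gradient_step(W, grad, eta=0.1, alpha=0.0):
    """One projected gradient step with an optional L1 term.

    With alpha > 0 the whole gradient is shifted upward by a constant,
    so more entries fall below zero and get clipped by the projection.
    """
    return np.maximum(W - eta * (grad + alpha), 0.0)


rng = np.random.RandomState(0)
W = rng.rand(4, 3)
grad = rng.randn(4, 3)
print((projected_gradient_step(W, grad, alpha=0.0) == 0).sum())
print((projected_gradient_step(W, grad, alpha=5.0) == 0).sum())  # at least as many zeros
```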
Moved the code by @vene from benchmarks/bench_plot_nmf.py to sklearn.decomposition.nmf.

OK, I could do with some help here. I can't seem to get the common tests to pass since (1) they pick up … I wrote a custom …

@mblondel The L1-regularized version of the multiplicative algorithm is given by Hoyer 2002.
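For context, in Hoyer-style non-negative sparse coding the L1 term on `H` typically shows up as a constant added to the denominator of the multiplicative `H` update. A hedged sketch of that update rule (illustrative only; the paper's full algorithm also treats `W` differently):

```python
import numpy as np

def l1_multiplicative_h_update(X, W, H, l1_reg, eps=1e-10):
    """Multiplicative update of H for 0.5 * ||X - W H||_F^2 + l1_reg * sum(H).

    The L1 term adds the constant l1_reg to the gradient with respect to H,
    which in the multiplicative scheme lands in the denominator and shrinks
    H toward zero.
    """
    numerator = W.T @ X
    denominator = W.T @ W @ H + l1_reg + eps
    return H * (numerator / denominator)
```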
Inline review on this part of the diff:

```python
W *= update
maxW = np.max(update)

if i % 10 == 0:
```
Isn't the 10 here pretty arbitrary? The max can be slow for large dense matrices, right?
Yes, it's arbitrary, and I guess it doesn't skip much since I moved the max outside the check to reuse the `update` variable (there were `updateH` and `updateW` before). I'd have to do some timing to figure out which one is better.
Unfortunately the current API for NMF doesn't support warm restarts. It would be trivial to add, but should we?
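For the sake of discussion, here is what a warm-restart pattern could look like from the user's side; the `init='custom'` option and the `W`/`H` keyword arguments are assumptions for illustration, not something the code in this PR exposes:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 20))

# A cheap fit to get a rough factorization.
coarse = NMF(n_components=5, max_iter=20, init='nndsvd')
W = coarse.fit_transform(X)
H = coarse.components_

# Resume from (W, H) instead of re-initializing from scratch.
fine = NMF(n_components=5, max_iter=200, init='custom')
W = fine.fit_transform(X, W=W, H=H)
```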
More updates on the text benchmark and the faces benchmark: it turns out PG isn't so bad, but it's not fair to set the same … For a fair comparison we need to somehow adjust the convergence criteria to make them match. I think the PG-NMF convergence criterion is the messed-up one, as you can see that with random init it sometimes "converges" to noise very quickly.

I forgot to mention: the current snapshot of the benchmarks was run against a local branch where I made the multiplicative NMF check for convergence at every iteration, and the final number of iterations is saved under …

@vene Do you think I should give up on this PR?
Definitely not, but I think I found a nasty hidden bug in the projected gradient …

Also, from the benchmarks it seems that the multiplicative update is better …
The convergence plot would also be useful for checking correctness of the implementations. How did you generate the convergence plots in your notebooks?

The "dumb" way: `for max_iter in range(...)`. On the other hand this allows you to use the stopping condition, otherwise …
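A hedged sketch of that "dumb" approach: refit from scratch with an increasing iteration budget and record the final error each time. The data and parameter values are made up, and it assumes the estimator exposes `max_iter`, `tol` and `reconstruction_err_` the way the scikit-learn NMF classes do:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(200, 50))

budgets = range(5, 105, 5)
errors = []
for max_iter in budgets:
    model = NMF(n_components=10, init='nndsvd', tol=1e-10, max_iter=max_iter)
    model.fit(X)                      # each fit restarts from the same init
    errors.append(model.reconstruction_err_)

for it, err in zip(budgets, errors):
    print(it, err)
```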
I think I had a similar problem in lightning. When you use warm-start, the first iteration is already pretty good. Since the stopping criterion is relative to the first iteration, it therefore becomes too strict. One solution would be to store the first iteration's projected gradient norm in …
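To make the criterion being discussed concrete, a minimal sketch of a projected-gradient-norm stopping test measured relative to the initial norm; where that initial norm gets cached is left open (an estimator attribute is one option):

```python
import numpy as np

def projected_gradient_norm(W, grad):
    """Norm of the gradient projected onto the feasible set {W >= 0}.

    Free coordinates keep their gradient; coordinates at the boundary
    only count if the gradient pushes them further negative.
    """
    proj = np.where(W > 0, grad, np.minimum(grad, 0.0))
    return np.linalg.norm(proj)

def should_stop(W, grad, init_norm, tol=1e-4):
    # init_norm would be cached at the first iteration of the *first* fit,
    # so a warm start is not measured against an already-good starting point.
    return projected_gradient_norm(W, grad) < tol * init_norm
```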
Alright, I guess this is the easiest way to do it for now. We can always add warm-start later (the dumb way doesn't scale to large datasets).
There is also this paper about a way to speed up convergence of KL-divergence NMF by considering diagonalized Newton updates, and the author's (synthetic) benchmarks show it to be better than the coordinate descent version. This paper doesn't pass the scikit-learn criteria for inclusion, but I found it interesting nonetheless.

Close based on #4811?

Let's close and move the discussion to #4811.
Based on code by @vene. Needs tests, docs. Benchmark needs to be updated; must check usage of the `NMF` class in the scikit. Question: does it make sense at all to bootstrap this thing using NNDSVD?