Online implementation of Non-negative Matrix Factorization (NMF) #13308

Closed · pcerda opened this issue Feb 27, 2019 · 19 comments · Fixed by #16948

@pcerda commented Feb 27, 2019

Description

@GaelVaroquaux Maybe an online version of NMF could be useful in big-data settings.

Actual Results

Here I show the results for a sparse input matrix of size 1M x 2.8k.
The factorization has n_components = 10 and each point is a full pass over the data.
The current online implementation only supports the Kullback-Leibler divergence, but in practice
it can be generalized to any beta divergence (a sketch is given after the figure below).

[Figure: batch_vs_online_nmf, convergence of batch vs. online NMF]
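For reference, a minimal NumPy sketch of the beta divergence referred to above (dense inputs only, purely illustrative; this is not the scikit-learn internal implementation):

```python
# Illustrative elementwise beta divergence: beta=2 gives the squared
# Frobenius norm, beta=1 Kullback-Leibler, beta=0 Itakura-Saito.
import numpy as np

def beta_divergence(X, Y, beta):
    """Sum over elements of d_beta(x | y) between X and its approximation Y."""
    X = np.asarray(X, dtype=float)
    Y = np.maximum(np.asarray(Y, dtype=float), np.finfo(float).eps)
    if beta == 1:   # Kullback-Leibler: x log(x/y) - x + y
        X_safe = np.maximum(X, np.finfo(float).eps)
        return np.sum(X * np.log(X_safe / Y) - X + Y)
    if beta == 0:   # Itakura-Saito: x/y - log(x/y) - 1
        ratio = np.maximum(X, np.finfo(float).eps) / Y
        return np.sum(ratio - np.log(ratio) - 1)
    return np.sum((X ** beta + (beta - 1) * Y ** beta
                   - beta * X * Y ** (beta - 1)) / (beta * (beta - 1)))
```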


@GaelVaroquaux (Member)

Nice figure! Let me see with the others if this should go in.

@GaelVaroquaux (Member)

Could you run it on a Wikipedia text example?

@pcerda (Author) commented Feb 27, 2019

Here is a benchmark on word counts for the first paragraph of 500k Wikipedia articles. To speed up the computation, I used a HashingVectorizer with 4096 features. The KL divergence is computed on 50k test samples. (A sketch of the vectorization step follows the figure.)

[Figure: online_nmf_benchmark_wiki, online NMF benchmark on Wikipedia word counts]
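For concreteness, the vectorization step could look like this (a sketch; `wiki_paragraphs` is a hypothetical iterable of raw article texts, as the original data set is not linked):

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    n_features=4096,       # small hashed vocabulary to speed up the benchmark
    alternate_sign=False,  # keep all counts non-negative, as NMF requires
    norm=None,             # raw term counts instead of normalized vectors
)
X = vectorizer.transform(wiki_paragraphs)  # sparse matrix, shape (n_docs, 4096)
```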

@GaelVaroquaux (Member) commented Feb 27, 2019

Can you do a draft PR just adding the code to scikit-learn, so that people can have a quick look at how complicated the code would be?

Don't worry about tests, documentation, or full API compatibility for now. Just add the object (which I think should be called MiniBatchNMF).
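To make the shape of the proposal concrete, here is a rough skeleton of such an estimator's interface (names and defaults are illustrative only; the object eventually merged via #16948 is sklearn.decomposition.MiniBatchNMF):

```python
# Rough, illustrative skeleton of the proposed estimator; update rules omitted.
from sklearn.base import BaseEstimator, TransformerMixin

class MiniBatchNMF(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=10, batch_size=1024, max_iter=200,
                 beta_loss='kullback-leibler', tol=1e-4, random_state=None):
        self.n_components = n_components
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.beta_loss = beta_loss
        self.tol = tol
        self.random_state = random_state

    def fit(self, X, y=None):
        # Cycle over mini-batches of rows of X, updating W and H with
        # multiplicative updates until the stopping criterion is met.
        ...
        return self

    def partial_fit(self, X, y=None):
        # One pass over a single mini-batch, for out-of-core learning.
        ...
        return self

    def transform(self, X):
        # Solve for the code W of new samples with the dictionary H fixed.
        ...
```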

@amueller (Member)

It would be nice to see fuller benchmarks, but I think in general this would be useful.

@GaelVaroquaux (Member) commented Feb 28, 2019 via email

@amueller (Member)

Vary n_features and n_samples?

@cmarmo (Contributor) commented Apr 7, 2020

@GaelVaroquaux, @amueller I have synchronized the code proposed by @pcerda in #13386 with master. I was unable to access the data set used in the previous comment, so I used the "Blog Authorship Corpus" instead.
The attached plot benchmarks n_features, n_samples and n_components for the scikit-learn implementation of Non-negative Matrix Factorization (NMF) and the online one proposed by @pcerda.

[Figure: bench_topics, convergence time across n_features, n_samples and n_components]

The code used to produce the plot is a variation on the topic extraction example; it is available here. (A condensed sketch of the comparison is given below.)
As a reminder, the paper to refer to is "Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence", A. Lefevre, F. Bach, C. Févotte, 2011.
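A condensed sketch of that kind of comparison (assuming `documents` is a list of raw texts; MiniBatchNMF stands for the class from the proposed PR, available as sklearn.decomposition.MiniBatchNMF since scikit-learn 1.1):

```python
from time import time
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# `documents` stands in for the corpus (e.g. the Blog Authorship Corpus).
X = CountVectorizer(max_features=4096, stop_words='english').fit_transform(documents)

t0 = time()
NMF(n_components=10, beta_loss='kullback-leibler', solver='mu',
    max_iter=1000).fit(X)
print(f'batch NMF:      {time() - t0:.1f} s')

t0 = time()
MiniBatchNMF(n_components=10, batch_size=1024,
             beta_loss='kullback-leibler').fit(X)
print(f'mini-batch NMF: {time() - t0:.1f} s')
```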

Comments?
Thanks!

@GaelVaroquaux (Member)

The speed-up is impressive. Thank you for highlighting it.

@GaelVaroquaux (Member)

@cmarmo: can you push a PR with the new code, so that we can review and iterate on this?

Thanks!!

@cmarmo (Contributor) commented Apr 17, 2020

> @cmarmo: can you push a PR with the new code, so that we can review and iterate on this?

I'm on it: I'm just trying to get rid of the NMFOriginal.py file. Patricio kept it in order to guarantee the comparison with the original implementation; I'm trying to reproduce the original behavior in the same code. I can open a WIP PR if you think it's worth it, but I would rather you wait before reviewing the code. Thanks!

@GaelVaroquaux (Member) commented Apr 17, 2020 via email

@jeremiedbb (Member)

@cmarmo in the benchmarks you made above, you display the convergence time. What is the convergence criterion, and do both versions converge to solutions of similar quality?

@cmarmo (Contributor) commented Apr 24, 2020

@jeremiedbb I'm working on adding this information to the plot. Thanks for your comment.

@cmarmo (Contributor) commented Apr 30, 2020

Hi @jeremiedbb, and thanks for your patience; my plots came out fuzzy, probably because someone else was running heavy computations on the machine at the same time... While waiting for the new version, let me just say that, if I understand correctly, convergence is checked here:

if (previous_error - error) / error_at_init < tol:

using the Kullback-Leibler divergence as the beta divergence to define the error between the original matrix and its decomposition. My impression is that the minibatch implementation reaches a lower divergence at convergence time, but I need to build reliable plots to confirm.
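For readers skimming the thread, here is schematically where that check sits in the multiplicative-update loop (an illustrative pseudo-structure, not the exact scikit-learn source; `beta_divergence` and `multiplicative_update` are stand-ins):

```python
# Schematic stopping rule: iterate multiplicative updates and stop once the
# relative decrease of the beta divergence falls below `tol`.
error_at_init = beta_divergence(X, W @ H, beta=1)   # beta=1: Kullback-Leibler
previous_error = error_at_init
for n_iter in range(1, max_iter + 1):
    W, H = multiplicative_update(X, W, H)           # update both factors once
    if tol > 0 and n_iter % 10 == 0:                # test every 10 iterations
        error = beta_divergence(X, W @ H, beta=1)
        if (previous_error - error) / error_at_init < tol:
            break
        previous_error = error
```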

@cmarmo (Contributor) commented May 3, 2020

@jeremiedbb do this new plot and my previous comment answer your question?
[Figure: bench_topics, updated benchmark including convergence quality]

@jeremiedbb (Member)

So in all your benchmarks, online NMF converges faster to a better solution. Pretty convincing!

@cmarmo (Contributor) commented May 7, 2020

As I'm starting to play with the source, for the sake of reproducibility: this plot and this one were produced with this version of the NMF code.

@GaelVaroquaux (Member) commented Apr 24, 2022 via email
