[MRG] Online implementation of non-negative matrix factorization #16948

Merged: 349 commits, Apr 22, 2022
Commits
3694255
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Aug 3, 2020
97082c7
Sum batch iterations to iterations.
cmarmo Aug 3, 2020
5cc9949
Debugging.
cmarmo Aug 11, 2020
e852ad2
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Aug 13, 2020
7d75d30
Debug
cmarmo Aug 13, 2020
753e6f6
Some improvements.
cmarmo Aug 13, 2020
cd28014
Add hardcoded forgetting factor.
cmarmo Aug 18, 2020
88fc02c
Sync with upstream.
cmarmo Aug 24, 2020
d5ad09a
Fix index.
cmarmo Aug 24, 2020
6b8969f
Various testing.
cmarmo Aug 24, 2020
2a7d316
Same results for NMF and onlineNMF for batch_size=n_samples.
cmarmo Aug 28, 2020
09e50e3
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Aug 28, 2020
172d097
Linting.
cmarmo Aug 28, 2020
921bd33
Linting in benchmarks.
cmarmo Aug 28, 2020
03867c2
Fix number of iterations.
cmarmo Aug 28, 2020
f58900c
Clean parameters.
cmarmo Aug 28, 2020
e2be821
Remove transform and inverse_transform function.
cmarmo Aug 29, 2020
0020eb6
Fix references.
cmarmo Aug 29, 2020
05d6010
Add tests.
cmarmo Aug 29, 2020
8c7a3fb
Fix lint errors in tests.
cmarmo Aug 29, 2020
e4c1e23
Add one more test.
cmarmo Aug 30, 2020
6b930d9
Fix import.
cmarmo Aug 30, 2020
4c8bf0a
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Aug 31, 2020
8f54700
Remove duplicated code.
cmarmo Aug 31, 2020
6b99b95
Lint.
cmarmo Aug 31, 2020
34778ab
Fix indentation.
cmarmo Aug 31, 2020
df4ec4e
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Aug 31, 2020
7679e3d
Fix indentation.
cmarmo Aug 31, 2020
44fa3bf
Fix docstring example.
cmarmo Aug 31, 2020
fcde447
Add forget_factor as parameter.
cmarmo Aug 31, 2020
749d59a
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 2, 2020
51f763c
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 3, 2020
bebde14
Fix partial_fit function (hopefully). Adapt benchmarks.
cmarmo Sep 4, 2020
e1794a8
Linting.
cmarmo Sep 4, 2020
3104d5e
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 4, 2020
00574c7
Bench with n_traing greater than n_test.
cmarmo Sep 7, 2020
898b590
Try to avoid SyntaxError in import.
cmarmo Sep 7, 2020
8b4de0d
Try to avoid SyntaxError in import (again).
cmarmo Sep 7, 2020
8379b53
Try to avoid SyntaxError in import (last one?).
cmarmo Sep 7, 2020
e7b5ec7
Try to avoid SyntaxError in import?
cmarmo Sep 7, 2020
9909247
Try to avoid SyntaxError in import?
cmarmo Sep 7, 2020
0e0bf23
Add sample variation.
cmarmo Sep 7, 2020
e140b4d
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 7, 2020
e243df9
Linting.
cmarmo Sep 7, 2020
c42c499
Set forget_factor default to 0.7. Add some doc. Add MiniBatchNMF to A…
cmarmo Sep 8, 2020
f2017f5
Test.
cmarmo Sep 8, 2020
b7a4555
Test.
cmarmo Sep 8, 2020
164183f
Remove failing file for now.
cmarmo Sep 8, 2020
21b6413
Fix sphinx warning.
cmarmo Sep 8, 2020
5053538
Add test for partial_fit. Fix output number of iterations.
cmarmo Sep 8, 2020
0cbeb10
Lintgit push origin modified_nmf_for_minibatch !
cmarmo Sep 8, 2020
5882a19
Lint and refactor.
cmarmo Sep 8, 2020
7b959d4
Lint.
cmarmo Sep 8, 2020
f103131
Tentative test for auxiliary matrices.
cmarmo Sep 8, 2020
8e06593
Lint.
cmarmo Sep 8, 2020
a09aee3
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 9, 2020
60d058f
Better test for auxiliary matrices.
cmarmo Sep 9, 2020
d277dcd
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 10, 2020
39357b0
Address comments.
cmarmo Sep 10, 2020
ec687c6
Add docstring for _multiplicative_update_h.
cmarmo Sep 10, 2020
e0c25e2
Remove shuffle in MiniBatchNMF partial_fit.
cmarmo Sep 10, 2020
4f23406
Tentatively reverting benchmarks.
cmarmo Sep 10, 2020
825d6dd
Address some of the comments.
cmarmo Sep 10, 2020
936cdcc
Address some of the comments.
cmarmo Sep 10, 2020
7c13c85
Inherit MiniBatch NMF from NMF.
cmarmo Sep 14, 2020
66ae8c0
Lint.
cmarmo Sep 14, 2020
4b03d60
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 17, 2020
70ca165
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 21, 2020
0a9b7a1
Documentation.
cmarmo Sep 21, 2020
384c4c2
Increase iterations for MiniBatchNMF common tests.
cmarmo Sep 21, 2020
40a638d
Remove unexplained failing file to allow documentation build.
cmarmo Sep 21, 2020
1f4966f
Add validation for batch_size.
cmarmo Sep 26, 2020
77b7e5f
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Sep 26, 2020
4d75a3e
Remove f-string for python 3.6 compatibility.
cmarmo Sep 26, 2020
a1be937
Fix conflicts.
cmarmo Oct 16, 2020
0268bb8
Fix some more conflicts.
cmarmo Oct 16, 2020
114d55f
Generalize test to minibatchnmf.
cmarmo Oct 16, 2020
12c33d1
Lint and forgotten tests.
cmarmo Oct 16, 2020
a8f660e
Fix call.
cmarmo Oct 16, 2020
ab73275
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Oct 20, 2020
4d50010
Make all tests pass (thanks Jeremie).
cmarmo Oct 21, 2020
dc2af80
Fix messages and FutureWarning (again).
cmarmo Oct 21, 2020
7994110
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Oct 22, 2020
3eaf438
Add iter_offset_ .
cmarmo Oct 22, 2020
b59c32a
Lint.
cmarmo Oct 22, 2020
896efa5
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Nov 12, 2020
b9ca59b
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Dec 7, 2020
602ce53
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Dec 17, 2020
3fdcec0
Apply suggestions from code review
cmarmo Dec 18, 2020
964dad8
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Dec 18, 2020
af95de9
Address comments.
cmarmo Dec 18, 2020
bded3d4
Update tests.
cmarmo Dec 18, 2020
386c8bc
Merge with master.
cmarmo Dec 18, 2020
3f41280
Address some comments.
cmarmo Dec 18, 2020
f215c33
Apply suggestions from code review
cmarmo Dec 18, 2020
7b91764
Lint.
cmarmo Dec 18, 2020
34ada71
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Dec 23, 2020
a234186
Address more comments.
cmarmo Dec 23, 2020
dae9012
Test batch_size lt n_samples. Fix lint.
cmarmo Dec 23, 2020
d02399a
Parametrize the nmf close to MBnmf test.
cmarmo Dec 29, 2020
98c569b
Sets assume_finite in MiniBatchNMF (see discussions in #18581).
cmarmo Dec 29, 2020
2256926
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Jan 5, 2021
dd66685
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Jan 5, 2021
96545a6
Add back benchmark script.
cmarmo Jan 7, 2021
e33e166
Add new test on test sample.
cmarmo Jan 7, 2021
53c1398
Optimize transform parameters in partial_fit.
cmarmo Jan 7, 2021
1726b00
Fix indentation of iter_offset. Check convergence every iteration.
cmarmo Jan 8, 2021
4749f7f
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Jan 13, 2021
27f5640
Set max_iter to self.max_iter in partial_fit.
cmarmo Jan 13, 2021
a6adcaa
Remove debug relics. Add comment on batch_size.
cmarmo Jan 13, 2021
c253588
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Jan 19, 2021
89fcf7c
Merge branch 'master' into modified_nmf_for_minibatch
cmarmo Jan 19, 2021
8d1bdf9
Generalise norm notation in docstring.
cmarmo Jan 19, 2021
6da0cd2
Throw an error when batch_size is not None and loss=frobenius. Reorga…
cmarmo Jan 19, 2021
7ba62fe
Fix tests (the fixable one).
cmarmo Jan 19, 2021
bfc07f1
Add batch size in mbnmf transform function.
cmarmo Jan 20, 2021
c632d81
Experimenting with iterations.
cmarmo Jan 20, 2021
0e3e23c
Updating bench scripts.
cmarmo Jan 20, 2021
0a203d0
Updating bench scripts.
cmarmo Jan 20, 2021
2befe59
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Jan 25, 2021
378fbe0
Revert n_iter.
cmarmo Jan 25, 2021
02ea2ff
Add a loop for W (tentative).
cmarmo Jan 25, 2021
d6784db
Fix lint.
cmarmo Jan 25, 2021
144ce91
Fix one test.
cmarmo Jan 25, 2021
c5f6b6b
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Jan 28, 2021
c629e83
Revert unuseful iterations on W.
cmarmo Jan 28, 2021
1df2415
Remove condition on batch_size gt n_samples.
cmarmo Jan 28, 2021
9ddeeef
Return H from multiplicative_update_H.
cmarmo Jan 28, 2021
6dad778
Some adjustements in tests.
cmarmo Jan 28, 2021
2e8b7d4
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Feb 1, 2021
673052a
Fix auxiliary functions manipulations.
cmarmo Feb 1, 2021
c59e325
Remove explicit calls to auxiliary matrices. Initialize them at each …
cmarmo Feb 1, 2021
885b8dd
Fix lint errors.
cmarmo Feb 1, 2021
616d01a
Start reformatting the iteration loop.
cmarmo Feb 1, 2021
97d5dda
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Feb 2, 2021
4782e63
Return iter_offset.
cmarmo Feb 2, 2021
86a5c22
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Feb 10, 2021
73c50a8
Working on tests and iterations.
cmarmo Feb 10, 2021
a83cbd5
Fix lint error.
cmarmo Feb 10, 2021
4107137
Fix common tests.
cmarmo Feb 10, 2021
c2b6919
Fix linting error.
cmarmo Feb 10, 2021
47dd275
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Feb 25, 2021
1de5e35
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Mar 1, 2021
dc70492
Allow all losses in MiniBatchNMF.
cmarmo Mar 1, 2021
db2b7ad
Allow batch_size= n_samples in mbNMF.
cmarmo Mar 1, 2021
d5172fc
reformat number of iterations.
cmarmo Mar 1, 2021
0649073
Fix lint.
cmarmo Mar 1, 2021
c83ba3f
Merge with upstream and start refactoring.
cmarmo Mar 15, 2021
784cf5f
Fix lint errors.
cmarmo Mar 15, 2021
68ede97
Apply reviewer comments.
cmarmo Mar 15, 2021
8611f09
Address some comment. Fix bad dtype in MiniBatchNMF.
cmarmo Mar 18, 2021
0e00c2a
Fix lint.
cmarmo Mar 18, 2021
ade8273
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Mar 22, 2021
1df45b4
generalize function parameters in test.
cmarmo Mar 22, 2021
961c2cb
Improve test on partial_fit, fix iteration number.
cmarmo Mar 22, 2021
3b2b442
Compute iter_offset.
cmarmo Mar 22, 2021
b48d1dc
Fix iteration number and intitialization in tests.
cmarmo Mar 25, 2021
8af4d32
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Mar 25, 2021
cce2e7e
Reworking iterations, fix some tests.
cmarmo Mar 29, 2021
d55dc99
Minor adjustments.
cmarmo Mar 31, 2021
52f41fa
Refactor tests.
cmarmo Apr 2, 2021
805f21c
Add a test for reconstruction.
cmarmo Apr 6, 2021
93782e4
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Apr 12, 2021
da88b2f
Address some sommeents.
cmarmo Apr 12, 2021
5b9531b
Merge with main.
cmarmo Apr 27, 2021
049368a
Simplify tests.
cmarmo Apr 27, 2021
7914e9d
Fix lint errors.
cmarmo Apr 27, 2021
603ce83
Add MiniBatchNMF to the example about topics extraction.
cmarmo Apr 27, 2021
9abe904
Merge branch 'main' into modified_nmf_for_minibatch
cmarmo Apr 28, 2021
3c50aff
Remove obsolete benchmark script.
cmarmo Apr 28, 2021
7085842
Fix sphinx warning.
cmarmo Apr 28, 2021
ed3e13a
True fix sphinx warning.
cmarmo Apr 28, 2021
fc2456b
Use _fit_transform instead of transform in partial_fit.
cmarmo Apr 30, 2021
f1d1e75
Fix partial_fit.
cmarmo May 4, 2021
cf50558
Increase iteration number in common tests.
cmarmo May 6, 2021
66a4c70
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb May 12, 2021
762c27f
Sync with main and update test on error messages.
cmarmo May 27, 2021
e7b727a
Address some comments.
cmarmo May 27, 2021
decbca8
Cast ceil output.
cmarmo May 27, 2021
d8048f7
Fix lint error.
cmarmo May 31, 2021
8941e6c
Address comment.
cmarmo May 31, 2021
1185388
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Jun 16, 2021
e0a5ff5
Merge branch 'modified_nmf_for_minibatch' of https://github.com/cmarm…
jeremiedbb Jun 16, 2021
5668353
Import the black config and deal with existing merge conflicts before…
cmarmo Jun 19, 2021
c2c13a0
MAINT Adds target_version to black config (#20293)
thomasjpfan Jun 17, 2021
492efd9
Format code with black.
cmarmo Jun 19, 2021
48c2213
Merge with upstream.
cmarmo Jun 19, 2021
51274d0
Fix forgotten conflict.
cmarmo Jun 19, 2021
40d0b36
Fix more forgotten conflicts.
cmarmo Jun 19, 2021
500e526
wip
jeremiedbb Jun 22, 2021
1dbcc52
Merge branch 'modified_nmf_for_minibatch' of https://github.com/cmarm…
jeremiedbb Jun 22, 2021
ad39359
black
jeremiedbb Jun 22, 2021
47b5f88
wip
jeremiedbb Jul 5, 2021
68f0e48
black
jeremiedbb Jul 5, 2021
352fd19
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Jul 6, 2021
52863f7
black
jeremiedbb Jul 6, 2021
547ce68
cln
jeremiedbb Jul 6, 2021
b0471ad
cln
jeremiedbb Jul 6, 2021
e83b97d
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Jul 20, 2021
2ba0e96
cln + regularization
jeremiedbb Jul 23, 2021
61f3c93
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Jul 23, 2021
25be104
pass numpydoc val
jeremiedbb Jul 23, 2021
4561e9f
wip
jeremiedbb Sep 1, 2021
fb98339
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Oct 27, 2021
446ce3c
iter
jeremiedbb Oct 27, 2021
620a065
whats new
jeremiedbb Oct 27, 2021
8f16bbe
black
jeremiedbb Oct 27, 2021
ec31b65
black
jeremiedbb Oct 27, 2021
8194068
cln
jeremiedbb Oct 27, 2021
7b721c3
cln
jeremiedbb Oct 27, 2021
198afe2
cln
jeremiedbb Oct 27, 2021
b30e3b7
cln
jeremiedbb Oct 28, 2021
06a3342
iter
jeremiedbb Oct 28, 2021
a6ff0e9
cln doc
jeremiedbb Oct 29, 2021
7e33d60
improve coverage
jeremiedbb Oct 29, 2021
54e1ad7
black
jeremiedbb Oct 29, 2021
d22da5f
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Nov 2, 2021
bd71e13
cln
jeremiedbb Nov 2, 2021
f7c6bbf
cln doc
jeremiedbb Nov 2, 2021
b5b23b4
Merge remote-tracking branch 'upstream/main' into pr/cmarmo/16948
jeremiedbb Dec 6, 2021
4d20ad4
adress comments
jeremiedbb Dec 6, 2021
584744a
black
jeremiedbb Dec 6, 2021
607e7db
cln
jeremiedbb Dec 6, 2021
6c4382b
remove solver param
jeremiedbb Dec 6, 2021
a029c25
lint
jeremiedbb Dec 6, 2021
8b37550
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Feb 8, 2022
54f17ed
apply suggestions
jeremiedbb Feb 8, 2022
34ba813
lint
jeremiedbb Feb 8, 2022
8d54ef7
improve obj function readability
jeremiedbb Feb 9, 2022
8e18e0b
non-negative
jeremiedbb Feb 9, 2022
e6d7fa6
Merge branch 'master' into modified_nmf_for_minibatch
jeremiedbb Mar 3, 2022
e52cbd2
address comments
jeremiedbb Mar 3, 2022
eb06c60
lint
jeremiedbb Mar 3, 2022
dad2eb2
address comments
jeremiedbb Mar 25, 2022
a027686
credit pcerda
jeremiedbb Mar 25, 2022
ce646d7
update what's new entry
jeremiedbb Apr 8, 2022
616f9ba
test beta_loss > 2
jeremiedbb Apr 8, 2022
b6681f8
improve solve_W docstring
jeremiedbb Apr 8, 2022
0922eb3
improve partial_fit docstring
jeremiedbb Apr 8, 2022
051fa8e
don't introduce new warnings in tests
jeremiedbb Apr 8, 2022
0094d4f
lint
jeremiedbb Apr 8, 2022
7978db1
Merge remote-tracking branch 'upstream/main' into pr/cmarmo/16948
jeremiedbb Apr 21, 2022
a7ef482
address review comments
jeremiedbb Apr 21, 2022
c77de85
lint
jeremiedbb Apr 21, 2022
e2510ec
fix position in what's new
jeremiedbb Apr 21, 2022
3ecf370
better format obj function in docstring
jeremiedbb Apr 21, 2022
15ead2e
Merge branch 'main' into modified_nmf_for_minibatch
jeremiedbb Apr 21, 2022
9abf0c9
Merge remote-tracking branch 'upstream/main' into pr/cmarmo/16948
jeremiedbb Apr 22, 2022
5790e5f
avoid convergence warning in example
jeremiedbb Apr 22, 2022
99ec76c
Merge branch 'modified_nmf_for_minibatch' of https://github.com/cmarm…
jeremiedbb Apr 22, 2022
1 change: 1 addition & 0 deletions doc/computing/scaling_strategies.rst
@@ -80,6 +80,7 @@ Here is a list of incremental estimators for different tasks:
+ :class:`sklearn.decomposition.MiniBatchDictionaryLearning`
+ :class:`sklearn.decomposition.IncrementalPCA`
+ :class:`sklearn.decomposition.LatentDirichletAllocation`
+ :class:`sklearn.decomposition.MiniBatchNMF`
- Preprocessing
+ :class:`sklearn.preprocessing.StandardScaler`
+ :class:`sklearn.preprocessing.MinMaxScaler`
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -319,6 +319,7 @@ Samples generator
decomposition.MiniBatchDictionaryLearning
decomposition.MiniBatchSparsePCA
decomposition.NMF
decomposition.MiniBatchNMF
decomposition.PCA
decomposition.SparsePCA
decomposition.SparseCoder
26 changes: 26 additions & 0 deletions doc/modules/decomposition.rst
@@ -921,6 +921,29 @@ stored components::
* :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_beta_divergence.py`

.. _MiniBatchNMF:

Mini-batch Non Negative Matrix Factorization
--------------------------------------------

:class:`MiniBatchNMF` [7]_ implements a faster, but less accurate version of the
non negative matrix factorization (i.e. :class:`~sklearn.decomposition.NMF`),
better suited for large datasets.

By default, :class:`MiniBatchNMF` divides the data into mini-batches and
optimizes the NMF model in an online manner by cycling over the mini-batches
for the specified number of iterations. The ``batch_size`` parameter controls
the size of the batches.

In order to speed up the mini-batch algorithm it is also possible to scale
past batches, giving them less importance than newer batches. This is done
by introducing a so-called forgetting factor controlled by the ``forget_factor``
parameter.

The estimator also implements ``partial_fit``, which updates ``H`` by iterating
only once over a mini-batch. This can be used for online learning when the data
is not readily available from the start, or when the data does not fit into memory.

.. topic:: References:

.. [1] `"Learning the parts of objects by non-negative matrix factorization"
@@ -945,6 +968,9 @@ stored components::
the beta-divergence" <1010.1763>`
C. Fevotte, J. Idier, 2011

.. [7] :arxiv:`"Online algorithms for nonnegative matrix factorization with the
Itakura-Saito divergence" <1106.4198>`
A. Lefevre, F. Bach, C. Fevotte, 2011

.. _LatentDirichletAllocation:

5 changes: 5 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -288,6 +288,11 @@ Changelog
:mod:`sklearn.decomposition`
............................

- |MajorFeature| Added a new estimator :class:`decomposition.MiniBatchNMF`. It is a
faster but less accurate version of non-negative matrix factorization, better suited
for large datasets. :pr:`16948` by :user:`Chiara Marmo <cmarmo>`,
:user:`Patricio Cerda <pcerda>` and :user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Enhancement| :class:`decomposition.PCA` exposes a parameter `n_oversamples` to tune
:func:`sklearn.decomposition.randomized_svd` and
get accurate results when the number of features is large.
75 changes: 72 additions & 3 deletions examples/applications/plot_topics_extraction_with_nmf_lda.py
@@ -30,13 +30,15 @@
 import matplotlib.pyplot as plt

 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
-from sklearn.decomposition import NMF, LatentDirichletAllocation
+from sklearn.decomposition import NMF, MiniBatchNMF, LatentDirichletAllocation
 from sklearn.datasets import fetch_20newsgroups

 n_samples = 2000
 n_features = 1000
 n_components = 10
 n_top_words = 20
+batch_size = 128
+init = "nndsvda"


 def plot_top_words(model, feature_names, n_top_words, title):
@@ -101,7 +103,15 @@ def plot_top_words(model, feature_names, n_top_words, title):
     "n_samples=%d and n_features=%d..." % (n_samples, n_features)
 )
 t0 = time()
-nmf = NMF(n_components=n_components, random_state=1, alpha=0.1, l1_ratio=0.5).fit(tfidf)
+nmf = NMF(
+    n_components=n_components,
+    random_state=1,
+    init=init,
+    beta_loss="frobenius",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=1,
+).fit(tfidf)
 print("done in %0.3fs." % (time() - t0))


@@ -121,10 +131,12 @@ def plot_top_words(model, feature_names, n_top_words, title):
 nmf = NMF(
     n_components=n_components,
     random_state=1,
+    init=init,
     beta_loss="kullback-leibler",
     solver="mu",
     max_iter=1000,
-    alpha=0.1,
+    alpha_W=0.00005,
+    alpha_H=0.00005,
     l1_ratio=0.5,
 ).fit(tfidf)
 print("done in %0.3fs." % (time() - t0))
@@ -137,6 +149,63 @@ def plot_top_words(model, feature_names, n_top_words, title):
     "Topics in NMF model (generalized Kullback-Leibler divergence)",
 )

+# Fit the MiniBatchNMF model
+print(
+    "\n" * 2,
+    "Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf "
+    "features, n_samples=%d and n_features=%d, batch_size=%d..."
+    % (n_samples, n_features, batch_size),
+)
+t0 = time()
+mbnmf = MiniBatchNMF(
+    n_components=n_components,
+    random_state=1,
+    batch_size=batch_size,
+    init=init,
+    beta_loss="frobenius",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=0.5,
+).fit(tfidf)
+print("done in %0.3fs." % (time() - t0))
+
+
+tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
+plot_top_words(
+    mbnmf,
+    tfidf_feature_names,
+    n_top_words,
+    "Topics in MiniBatchNMF model (Frobenius norm)",
+)
+
+# Fit the MiniBatchNMF model
+print(
+    "\n" * 2,
+    "Fitting the MiniBatchNMF model (generalized Kullback-Leibler "
+    "divergence) with tf-idf features, n_samples=%d and n_features=%d, "
+    "batch_size=%d..." % (n_samples, n_features, batch_size),
+)
+t0 = time()
+mbnmf = MiniBatchNMF(
+    n_components=n_components,
+    random_state=1,
+    batch_size=batch_size,
+    init=init,
+    beta_loss="kullback-leibler",
+    alpha_W=0.00005,
+    alpha_H=0.00005,
+    l1_ratio=0.5,
+).fit(tfidf)
+print("done in %0.3fs." % (time() - t0))
+
+tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
+plot_top_words(
+    mbnmf,
+    tfidf_feature_names,
+    n_top_words,
+    "Topics in MiniBatchNMF model (generalized Kullback-Leibler divergence)",
+)
+
 print(
     "\n" * 2,
     "Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
7 changes: 6 additions & 1 deletion sklearn/decomposition/__init__.py
@@ -5,7 +5,11 @@
 """


-from ._nmf import NMF, non_negative_factorization
+from ._nmf import (
+    NMF,
+    MiniBatchNMF,
+    non_negative_factorization,
+)
 from ._pca import PCA
 from ._incremental_pca import IncrementalPCA
 from ._kernel_pca import KernelPCA
@@ -31,6 +35,7 @@
     "IncrementalPCA",
     "KernelPCA",
     "MiniBatchDictionaryLearning",
+    "MiniBatchNMF",
     "MiniBatchSparsePCA",
     "NMF",
     "PCA",