Draft implementation of MiniBatchNMF #13326


Closed
pcerda wants to merge 6 commits

Conversation


@pcerda commented Feb 28, 2019

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Implementation of a mini-batch NMF with multiplicative updates.
#13308
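
For context, mini-batch NMF builds on multiplicative updates of the classic Lee & Seung form. A minimal Frobenius-loss sketch for illustration (the PR itself follows the online Itakura-Saito variant cited below):

import numpy as np

def multiplicative_update_H(X, W, H, eps=1e-10):
    # Lee & Seung multiplicative update for the Frobenius loss:
    #   H <- H * (W.T @ X) / (W.T @ W @ H)
    # eps guards against division by zero.
    return H * (W.T @ X) / (W.T @ W @ H + eps)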

Any other comments?

@amueller
Member

In the paper, the mini-batch algorithm is surprisingly not that much faster. That's kind of odd.
Generally, adding a new class for something with 70 citations is not something we usually do.

@GaelVaroquaux
Member

The paper is the following:

Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence, A. Lefèvre, F. Bach, C. Févotte.
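
In that paper's online scheme, the numerator and denominator of the multiplicative update of the shared factor are aggregated over mini-batches with a forgetting factor, which appears to be what self.A, self.B and self.rho track in this PR. A minimal sketch for the Frobenius case (the paper targets the Itakura-Saito divergence; names are illustrative):

import numpy as np

def online_update_W(X_batch, H_batch, W, A, B, rho, eps=1e-10):
    # H_batch holds the per-sample activations of the current mini-batch,
    # W is the shared factor; A and B aggregate the numerator and
    # denominator statistics, discounted by the forgetting factor rho.
    A = rho * A + H_batch.T @ X_batch   # running numerator statistics
    B = rho * B + H_batch.T @ H_batch   # running denominator statistics
    W *= A / (B @ W + eps)              # multiplicative step on W
    return W, A, B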


@GaelVaroquaux left a comment


A bunch of cosmetic comments.

self.W, self.A, self.B = self._init_W(X)
# self.rho = self.r**(self.batch_size / n_samples)
# else:
# not implemented yet
Member


We'll have to implement this for dense inputs.
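
E.g., a minimal dispatch sketch (the method name is hypothetical); the dense branch is the part still to be written:

import scipy.sparse as sp

if sp.issparse(X):
    H = self._e_step_sparse(X)  # existing sparse path (hypothetical name)
else:
    raise NotImplementedError("Dense input is not supported yet.")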

return H


def mini_batch(iterable1, iterable2, n=1):
Member


Can you use sklearn.utils.gen_batches instead of this function?
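
For reference, gen_batches yields slice objects covering n_samples in chunks of batch_size:

import numpy as np
from sklearn.utils import gen_batches

X = np.random.rand(1000, 20)
for batch in gen_batches(X.shape[0], batch_size=512):
    # batch is a slice object: slice(0, 512), then slice(512, 1000)
    X_batch = X[batch]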

Ht_W = _special_sparse_dot(Ht, W, Vt)
Ht_W_data = Ht_W.data
np.divide(Vt.data, Ht_W_data, out=Ht_W_data, where=(Ht_W_data != 0))
self.rho = self.r ** (1 / (iter + 1))
Member

@GaelVaroquaux Feb 28, 2019


Any attribute that is not specified in __init__ but is instead a result of fit should have a name that ends with an underscore: self.rho -> self.rho_. The same applies to many other attributes of this object.
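
A minimal illustration of the convention:

from sklearn.base import BaseEstimator

class MiniBatchNMF(BaseEstimator):
    def __init__(self, r=0.001):
        self.r = r  # hyperparameter set in __init__: no trailing underscore

    def fit(self, X, y=None):
        # attributes computed during fit get a trailing underscore
        self.rho_ = self.r ** 0.5  # illustrative formula only
        return self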

# not implemented yet

n_batch = (n_samples - 1) // self.batch_size + 1
self.iter = 1
Member


This variable should probably be named "self.n_iter_".

if i == n_batch-1:
    W_change = np.linalg.norm(
        self.W - W_last) / np.linalg.norm(W_last)
    if (W_change < self.tol) and (iter >= self.min_iter - 1):
Member


"iter" should be renamed to "n_iter" as "iter" is a function in Python.

@GaelVaroquaux
Member

Can you also start working on tests?

@GaelVaroquaux
Member

@pcerda: can you add your benchmarking code? We'll need it to profile / benchmark.

ngram_range: tuple, default=(2, 4)

init: str, default 'k-means++'
    Initialization method of the W matrix.
Contributor


Perhaps add 2-3 words on what the W matrix is in the NMF formulation.
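
E.g., illustrative wording only (which factor W denotes in X ~= W H depends on the convention used in this PR):

init: str, default 'k-means++'
    Initialization method for W, one of the two factors of the
    low-rank decomposition X ~= W H learned by the estimator.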


Parameters
----------

Contributor


Isn't an empty line here against convention?


def __init__(self, n_components=10, batch_size=512,
r=.001, init='k-means++',
tol=1E-4, min_iter=2, max_iter=5, ngram_range=(2, 4),
Contributor


The default for tol does not match the one mentioned in the docstring.

Ht_out = Ht * safe_sparse_dot(Ht_W, W_WT1.T)
squared_norm = np.linalg.norm(
    Ht_out - Ht) / (np.linalg.norm(Ht) + 1E-10)
Ht[:] = Ht_out
Contributor


Is the [:] there to make sure that the same Ht object is used in each iteration?
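
For reference, Ht[:] = Ht_out copies the new values into the existing buffer instead of rebinding the name, so any other reference to Ht sees the update:

import numpy as np

Ht = np.ones(3)
view = Ht              # second reference to the same buffer

Ht[:] = np.zeros(3)    # in-place copy: view sees the new values
assert view[0] == 0.0

Ht = np.full(3, 2.0)   # rebinding: view still holds the old buffer
assert view[0] == 0.0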

self.W_, self.A_, self.B_ = self._m_step(
    X[slice], self.W_, self.A_, self.B_, H[slice], self.iter)
self.iter += 1
if i == n_batch-1:
Contributor


PEP8? My feeling is to add spaces around "-".


Parameters
----------
X : array-like (str), shape [n_samples,]
Contributor


Is n_features missing in the [] brackets?

break
previous_error = error
if beta_loss < 1:
    W[slice][W[slice] < np.finfo(np.float64).eps] = 0.
Contributor


consider splitting into two lines for ++readability
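
E.g.:

eps = np.finfo(np.float64).eps
W[slice][W[slice] < eps] = 0.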

@@ -1297,6 +1328,37 @@ def fit(self, X, y=None, **params):
self.fit_transform(X, **params)
return self

def partial_fit(self, X, y=None, **params):
Contributor


TODO: add docstring
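
A possible numpydoc sketch (wording illustrative, not from the PR):

def partial_fit(self, X, y=None, **params):
    """Update the factorization with a single mini-batch of samples.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Mini-batch of the data matrix to decompose.
    y : Ignored

    Returns
    -------
    self : object
        The estimator, updated in place.
    """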

@amueller
Member

amueller commented Aug 6, 2019

What's the relationship between this PR and #13386?

@GaelVaroquaux
Member

I think that #13386 is a second design that we thought was cleaner. @pcerda, can you confirm?

We'll need to finish this, by the way :). Maybe @TwsThomas will be able to pitch in.

@amueller added the Superseded (PR has been replaced by a newer PR) label Aug 12, 2019
Base automatically changed from master to main January 22, 2021 10:50
@ogrisel
Member

ogrisel commented Jan 5, 2022

Closing in favor of #16948.

@ogrisel closed this Jan 5, 2022
Labels
module:decomposition, Superseded (PR has been replaced by a newer PR)
5 participants