[MRG+1] refactor NMF and add CD solver #4852
Conversation
I think it's awesome! And Travis is failing ;)
Thanks! Only 2/3 are failing :p
Nice. I don't have time to review, though.
    return _update_cdnmf_fast(Ht, WtW, WtX, alpha, l1_ratio, False)


def fit_coordinate_descent(X, W, H, tol=1e-4, max_iter=200, alpha=0.001,
Use a private function.
In the discussion on NMF, you said:

> Maybe we can keep the class API simple and expose a function with more options

This function adds the choice to regularize W, H, or both, whereas the class does not have this option and regularizes both. Is that what you meant?
The idea was to do as in ridge.py with `ridge_regression` and `Ridge` (i.e. the public function is common to all solvers).
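For reference, a minimal sketch of that ridge.py pattern (the function and solver names here are illustrative, not the final API of this PR):

```python
def _fit_projected_gradient(X, n_components):
    ...  # projected gradient solver (stub)


def _fit_coordinate_descent(X, n_components):
    ...  # coordinate descent solver (stub)


def non_negative_factorization(X, n_components, solver='cd'):
    """Public function, common to all solvers, like ridge_regression."""
    if solver == 'cd':
        return _fit_coordinate_descent(X, n_components)
    elif solver == 'pg':
        return _fit_projected_gradient(X, n_components)
    raise ValueError("Invalid solver parameter: %r" % solver)
```

The class (`NMF`) then stays a thin wrapper around the public function, just as `Ridge` wraps `ridge_regression`.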
> (i.e. the public function is common to all solvers)

OK, I'll do that.
I think we should remove projected gradient by v0.19. Otherwise, with your plan to add more loss functions, this will be way too much code to maintain. @vene, is that fine with you?
I think that was the plan; I will add a deprecation.
Based on the plots above, which aren't surprising, yes, I fully agree that if we merge CD we can deprecate PG.
for when sparsity is not desired)
'random': non-negative random matrices, scale with:
    sqrt(X.mean() / n_components)
'uniform': matrices filled with the same value:
I don't see where the code path for 'uniform' currently is. Is 'uniform' supposed to be different from 'random'?
Also, @mblondel's gist used a random subset of the rows/cols of the input matrix as initialization; would this make sense? We could run some benchmarks for this too.
'uniform' is initialization with the same value everywhere, but I removed it in this PR.
I will open a specific PR to address the initialization schemes and run some benchmarks.
I will also look at the random subset of rows/cols.
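For concreteness, the 'random' scheme described in the docstring amounts to something like this (a sketch; the exact random distribution used in the PR may differ, and `_random_init` is a hypothetical helper name):

```python
import numpy as np


def _random_init(X, n_components, random_state=None):
    """Non-negative random matrices, scaled with
    sqrt(X.mean() / n_components) so that W @ H has roughly
    the same magnitude as X."""
    rng = np.random.RandomState(random_state)
    avg = np.sqrt(X.mean() / n_components)
    W = avg * rng.rand(X.shape[0], n_components)
    H = avg * rng.rand(n_components, X.shape[1])
    return W, H
```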
    raise ValueError('Array with wrong shape passed to %s. Expected %s, '
                     'but got %s ' % (whom, shape, np.shape(A)))
check_non_negative(A, whom)
if np.max(A) == 0:
I remember that in one of the papers they say full rows/cols of zeros can cause problems; is that the case in this implementation?
With a full col/row of zeros in W or H, WtW or HHt will have a zero on its diagonal, which is a problem since we will divide by zero.
To avoid that, I added a small L2 regularization (1e-15).
I think this is not the right way to do it. In this case, the optimal solution is probably to not update the coefficient.
i.e., skip the current iteration
OK, I changed it.
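For concreteness, here is a pure-Python sketch of the agreed fix (the helper `_update_column` and the precomputed `HHt = H @ H.T`, `XHt = X @ H.T` are illustrative names, assuming the objective is `0.5 * ||X - W @ H||**2`):

```python
import numpy as np


def _update_column(W, HHt, XHt, t):
    """One coordinate descent update of column t of W."""
    if HHt[t, t] == 0.:
        # a full row of zeros in H zeroes this diagonal entry;
        # skip the update instead of adding a tiny L2 term
        return
    # gradient of the residual term with respect to column t of W
    grad = np.dot(W, HHt[:, t]) - XHt[:, t]
    # projected Newton step, clipped to the non-negative orthant
    W[:, t] = np.maximum(W[:, t] - grad / HHt[t, t], 0.)
```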
int max_inner, double tol):
    cdef double violation = 0.
    cdef double pg
    cdef int n_length = W.shape[0]
n_features?
W is (n_samples, n_components) and H.T is (n_features, n_components).
I changed the name to be clearer:
cdef int n_samples = W.shape[0]  # n_features for H update
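That naming works because the same routine updates both factors by transposition, so its first dimension is n_samples for the W update and n_features for the H update. Reusing the `_update_column` sketch from above (again a sketch, not the PR's exact code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.abs(rng.randn(6, 4))    # (n_samples, n_features)
W = np.abs(rng.randn(6, 2))    # (n_samples, n_components)
Ht = np.abs(rng.randn(4, 2))   # (n_features, n_components), i.e. H.T

# W update: the routine's first dimension is n_samples
HHt, XHt = np.dot(Ht.T, Ht), np.dot(X, Ht)
for t in range(W.shape[1]):
    _update_column(W, HHt, XHt, t)

# H update: transpose X and swap the factors, so the same
# first dimension now plays the role of n_features
WtW, XtW = np.dot(W.T, W), np.dot(X.T, W)
for t in range(Ht.shape[1]):
    _update_column(Ht, WtW, XtW, t)
```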
Really nice. Just one thing: can you summarize your conclusions when you post figures? Thanks.
I notice that the greedy curves are sometimes longer on the X axis, even if they reach the same loss. Is this because the convergence criterion fires later, or because iterations take longer? The paper seemed to advertise the greedy selection as being computationally very cheap.
Iterations take longer. I used the same number of iterations for CD and GCD, with a low tolerance so as to stop only when the maximum number of iterations is reached.
The improvements are not so impressive overall (~1 s gain). The author released the source code in MATLAB (with the computationally expensive parts written in C); you could have a look to check for potential tricks. In the lower-right plot of the n_components=10 case (news20), the loss difference looks very bad; I wonder if it's not just a scale problem.
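For readers following along, the greedy selection under discussion works roughly like this (a pure-Python sketch of the idea from the GCD paper, not the benchmarked code; `_greedy_pick` is a hypothetical helper). The per-pick search over all coordinates is exactly the overhead that makes GCD iterations longer than cyclic CD ones:

```python
import numpy as np


def _greedy_pick(w_row, HHt, xht_row):
    """For one row of W, pick the coordinate whose projected Newton
    step decreases the objective the most."""
    grad = np.dot(HHt, w_row) - xht_row
    hess = np.maximum(np.diag(HHt), 1e-12)  # guard against zero diagonal
    step = np.maximum(w_row - grad / hess, 0.) - w_row
    # second-order estimate of the objective decrease for each coordinate
    decrease = -(grad * step + 0.5 * hess * step ** 2)
    return np.argmax(decrease)
```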
l2_reg = (1 - l1_ratio) * alpha

# L2 regularization corresponds to increasing the diagonal of HHt
HHt = np.dot(Ht.T, Ht) + np.eye(n_components) * l2_reg
I noticed that you used fast_dot in some places. You could potentially use it here too.
It was to be sure to have a C-contiguous array, but since I removed the check in Cython, I will change it.
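For context, the elastic-net split this hunk implements looks like the following (a sketch; how `l1_reg` enters the gradient is only hinted at here, and `np.dot` stands in for `fast_dot`):

```python
import numpy as np

rng = np.random.RandomState(0)
Ht = np.abs(rng.randn(8, 5))       # H.T: (n_features, n_components)
n_components = Ht.shape[1]
alpha, l1_ratio = 0.1, 0.5

l1_reg = alpha * l1_ratio          # enters the gradient directly
l2_reg = alpha * (1 - l1_ratio)    # enters the diagonal of HHt

HHt = np.dot(Ht.T, Ht) + np.eye(n_components) * l2_reg
```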
:class:`NMF` can also be initialized with random non-negative matrices, by
passing an integer seed or a ``RandomState`` to :attr:`init`.
:class:`NMF` can also be initialized with correctly scaled random non-negative
matrices by setting :attr:`init="random"`. An integer seed or a ``RandomState`` can also be passed to :attr:`random_state` to control reproducibility.
line too long
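As a quick usage example of the documented behavior (these are real `NMF` parameters):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.RandomState(0).randn(10, 6))
model = NMF(n_components=3, init='random', random_state=0)
W = model.fit_transform(X)   # (n_samples, n_components)
H = model.components_        # (n_components, n_features)
```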
That's it for me. Don't forget the what's new page. Nice job @TomDLT!
Sorry to be late on this, but could we address the scaling of alpha in another PR? I agree that scaling alpha could be useful, but I think it should probably be an option (default: False) for backward compatibility and consistency with the rest of scikit-learn. Also, the motivation for the way it is currently scaled is not completely obvious to me: my first intuition would have been to divide the loss term by n_samples x n_features. Although @TomDLT has some early experiments, it would be nice to take our time to compare different scalings.
Comments addressed, thanks. I also removed the alpha scaling from this PR, as suggested by @mblondel, and I will open a new PR for further discussion about it.
+1 for merge on my side.
+1 too.
🍻
Cool 🍻!
Sorry if I overlooked it, but did you commit / post your benchmark script using the "real" datasets? The current one only uses synthetic data, right?
I benchmarked it with "real" datasets, I think: 20 newsgroups, Olivetti faces, RCV1 and MNIST. I don't find the script anymore, though.
I think that would be great. Would you be interested in doing that? Otherwise, maybe open an issue?
I can do that.
See #5779.
This PR is a first part of what is discussed in #4811.
It includes:
- a refactoring of `_initialize_nmf()`
- regularization (with `alpha` and `l1_ratio`)

In some future PR, I will: …
Please tell me what you think :)
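Roughly, this is how the pieces fit together after this PR (a sketch with arbitrary parameter values; `solver`, `alpha` and `l1_ratio` are the parameters as introduced here, and were reworked in later scikit-learn versions):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.RandomState(42).randn(20, 10))
# 'cd' selects the new coordinate descent solver; alpha and l1_ratio
# control the amount and the L1/L2 mix of the regularization
model = NMF(n_components=4, solver='cd', alpha=0.1, l1_ratio=0.5,
            random_state=0)
W = model.fit_transform(X)
H = model.components_
```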