Fix for ridge regression with sparse matrix input #2552
Conversation
Coverage remained the same when pulling 8facf356b1d668af863559538bf902ab46d98998 on bdkearns:fix-ridge-with-sparse into 101f0ee on scikit-learn:master.
Re-enable selecting SVD for sparse input (even though it is converted to dense, one day sparse solving may be supported, and there are currently valid use cases where sparse input is typical and doesn't blow up the SVD, but does blow up the eigen solver). (Fixes #2354)
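For context, a minimal repro sketch of the regression being fixed; the shapes, density, and alphas below are illustrative, not taken from the original report:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import RidgeCV

# Tall-and-thin sparse design matrix: n_samples >> n_features.
X = sp.rand(100000, 100, density=0.01, format='csr')
y = np.random.randn(100000)

# Under 0.14, sparse X is hard-coded to gcv_mode='eigen', which builds a
# dense (n_samples, n_samples) Gram matrix (~80 GB here) and raises
# MemoryError; under 0.13 the same input went through the SVD path and fit.
RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
```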
I'd rather have sparse matrix SVD support directly, then, using either …
Ping @agramfort, @mblondel.
Sure, sparse SVD would be nice to have, but don't you consider it a regression that perfectly working usages under 0.13 now not only raise MemoryError because they get forced into the eigen solver, but are actually forbidden from working even when manually specifying the svd solver? And for what gain? Why was this usage suddenly banned in 0.14? IMO the advice for the other person getting a MemoryError with a nearly 100k x 100k square matrix (which tests the limits of either solver, really, and just happened to trigger on SVD) would be to simply specify the solver by hand, not to suddenly force every sparse input into the eigen solver. Seems a bit knee-jerk. Now someone with, say, 100k samples x 100 features (which only needs a 100x100 economy SVD, right?) who happens to have a sparse matrix input (say, coming from a textual feature problem) gets forced into eigen (when they used to get svd) and gets a MemoryError, and if they happen to deduce why and try to change back to svd, they find it banned? I guess they can work around it by converting the input to dense beforehand, but that seems an odd thing to force users to do.
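(For the 100k x 100 case mentioned above, the economy SVD indeed never materializes anything larger than the input itself; a quick NumPy illustration with those shapes:)

```python
import numpy as np

X = np.random.randn(100000, 100)                  # n_samples >> n_features
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # "economy" SVD
print(U.shape, s.shape, Vt.shape)                 # (100000, 100) (100,) (100, 100)
```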
```diff
@@ -645,8 +645,8 @@ def _values(self, alpha, y, v, Q, QT_y):
         return y - (c / G_diag), c

     def _pre_compute_svd(self, X, y):
-        if sparse.issparse(X):
-            raise TypeError("SVD not supported for sparse matrices")
```
I'm -1 on this PR because densifying a sparse matrix behind the scenes is really not a good idea. I'd rather improve the error message to tell users to call X = X.toarray() in their script.
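(The suggested workaround, sketched; this assumes X and y are already defined, and that X fits in memory once densified:)

```python
from scipy import sparse
from sklearn.linear_model import RidgeCV

# Densify explicitly in user code, so the memory cost is visible,
# rather than having the estimator do it behind the scenes.
if sparse.issparse(X):
    X = X.toarray()

RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
```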
Well, if that's the case, then why only disable sparse inputs for the SVD solver? The eigen solver does safe_sparse_dot(dense_output=True), creating an MxM dense matrix possibly even larger than the sparse input. And you guys changed it so sparse inputs previously fed to the SVD solver (which worked fine when densifying) are now fed to eigen without warning, even when n_samples >> n_features, and give MemoryErrors in these cases. Consider text problems: the inputs lend themselves to sparse representation (users probably won't think to densify first, even if it is possible), and n_samples >> n_features, so eigen will explode by default now, even though densify+SVD would have worked, and did work in the past. See issue #2354 for an example that creates a memory error under 0.14 but was fine under 0.13.
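(Back-of-the-envelope numbers for the asymmetry described here, with illustrative shapes:)

```python
# Dense float64 footprints for a tall-and-thin problem.
n_samples, n_features = 100000, 100
bytes_per_float = 8

gram_gb = n_samples ** 2 * bytes_per_float / 1e9             # eigen: dense X @ X.T
dense_X_gb = n_samples * n_features * bytes_per_float / 1e9  # svd: densified X
print(gram_gb, dense_X_gb)  # 80.0 GB vs 0.08 GB
```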
Yes, but the resulting output has shape [n_samples, n_samples] when dense NumPy arrays are passed too. So densifying the output matrix doesn't change anything here.
Indeed that's bad. Maybe we should disable this behavior then.
I'm not sure if we can use a truncated SVD here but it's worth a try.
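(SciPy does expose a truncated sparse SVD; a minimal sketch, where the cut-off k is a hypothetical choice a real implementation would have to make, and which caps how many of the min(n_samples, n_features) singular triplets a GCV computation could use:)

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

X = sp.rand(100000, 100, density=0.01, format='csr')

# Computes the k largest singular triplets without densifying X.
U, s, Vt = svds(X, k=50)
print(U.shape, s.shape, Vt.shape)  # (100000, 50) (50,) (50, 100)
```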
Yes, but with dense matrices, anything with n_samples > n_features is sent to SVD, not eigen, and produces an n_features x n_features array instead. Which silently changed for sparse inputs, while banning them from SVD? One can construct sparse matrices that blow up either solver, as they both convert to dense. IMO there are two paths: allow sparse inputs, obviously documenting that both solvers use dense computations, or flat-out ban them from both. This "oh, one user blew up one solver with this shape of problem, let's ban sparse from that one and send all problems to the other" is careless.
Also, one might consider that other parts of sklearn, like feature_extraction.text.CountVectorizer, produce sparse matrices. And users probably expect to be able to feed these directly into linear models like these, as in the examples. And these sparse matrices are typically of the sort with large n_samp and small n_feat. Changing the model here to ban sparse inputs, or to force them into a solver that produces huge arrays (n_samp x n_samp, even when n_samp >> n_feat) when it didn't before, seems odd. You guys are producing a library here; one would think API consistency would be valued, and that typical usage patterns/examples would be evaluated before making changes like this that break consistency.
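(The pipeline described here, sketched with a toy corpus; the documents and targets are made up for illustration:)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeCV

docs = ["spam spam spam", "ham and eggs", "more spam and ham"]
y = [1.0, 0.0, 1.0]

X = CountVectorizer().fit_transform(docs)   # scipy.sparse matrix, as in the examples
RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)  # under 0.14, sparse X is forced into the eigen path
```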
And this breakage was for one user who was getting a MemoryError? And what part of the change actually "fixed" their problem? Hard-coding all sparse arrays to eigen. It wasn't even the removal of the densify in SVD. And now all users with the problems described above get MemoryErrors! What a 'fix'. Anyways, I'm surprised anyone would support hard-coding all sparse inputs to one solver (or, equally, be -1 against reverting it).
What line creates a (n_features, n_features) matrix?
```diff
@@ -701,7 +701,7 @@ def fit(self, X, y, sample_weight=1.0):
         with_sw = len(np.shape(sample_weight))

         if gcv_mode is None or gcv_mode == 'auto':
-            if sparse.issparse(X) or n_features > n_samples or with_sw:
+            if n_features > n_samples or with_sw:
                 gcv_mode = 'eigen'
```
Here, sparse matrices with n_samp >> n_feat (say, typical of the type produced by text feature problems) are sent to eigen, where a dense n_samp x n_samp matrix is produced (X dot X.T), whereas dense matrices of this shape would have been sent to SVD (and in the past sparse ones were also).
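(The computation in question, using the same utility the eigen pre-computation calls; shapes shrunk here so the snippet actually runs:)

```python
import scipy.sparse as sp
from sklearn.utils.extmath import safe_sparse_dot

X = sp.rand(5000, 50, density=0.01, format='csr')

# The eigen path forms the Gram matrix densely -- quadratic in
# n_samples no matter how sparse X is.
K = safe_sparse_dot(X, X.T, dense_output=True)
print(K.shape)  # (5000, 5000)
```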
Please don't edit your posts. Otherwise people who read the discussion later can't understand why I asked.
Simple typo, don't forget you're free to edit as well.
Hum, was there any conclusion here?
Silently densifying a sparse matrix is a bad idea and we've always tried to avoid it. The real fix would be to support sparse matrices directly (so this should be labeled as "new feature" rather than "bug"). We can try to compute a sparse SVD.

I suggest we close this PR. I agree with the problem but not with the solution.
I only skimmed it, but it seemed like a regression; if you say it is not, then I'll take your word for it ;)
The associated issue was resolved in #13350, closing.