
Fix for ridge regression with sparse matrix input #2552


Closed · wants to merge 1 commit into scikit-learn:master from bdkearns:fix-ridge-with-sparse

Conversation

bdkearns (Contributor)

Fix for ridge regression with sparse matrix input

  • Re-enable selecting SVD for sparse input (even though it is converted to
    dense, one day sparse solving may be supported, and there are current
    valid use cases where sparse input is typical and doesn't blow up the
    SVD, but does blow up the eigen solver)

(Fixes #2354)

@coveralls

Coverage Status

Coverage remained the same when pulling 8facf356b1d668af863559538bf902ab46d98998 on bdkearns:fix-ridge-with-sparse into 101f0ee on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling c7529c7 on bdkearns:fix-ridge-with-sparse into 101f0ee on scikit-learn:master.

@larsmans (Member)

I'd rather have sparse matrix SVD support directly, then, using either scipy.sparse.linalg.svds or sklearn.utils.extmath.randomized_svd. (The latter is very fast and stable.)
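
A minimal sketch of the two routines mentioned above (not part of this PR; the input shape and number of components are arbitrary example values):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import randomized_svd

# Toy sparse input; shape and density are illustrative only.
X = sp.random(10000, 300, density=0.01, format='csr', random_state=0)

# ARPACK-based truncated SVD; operates on the sparse matrix directly.
U, s, Vt = svds(X, k=50)

# Randomized truncated SVD; also accepts sparse input directly.
U2, s2, Vt2 = randomized_svd(X, n_components=50, random_state=0)
```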

@larsmans (Member)

Ping @agramfort, @mblondel.

@bdkearns (Contributor, Author)

Sure, sparse SVD would be nice to have, but don't you consider it a regression that usages that worked perfectly under 0.13 not only raise MemoryError now (because they are forced into the eigen solver), but are actually forbidden from working even when the svd solver is specified manually? And for what gain? Why was this usage suddenly banned in 0.14? IMO the advice for the other user, who got a MemoryError with a nearly 100k x 100k square matrix (which tests the limits of either solver and just happened to trigger on SVD), would be to simply specify the solver by hand, not to suddenly force every sparse input into the eigen solver. That seems a bit knee-jerk.

Now someone with, say, 100k samples x 100 features (which only needs a 100x100 economy SVD, right?) who happens to have sparse input (say, coming from a text feature problem) is forced into eigen (where they used to get svd) and hits a MemoryError; and if they happen to deduce why and try to switch back to svd, they find it banned. I guess they can work around it by converting the input to dense beforehand, but that seems an odd thing to force users to do.
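
For concreteness, a back-of-envelope sketch of the memory the two paths need for that shape, assuming float64 (illustrative arithmetic only, based on the sizes discussed in this thread):

```python
n_samples, n_features = 100_000, 100

# eigen path: densifies the Gram matrix X @ X.T -> (n_samples, n_samples)
eigen_gram_gb = n_samples * n_samples * 8 / 1e9
print(eigen_gram_gb)    # 80.0 GB -> MemoryError on most machines

# svd path: economy SVD of the densified X needs roughly the
# (n_samples, n_features) array plus (n_features, n_features) factors
svd_mb = (n_samples * n_features + n_features * n_features) * 8 / 1e6
print(svd_mb)           # ~80 MB
```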

@@ -645,8 +645,8 @@ def _values(self, alpha, y, v, Q, QT_y):
         return y - (c / G_diag), c

     def _pre_compute_svd(self, X, y):
-        if sparse.issparse(X):
-            raise TypeError("SVD not supported for sparse matrices")
A Member commented on the diff above:

I'm -1 on this PR because densifying a sparse matrix behind the scenes is really not a good idea. I'd rather improve the error message to tell users to call X = X.toarray() in their script.
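
A minimal sketch of that workaround with placeholder data (only sensible when the dense array actually fits in memory):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import RidgeCV

# Placeholder sparse input and targets for illustration.
X_sparse = sp.random(1000, 20, density=0.05, format='csr', random_state=0)
y = np.random.RandomState(0).randn(1000)

# The user densifies explicitly, instead of the library doing it silently.
X_dense = X_sparse.toarray()
RidgeCV(gcv_mode='svd').fit(X_dense, y)
```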

@bdkearns (Contributor, Author)

Well, if that's the case, then why disable sparse inputs only for the SVD solver? The eigen solver does safe_sparse_dot(dense_output=True), creating an n_samples x n_samples dense matrix that can be even larger than the sparse input. And you changed it so sparse inputs previously fed to the SVD solver (which worked fine when densified) are now fed to eigen without warning, even when n_samples >> n_features, and give MemoryErrors in those cases. Consider text problems: the inputs lend themselves to sparse representation (users probably won't think to densify first, even when it is possible), and n_samples >> n_features, so eigen will now explode by default, even though densify+SVD would have worked, and did work, in the past.

See issue #2354 for an example that creates a memory error under 0.14, was fine under 0.13.

@mblondel (Member)

> The eigen solver does safe_sparse_dot(dense_output=True)

Yes but the resulting output has shape [n_samples, n_samples] when dense NumPy arrays are passed too. So densifying the output matrix doesn't change anything here.

> are now fed to eigen without warning, even when n_samples >> n_features, and give MemoryErrors in those cases

Indeed that's bad. Maybe we should disable this behavior then.

@mblondel (Member)

> I'd rather have sparse matrix SVD support directly, then, using either scipy.sparse.linalg.svds or sklearn.utils.extmath.randomized_svd. (The latter is very fast and stable.)

I'm not sure if we can use a truncated SVD here but it's worth a try.

@bdkearns (Contributor, Author)

> Yes but the resulting output has shape [n_samples, n_samples] when dense NumPy arrays are passed too. So densifying the output matrix doesn't change anything here.

Yes, but with dense matrices, anything with n_samples > n_features is sent to SVD, not eigen, which produces an n_features x n_features array instead. So this was silently changed for sparse input, while also banning it from SVD?

One can construct sparse matrices that blow up either solver, since they both convert to dense. IMO there are two paths: allow sparse inputs, clearly documenting that both solvers use dense computations, or flatly ban them from both. This "one user blew up one solver with this shape of problem, so let's ban sparse from that one and send every problem to the other" approach is careless.

@bdkearns (Contributor, Author)

Also, one might consider that other parts of sklearn, like feature_extraction.text.CountVectorizer, produce sparse matrices. Users probably expect to be able to feed these directly into linear models like this one, as in the examples. And these sparse matrices are typically of the sort with large n_samp and small n_feat. Changing the model here to ban sparse input, or to force it into a solver that produces huge arrays (n_samp x n_samp even when n_samp >> n_feat) when it didn't before, seems odd. You are producing a library here; one would think API consistency would be valued, and that typical usage patterns/examples would be evaluated before making changes like this that break consistency.
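
A sketch of the usage pattern being described, with hypothetical data (max_features keeps n_feat small, matching the large-n_samp / small-n_feat case discussed here):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Placeholder corpus and targets for illustration.
docs = ["sparse text input", "more text", "even more sparse text"] * 1000
y = np.random.RandomState(0).randn(len(docs))

# CountVectorizer returns a scipy.sparse matrix; users feed it straight in.
X = CountVectorizer(max_features=100).fit_transform(docs)
Ridge().fit(X, y)
```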

@bdkearns (Contributor, Author)

And this breakage was for one user who was getting a MemoryError? And what part of the change actually "fixed" their problem? Hard-coding all sparse arrays to eigen; it wasn't even the removal of the densify in SVD. And now all users with the problems described above get MemoryErrors. What a "fix". Anyway, I'm surprised anyone would support hard-coding all sparse inputs to one solver (or, equally, be -1 against reverting it).

@bdkearns bdkearns closed this Oct 29, 2013
@bdkearns bdkearns reopened this Oct 29, 2013

@mblondel (Member)

What line creates a (n_features, n_features) matrix?

@@ -701,7 +701,7 @@ def fit(self, X, y, sample_weight=1.0):
         with_sw = len(np.shape(sample_weight))

         if gcv_mode is None or gcv_mode == 'auto':
-            if sparse.issparse(X) or n_features > n_samples or with_sw:
+            if n_features > n_samples or with_sw:
                 gcv_mode = 'eigen'
@bdkearns (Contributor, Author) commented on the diff above:

Here, sparse matrices with n_samp >> n_feat (typical of the type produced by text feature problems) are sent to eigen, where a dense n_samp x n_samp matrix is produced (X dot X.T), whereas dense matrices of this shape would be sent to SVD (and in the past sparse ones were too).
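
A paraphrase of that 'auto' dispatch as a standalone sketch (hypothetical helper, not the actual library code), annotated with the memory consequence described in this thread:

```python
import scipy.sparse as sp

def auto_gcv_mode(X, n_samples, n_features, with_sw):
    # 0.14-era behavior: any sparse input is routed to eigen.
    if sp.issparse(X) or n_features > n_samples or with_sw:
        return 'eigen'   # forms X @ X.T -> dense (n_samples, n_samples)
    return 'svd'         # economy factors of size n_features (per the discussion above)

# A sparse matrix with n_samp >> n_feat still lands on 'eigen':
X = sp.random(100_000, 100, density=0.01, format='csr', random_state=0)
print(auto_gcv_mode(X, *X.shape, with_sw=False))   # -> 'eigen'
```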

@mblondel (Member)

> Changing the model here to ban sparse or force them into a solver that produces huge arrays (n_feat x n_feat) when it didn't before seems odd

Please don't edit your posts. Otherwise, people who read the discussion later can't understand why I asked:

> What line creates a (n_features, n_features) matrix?

@bdkearns (Contributor, Author)

It was a simple typo; don't forget you're free to edit as well.

@amueller (Member)

Hmm, was there any conclusion here?

@mblondel (Member)

Silently densifying a sparse matrix is a bad idea and we've always tried to avoid it. The real fix would be to support sparse matrices directly (so this should be labeled as "new feature" rather than "bug"). We can try to compute a sparse SVD with svds or sklearn.utils.extmath.randomized_svd, but they only compute the first k singular vectors, so the results won't be the same. I would be curious to know whether this can work well (for sufficiently large k).

I suggest we close this PR. I agree with the problem but not with the solution.

@amueller (Member)

I only skimmed it, and it seemed like a regression, but if you say it is not then I'll take your word for it ;)

@amueller amueller removed the Bug label Mar 3, 2015
@amueller amueller removed this from the 0.16 milestone Mar 3, 2015
@rth (Member)

rth commented Jun 14, 2019

The associated issue was resolved in #13350, closing.

@rth rth closed this Jun 14, 2019
Linked issue: Changes in ridge.py break usage with sparse matrices (#2354)