Multiclass prediction using SGD on sparse data is unnecessarily slow. It seems to be copying large arrays on every call to `predict`, `predict_proba`, etc., which kills its performance.

I think the main culprit is `safe_sparse_dot`, which uses scipy's "CSR * dense" routine:
```python
def safe_sparse_dot(a, b, dense_output=False):
    if issparse(a) or issparse(b):
        ret = a * b  # <== this line here: `a` is CSR, `b` is dense clf.coef_.T
        if dense_output and hasattr(ret, "toarray"):
            ...
```
Because `b` is transposed and scipy's multiplication invokes `b.ravel()`, this is very slow (it copies `clf.coef_` internally).
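For illustration, here is a minimal standalone check of the contiguity issue; the shapes and the use of `scipy.sparse.random` are my own, not from the report:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
# A sparse input matrix, like X in predict(): 1000 samples x 5000 features
X = sp.random(1000, 5000, density=0.001, format="csr", random_state=rng)
# A dense coefficient matrix, like clf.coef_: (n_classes, n_features)
coef = rng.rand(20, 5000)

# The transpose is only a view with Fortran-ordered memory, so scipy's
# "CSR * dense" path cannot use it directly without ravel()/copying:
assert not coef.T.flags["C_CONTIGUOUS"]

# Making a C-contiguous copy once avoids that repeated copy on every call:
coef_T = np.ascontiguousarray(coef.T)
assert coef_T.flags["C_CONTIGUOUS"]

# Both variants give the same result (`*` is matrix multiply for CSR):
assert np.allclose(X * coef.T, X * coef_T)
```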
Keeping `clf.coef_.T` as a C-contiguous array here improved the prediction performance of SGD classifier 7300x for us (1 s vs 137 µs per call):
```python
def decision_function(self, X):
    ...
    # before:
    # scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
    # after quick fix: self.coef_T = np.ascontiguousarray(self.coef_.T)
    scores = safe_sparse_dot(X, self.coef_T, dense_output=True) + self.intercept_
    ...
```
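As a rough way to observe the effect, one can time the two variants directly; the sizes below are arbitrary choices of mine, and the magnitude of the gap will depend on the scipy version and on the `coef_` shape:

```python
import numpy as np
import scipy.sparse as sp
from timeit import timeit

rng = np.random.RandomState(0)
# One sparse "sample" against a large multiclass coef_, mimicking
# repeated predict() calls: (1, n_features) x (n_features, n_classes)
X = sp.random(1, 20000, density=0.001, format="csr", random_state=rng)
coef = rng.rand(100, 20000)            # shape (n_classes, n_features)
coef_T = np.ascontiguousarray(coef.T)  # the one-time contiguous copy

t_view = timeit(lambda: X * coef.T, number=50)    # non-contiguous operand
t_contig = timeit(lambda: X * coef_T, number=50)  # precomputed contiguous copy
print("view: %.4fs  contiguous: %.4fs" % (t_view, t_contig))
```

On affected scipy versions the first variant pays the `ravel()` copy on every multiplication, while the second pays it only once, up front.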
The exact speedup numbers will vary with the size of `coef_` (roughly the number of SGD classes and features).

This could be raised as an issue in scipy as well (I see no good reason for such inefficiency; that `ravel()` is just too generous), but since the fix seems trivial, maybe it's worth addressing on the sklearn side as well?

This is using scipy 0.16.1 and sklearn 0.17.