Skip to content

Fitting default LogisticRegression on large sparse csr np.float32 matrix fails with n_jobs > 1 #15924

Closed
@peay

Description

@peay

Description

When fitting a LogisticRegression on a sparse csr matrix with dtype set to np.float32

<378019x13491 sparse matrix of type '<class 'numpy.float32'>'
    with 13405727 stored elements in Compressed Sparse Row format>

the call to fit fails, with the exception below, indicating we are trying to write to a read-only variable.

I have read #4597, and the issue seems similar here:

  • joblib thinks data matrix is large enough to be exposed as a read-only memory-mapped file
  • logistic.py runs the following check, which can lead to an implicit cast to np.float64:
        if solver == 'lbfgs':
            _dtype = np.float64
        else:
            _dtype = [np.float64, np.float32]

        X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
                         accept_large_sparse=solver != 'liblinear')

Here, check_X_y tries to convert X to np.float64 which apparently cannot be done.

Solutions appear to be either:

  • Pass in a matrix with np.float64 type
  • Use a different solver than lbfgs

which I guess is acceptable. I am not sure if check_X_y could also do the type conversion without having to write X (this seems to be done as part of sorting)>

The logistic regression API and documentation could also be improved to be more user-friendly:

  • The documentation for LogisticRegression does not mention that lbfgs only supports np.float64. I think using np.float32 is a relatively common use case to limit memory usage, but here, even when memory mapping is not used, it silently leads to a copy with a different type being made.

  • Should check_X_y check whether X is read-only before attempting to do a type conversion? This could fail with a clearer exception, if we are not able to do the copy without touching the original matrix.

Code

from sklearn.model_selection import GridSearchCV, GroupKFold, StratifiedKFold
from sklearn.metrics import accuracy_score, balanced_accuracy_score, make_scorer
from sklearn.linear_model import LogisticRegression

parameters = {"C": np.logspace(-4, 1, 6), "penalty": ["l2"], "max_iter": [800]}

clf = LogisticRegression(verbose=3, class_weight="balanced")
cv = StratifiedKFold(3)
grid_cv = GridSearchCV(
    clf,
    parameters,
    refit=False,
    cv=cv,
    scoring=make_scorer(balanced_accuracy_score),
    n_jobs=20,
)

clf_ovr = OneVsRestClassifier(grid_cv, n_jobs=20)
clf_ovr.fit(X_np32, y)

Expected Results

More user-friendly exception/warning

Actual Results

Traceback (most recent call last):
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 600, in __call__
    return self.func(*args, **kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 715, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 1532, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 486, in check_array
    accept_large_sparse=accept_large_sparse)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 309, in _ensure_sparse_format
    spmatrix = spmatrix.astype(dtype)
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 71, in astype
    self._deduped_data().astype(dtype, casting=casting, copy=copy),
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 34, in _deduped_data
    self.sum_duplicates()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1013, in sum_duplicates
    self.sort_indices()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1059, in sort_indices
    self.indices, self.data)
ValueError: UPDATEIFCOPY base is read-only

Versions

ystem:
python: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
executable: /opt/venv/python3/bin/python
machine: Linux-4.14.146-93.123.amzn1.x86_64-x86_64-with-glibc2.2.5

Python deps:
pip: 19.3.1
setuptools: 41.4.0
sklearn: 0.21.3
numpy: 1.13.1
scipy: 1.1.0
Cython: 0.26.1
pandas: 0.23.4

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions