Skip to content

Fitting default LogisticRegression on large sparse csr np.float32 matrix fails with n_jobs > 1 #15924

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
peay opened this issue Dec 19, 2019 · 6 comments

Comments

@peay
Copy link
Contributor

peay commented Dec 19, 2019

Description

When fitting a LogisticRegression on a sparse csr matrix with dtype set to np.float32

<378019x13491 sparse matrix of type '<class 'numpy.float32'>'
    with 13405727 stored elements in Compressed Sparse Row format>

the call to fit fails, with the exception below, indicating we are trying to write to a read-only variable.

I have read #4597, and the issue seems similar here:

  • joblib thinks data matrix is large enough to be exposed as a read-only memory-mapped file
  • logistic.py runs the following check, which can lead to an implicit cast to np.float64:
        if solver == 'lbfgs':
            _dtype = np.float64
        else:
            _dtype = [np.float64, np.float32]

        X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
                         accept_large_sparse=solver != 'liblinear')

Here, check_X_y tries to convert X to np.float64 which apparently cannot be done.

Solutions appear to be either:

  • Pass in a matrix with np.float64 type
  • Use a different solver than lbfgs

which I guess is acceptable. I am not sure if check_X_y could also do the type conversion without having to write X (this seems to be done as part of sorting)>

The logistic regression API and documentation could also be improved to be more user-friendly:

  • The documentation for LogisticRegression does not mention that lbfgs only supports np.float64. I think using np.float32 is a relatively common use case to limit memory usage, but here, even when memory mapping is not used, it silently leads to a copy with a different type being made.

  • Should check_X_y check whether X is read-only before attempting to do a type conversion? This could fail with a clearer exception, if we are not able to do the copy without touching the original matrix.

Code

from sklearn.model_selection import GridSearchCV, GroupKFold, StratifiedKFold
from sklearn.metrics import accuracy_score, balanced_accuracy_score, make_scorer
from sklearn.linear_model import LogisticRegression

parameters = {"C": np.logspace(-4, 1, 6), "penalty": ["l2"], "max_iter": [800]}

clf = LogisticRegression(verbose=3, class_weight="balanced")
cv = StratifiedKFold(3)
grid_cv = GridSearchCV(
    clf,
    parameters,
    refit=False,
    cv=cv,
    scoring=make_scorer(balanced_accuracy_score),
    n_jobs=20,
)

clf_ovr = OneVsRestClassifier(grid_cv, n_jobs=20)
clf_ovr.fit(X_np32, y)

Expected Results

More user-friendly exception/warning

Actual Results

Traceback (most recent call last):
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 600, in __call__
    return self.func(*args, **kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 715, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 1532, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 486, in check_array
    accept_large_sparse=accept_large_sparse)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 309, in _ensure_sparse_format
    spmatrix = spmatrix.astype(dtype)
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 71, in astype
    self._deduped_data().astype(dtype, casting=casting, copy=copy),
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 34, in _deduped_data
    self.sum_duplicates()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1013, in sum_duplicates
    self.sort_indices()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1059, in sort_indices
    self.indices, self.data)
ValueError: UPDATEIFCOPY base is read-only

Versions

ystem:
python: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
executable: /opt/venv/python3/bin/python
machine: Linux-4.14.146-93.123.amzn1.x86_64-x86_64-with-glibc2.2.5

Python deps:
pip: 19.3.1
setuptools: 41.4.0
sklearn: 0.21.3
numpy: 1.13.1
scipy: 1.1.0
Cython: 0.26.1
pandas: 0.23.4

@glemaitre
Copy link
Member

Thanks for reporting a clear investigation. I think that you are right regarding the conversion mechanism, basically blocked by joblib since X is a read-only memmap to avoid making copy for each worker.

The only way to avoid this issue would be to be able to convert into 64 bits X before to memmap it, which would require to introspect the different check-array of the algorithm used. It seems already complicated :)

ping @ogrisel who will have a much better picture of the internal and who can think of an easy way to solve the issue?

The documentation for LogisticRegression does not mention that lbfgs only supports np.float64. I think using np.float32 is a relatively common use case to limit memory usage, but here, even when memory mapping is not used, it silently leads to a copy with a different type being made.

We are relying on the lbfgs implementation in scipy which is a Fortran code accepting only double. In an ideal world, if somebody skilled in Fortran could make a 32-bits version of the algorithm. But it might not be as easy as it looks :)

@glemaitre glemaitre added the Bug label Dec 19, 2019
@peay
Copy link
Contributor Author

peay commented Dec 19, 2019

We are relying on the lbfgs implementation in scipy which is a Fortran code accepting only double. In an ideal world, if somebody skilled in Fortran could make a 32-bits version of the algorithm. But it might not be as easy as it looks :)

Sounds reasonable. However, I wonder if check_array should issue a warning when doing such conversions?

After a bit more investigation, I ran into #6614, which apparently points to a bug in scipy (which makes sense, as astype with copy shouldn't need to overwrite the existing data): scipy/scipy#8679. The fix has not been merged, though. If this is a known bug, maybe _ensure_sparse could check if X is read-only as we know that astype will fail then?

@glemaitre
Copy link
Member

Sounds reasonable. However, I wonder if check_array should issue a warning when doing such conversions?

We previously had a DataConversionWarning but it was decided to remove it #13306 #13324

@peay
Copy link
Contributor Author

peay commented Dec 19, 2019

OK, then I guess we can close as duplicate of #6614, thanks!

@peay peay closed this as completed Dec 19, 2019
@glemaitre glemaitre reopened this Dec 19, 2019
@glemaitre
Copy link
Member

glemaitre commented Dec 19, 2019

I reopen since your script in the issue is correct and it should be good to discuss the underlying problem.

@glemaitre
Copy link
Member

I am closing this issue since it should be solved by installing the future SciPy release since scipy/scipy#18192 has been merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants