Description
When fitting a `LogisticRegression` on a sparse CSR matrix with dtype set to `np.float32`,

```
<378019x13491 sparse matrix of type '<class 'numpy.float32'>'
        with 13405727 stored elements in Compressed Sparse Row format>
```

the call to `fit` fails with the exception below, indicating that we are trying to write to a read-only variable.
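For context, here is a quick standalone sketch (arbitrary sizes, not my actual data) of why the data ends up read-only in the first place: with joblib's default process-based backend, inputs larger than the `max_nbytes` threshold (1 MB by default) are memory-mapped and exposed to the workers as read-only arrays.

```python
import numpy as np
from joblib import Parallel, delayed

def buffer_is_writeable(a):
    # Inside the worker, large inputs arrive as read-only numpy.memmap objects.
    return a.flags.writeable

big = np.zeros(int(2e6), dtype=np.float32)   # ~8 MB, above joblib's default 1 MB threshold
small = np.zeros(10, dtype=np.float32)       # below the threshold, pickled normally

print(Parallel(n_jobs=2)(
    delayed(buffer_is_writeable)(a) for a in (big, small)
))  # expected: [False, True]
```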
I have read #4597, and the issue seems similar here: joblib thinks the data matrix is large enough to be exposed as a read-only memory-mapped file. `logistic.py` runs the following check, which can lead to an implicit cast to `np.float64`:
```python
if solver == 'lbfgs':
    _dtype = np.float64
else:
    _dtype = [np.float64, np.float32]

X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
                 accept_large_sparse=solver != 'liblinear')
```
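For illustration, a small standalone sketch (synthetic shapes, not my real data) of the upcast this branch triggers on `np.float32` sparse input while the data is still writable:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_X_y

rng = np.random.RandomState(0)
X32 = sp.random(100, 20, density=0.1, format="csr", dtype=np.float32, random_state=rng)
y = rng.randint(0, 2, size=100)

# dtype=np.float64 (the lbfgs branch above) silently upcasts and copies the data
X64, y64 = check_X_y(X32, y, accept_sparse="csr", dtype=np.float64, order="C")
print(X32.dtype, "->", X64.dtype)  # float32 -> float64
```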
Here, `check_X_y` tries to convert `X` to `np.float64`, which apparently cannot be done. Solutions appear to be either:

- pass in a matrix with `np.float64` dtype, or
- use a different solver than `lbfgs`,

which I guess is acceptable (both are sketched below). I am not sure whether `check_X_y` could also do the type conversion without having to write to `X` (the write seems to happen as part of sorting the indices).
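A sketch of both workarounds, on synthetic stand-ins for the real data (the names `X_np32` and `y` mirror the code section further down):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the real data (assumed shapes, for illustration only).
rng = np.random.RandomState(0)
X_np32 = sp.random(500, 40, density=0.1, format="csr", dtype=np.float32, random_state=rng)
y = rng.randint(0, 2, size=500)

# Workaround 1: cast to np.float64 once, up front, while the data is still
# writable; the lbfgs validation then finds the dtype it requires and the
# read-only memory-mapped copy in the joblib workers never needs rewriting.
X_np64 = X_np32.astype(np.float64)
LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_np64, y)

# Workaround 2: use a solver whose validation accepts [np.float64, np.float32]
# (any solver other than lbfgs, per the check quoted above), so the float32
# matrix is used as-is and no conversion is attempted.
LogisticRegression(solver="saga", max_iter=1000).fit(X_np32, y)
```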
The logistic regression API and documentation could also be improved to be more user-friendly:

- The documentation for `LogisticRegression` does not mention that `lbfgs` only supports `np.float64`. I think using `np.float32` is a relatively common use case to limit memory usage, but here, even when memory mapping is not used, it silently leads to a copy with a different dtype being made.
- Should `check_X_y` check whether `X` is read-only before attempting a type conversion? This could fail with a clearer exception if we are not able to do the copy without touching the original matrix (a rough sketch follows below).
Code
```python
import numpy as np

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# X_np32 is the np.float32 CSR matrix shown above; y holds the target labels.
parameters = {"C": np.logspace(-4, 1, 6), "penalty": ["l2"], "max_iter": [800]}
clf = LogisticRegression(verbose=3, class_weight="balanced")
cv = StratifiedKFold(3)
grid_cv = GridSearchCV(
    clf,
    parameters,
    refit=False,
    cv=cv,
    scoring=make_scorer(balanced_accuracy_score),
    n_jobs=20,
)
clf_ovr = OneVsRestClassifier(grid_cv, n_jobs=20)
clf_ovr.fit(X_np32, y)
```
Expected Results
More user-friendly exception/warning
Actual Results
```
Traceback (most recent call last):
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 600, in __call__
    return self.func(*args, **kwargs)
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 715, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 1532, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 486, in check_array
    accept_large_sparse=accept_large_sparse)
  File "/opt/venv/python3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 309, in _ensure_sparse_format
    spmatrix = spmatrix.astype(dtype)
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 71, in astype
    self._deduped_data().astype(dtype, casting=casting, copy=copy),
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/data.py", line 34, in _deduped_data
    self.sum_duplicates()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1013, in sum_duplicates
    self.sort_indices()
  File "/opt/venv/python3/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 1059, in sort_indices
    self.indices, self.data)
ValueError: UPDATEIFCOPY base is read-only
```
Versions
```
System:
    python: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
executable: /opt/venv/python3/bin/python
   machine: Linux-4.14.146-93.123.amzn1.x86_64-x86_64-with-glibc2.2.5

Python deps:
       pip: 19.3.1
setuptools: 41.4.0
   sklearn: 0.21.3
     numpy: 1.13.1
     scipy: 1.1.0
    Cython: 0.26.1
    pandas: 0.23.4
```