Description
With NEP-18, NumPy functions that previously converted an array-like to an ndarray may no longer do the (implicit) conversion. dask.array recently implemented __array_function__, so np.unique(dask.array.Array) now returns a dask.array.Array.
More details are in dask/dask-ml#541.
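To make the change in behaviour concrete without needing Dask, here is a minimal sketch using two hypothetical array-like classes (CoercedOnly and Dispatching are illustrative names, not part of NumPy or Dask). It assumes NumPy>=1.17, where __array_function__ dispatch is enabled by default.

```python
import numpy as np


class CoercedOnly:
    """Hypothetical array-like that only supports coercion via __array__."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        return self.data if dtype is None else self.data.astype(dtype)


class Dispatching(CoercedOnly):
    """Hypothetical array-like that also opts in to NEP-18 dispatch."""

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap our own instances, run the NumPy function, then re-wrap.
        unwrapped = tuple(
            a.data if isinstance(a, Dispatching) else a for a in args
        )
        return Dispatching(func(*unwrapped, **kwargs))


# Without __array_function__, np.unique coerces its input to an ndarray.
coerced = np.unique(CoercedOnly([3, 1, 3, 2]))
# With __array_function__, np.unique hands control to the array-like,
# which returns its own type instead of an ndarray.
dispatched = np.unique(Dispatching([3, 1, 3, 2]))

print(type(coerced).__name__, type(dispatched).__name__)
```

Dask's real implementation is more involved (it builds a lazy graph), but the dispatch path is the same idea: the caller no longer gets an ndarray back.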
Steps/Code to Reproduce
import dask.array as da
import dask_ml.datasets
import sklearn.linear_model
X, y = dask_ml.datasets.make_classification(chunks=50)
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)
Expected Results
No error, and the same output as clf.fit(X.compute(), y.compute()) or as running with the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION='0' set.
Actual Results
That raises
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)
~/Envs/dask-dev/lib/python3.7/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
1536
1537 multi_class = _check_multi_class(self.multi_class, solver,
-> 1538 len(self.classes_))
1539
1540 if solver == 'liblinear':
TypeError: 'float' object cannot be interpreted as an integer
This is because self.classes_ = np.unique(y) is a Dask Array with unknown length:
In [2]: np.unique(da.arange(12))
Out[2]: dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
since Dask is lazy and doesn't know the unique elements until compute time.
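The exact wording of the TypeError can be reproduced in plain Python. This is only a sketch under the assumption that the lazy array's __len__ returns the first entry of its shape, which is nan here; UnknownLength is a hypothetical stand-in, not a Dask class.

```python
class UnknownLength:
    """Hypothetical stand-in for a lazy array whose first dimension is unknown."""

    shape = (float("nan"),)

    def __len__(self):
        # Assumption: the unknown size is reported as a float nan.
        return self.shape[0]


message = None
try:
    len(UnknownLength())
except TypeError as exc:
    # len() requires __len__ to return an integer, so a float nan is rejected.
    message = str(exc)

print(message)
```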
Versions
System:
python: 3.7.3 (default, Apr 5 2019, 14:56:38) [Clang 10.0.1 (clang-1001.0.46.3)]
executable: /Users/taugspurger/Envs/dask-dev/bin/python
machine: Darwin-18.6.0-x86_64-i386-64bit
Python deps:
pip: 19.2.1
setuptools: 41.0.1
sklearn: 0.21.3
numpy: 1.18.0.dev0+5e7e74b
scipy: 1.2.0
Cython: 0.29.9
pandas: 0.25.0+169.g5de4e55d6
I think this needs NumPy>=1.17 and Dask>=2.0.0.
Possible solution: Explicitly convert array-likes to concrete ndarrays where necessary (though it is a bit hard to determine where that is). For example, scikit-learn/sklearn/linear_model/logistic.py, line 1517 in 1484918, could become
self.classes_ = np.asarray(np.unique(y))
That may not be ideal for other libraries implementing __array_function__ (like pydata/sparse).
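A rough sketch of how the explicit conversion would behave, using a hypothetical LazyLike class in place of a Dask array (the class and the unique_classes helper are illustrative, not scikit-learn or Dask API). It relies on np.asarray coercing through __array__ rather than dispatching through __array_function__.

```python
import numpy as np


class LazyLike:
    """Hypothetical array-like that opts in to NEP-18 and supports coercion."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Keep results in our own type, as dask.array does for np.unique.
        unwrapped = tuple(
            a._data if isinstance(a, LazyLike) else a for a in args
        )
        return LazyLike(func(*unwrapped, **kwargs))

    def __array__(self, dtype=None, copy=None):
        # Explicit coercion materialises a concrete ndarray.
        return self._data if dtype is None else self._data.astype(dtype)


def unique_classes(y):
    """Return the unique labels of y as a concrete ndarray."""
    return np.asarray(np.unique(y))


y = LazyLike([1, 0, 1, 1, 0])
classes = unique_classes(y)
print(type(classes).__name__, len(classes))
```

With a real Dask array the np.asarray call would trigger computation, which is the cost of getting a known length at this point in fit.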