
Classifiers may not work with arrays defining __array_function__ #14687

Closed
@TomAugspurger

Description

With NEP-18, NumPy functions that previously converted an array-like to an ndarray implicitly may no longer do that conversion. dask.array recently implemented __array_function__, so np.unique(dask.array.Array) now returns a dask.array.Array.

There are some more details in dask/dask-ml#541.
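
To illustrate the dispatch described above without needing Dask, here is a minimal sketch of a duck array that opts in to NEP-18. The LazyArray class is hypothetical and exists only for illustration; it is not how dask.array implements the protocol.

```python
import numpy as np


class LazyArray:
    """Hypothetical duck array (illustration only, not Dask's implementation)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy hands the call to us instead of coercing the input to ndarray.
        unwrapped = tuple(a._data if isinstance(a, LazyArray) else a for a in args)
        result = func(*unwrapped, **kwargs)
        # Re-wrap ndarray results so the caller gets a LazyArray back.
        return LazyArray(result) if isinstance(result, np.ndarray) else result


classes = np.unique(LazyArray([0, 1, 1, 0, 2]))
print(type(classes).__name__)  # LazyArray, not ndarray
```

Any code that assumed np.unique always returns an ndarray now receives whatever type the input library chooses to return.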

Steps/Code to Reproduce

import dask.array as da
import dask_ml.datasets
import numpy as np
import sklearn.linear_model

# X and y are lazy dask arrays
X, y = dask_ml.datasets.make_classification(chunks=50)

clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)  # raises TypeError, see Actual Results

Expected Results

No error, and the same output as clf.fit(X.compute(), y.compute()),
or as running with the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION='0' set.

Actual Results

That raises

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)

~/Envs/dask-dev/lib/python3.7/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1536
   1537         multi_class = _check_multi_class(self.multi_class, solver,
-> 1538                                          len(self.classes_))
   1539
   1540         if solver == 'liblinear':

TypeError: 'float' object cannot be interpreted as an integer

This is because self.classes_ = np.unique(y) is a Dask Array with unknown length:

In [2]: np.unique(da.arange(12))
Out[2]: dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>

since Dask is lazy and doesn't know the unique elements until compute time.
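
The error message itself can be reproduced without Dask. This is a minimal sketch, assuming the length of such an array comes out as a float nan; the UnknownLength class is hypothetical. Python's len() requires __len__ to return an integer, so a float result raises the same TypeError.

```python
class UnknownLength:
    """Hypothetical stand-in for an array whose first dimension is unknown."""

    shape = (float("nan"),)

    def __len__(self):
        # Returning a float (here nan) violates the len() protocol.
        return self.shape[0]


try:
    len(UnknownLength())
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```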

Versions

System:
    python: 3.7.3 (default, Apr  5 2019, 14:56:38)  [Clang 10.0.1 (clang-1001.0.46.3)]
executable: /Users/taugspurger/Envs/dask-dev/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python deps:
       pip: 19.2.1
setuptools: 41.0.1
   sklearn: 0.21.3
     numpy: 1.18.0.dev0+5e7e74b
     scipy: 1.2.0
    Cython: 0.29.9
    pandas: 0.25.0+169.g5de4e55d6

I think reproducing this needs NumPy>=1.17 and Dask>=2.0.0.


Possible solution: explicitly convert array-likes to concrete ndarrays where necessary (though it is a bit hard to determine where that is). For example,

self.classes_ = np.unique(y)

would become

self.classes_ = np.asarray(np.unique(y))

That may not be ideal for other libraries implementing __array_function__ (like pydata/sparse).
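
A minimal sketch of that conversion, runnable without Dask. Both DuckArray and unique_classes are hypothetical names used for illustration; scikit-learn has no such helper. np.asarray calls the object's __array__ method, which yields a concrete ndarray with a known length.

```python
import numpy as np


class DuckArray:
    """Hypothetical array-like that overrides NumPy functions (NEP-18)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap, call the real NumPy function, and wrap the result again.
        args = tuple(a._data if isinstance(a, DuckArray) else a for a in args)
        return DuckArray(func(*args, **kwargs))

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray to obtain a concrete ndarray.
        return np.asarray(self._data, dtype=dtype)


def unique_classes(y):
    # Hypothetical helper: force the result of np.unique to an ndarray.
    return np.asarray(np.unique(y))


classes = unique_classes(DuckArray([2, 0, 1, 0]))
print(type(classes).__name__, len(classes))
```

For a lazy library such as Dask, this conversion would presumably trigger computation at that point, which is part of why deciding where to convert is a judgment call.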
