
Pass through non-ndarray objects that support NEP 13 and NEP 18: vaex + sklearn out of core #14963


Conversation

maartenbreddels

Intro

At EuroScipy we (@ogrisel, @adrinjalali, @JovanVeljanoski and I) discussed the options for 'dataframe in, (same) dataframe out' for sklearn. This was also a popular topic at the dataframe summit (https://datapythonista.github.io/blog/dataframe-summit-at-euroscipy.html cc @datapythonista).

The idea was to experiment with some transformers first, to see if we can get them to work with a vaex dataframe. The proof of concept is in this PR: vaexio/vaex#415

The result is that we/vaex can reuse sklearn without any modification, and it works out of core (on a billion rows, for instance).

How does it work?

fit

By monkey patching (or using this PR), we pass through objects in check_array that are non-ndarray but follow NEP 13 and NEP 18. The idea is that if an object wants to behave like numpy, we should not try to convert it.

Vaex implements NEP 13 and NEP 18 in the referenced PR and will, for instance, intercept np.sum/np.std/np.var/np.isnan. During StandardScaler.fit, for example, the computations are therefore carried out by vaex.
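The pass-through idea can be sketched as follows. The helper names here are illustrative, not sklearn API; the real check_array does far more validation:

```python
import numpy as np

def _is_duck_array(x):
    # Hypothetical heuristic: treat an object as a NEP 13/18 duck array
    # if it is not an ndarray but defines both protocol hooks.
    return (
        not isinstance(x, np.ndarray)
        and hasattr(x, "__array_ufunc__")      # NEP 13
        and hasattr(x, "__array_function__")   # NEP 18
    )

def check_array_passthrough(x):
    # Sketch of the proposed behavior: pass duck arrays through
    # untouched instead of converting them with np.asarray.
    if _is_duck_array(x):
        return x
    return np.asarray(x)
```

With this in place, a vaex dataframe would flow through fit unchanged, while lists and ndarrays are converted as before.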

transform

Executing StandardScaler.transform, the dataframe gets modified by building up virtual columns (we track which operations are performed; we don't compute them). The result is a dataframe with virtual columns that are computed on the fly when needed.
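A toy illustration of the virtual-column idea (this is not vaex's actual implementation; LazyColumn is a made-up stand-in that records operations as expression strings instead of computing them):

```python
class LazyColumn:
    """Toy stand-in for a lazy expression: arithmetic builds up an
    expression string instead of touching any data."""
    def __init__(self, expr):
        self.expr = expr

    def __sub__(self, other):
        return LazyColumn(f"({self.expr} - {other})")

    def __truediv__(self, other):
        return LazyColumn(f"({self.expr} / {other})")

# StandardScaler.transform computes (X - mean) / scale; on a lazy
# column this only builds the expression, nothing is evaluated yet.
col = (LazyColumn("x") - 3.0) / 2.0
print(col.expr)  # -> ((x - 3.0) / 2.0)
```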

What do we get out of this?

During the fit, no memory copies are made (at least for the examples in the vaex PR) and we get multithreading for free. The StandardScaler is 5x faster out of the box and takes virtually no memory (the data is memory-mapped).

After the transform, we don't just have the numerical result, we have the expressions that lead to it. These expressions can be jitted using Numba or Pythran, or executed with CuPy, for extra performance or GPU acceleration. Also, since we don't compute the result, we don't take up any memory.

Also, instead of having a 'nameless' ndarray, we now have a dataframe with meaningful column names.

Conclusion / Question

Should sklearn pass through objects following NEP 13/NEP 18? If yes, there probably needs to be a way to explicitly say 'now I need a real in-memory C/Fortran-contiguous array'.
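For reference, a minimal illustration of what 'I need a real, contiguous array' looks like with plain numpy today (whether such a call should itself be interceptable by duck arrays is exactly the open question):

```python
import numpy as np

x = np.arange(6, dtype=np.float64).reshape(2, 3)[:, ::2]  # non-contiguous view
y = np.ascontiguousarray(x)  # force a real, C-contiguous in-memory array

assert not x.flags["C_CONTIGUOUS"]
assert y.flags["C_CONTIGUOUS"]
```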

I doubt that this PR will be the final step in getting, for instance, vaex and sklearn (or other dataframe/numpy-like libraries) to work together*, but I hope it at least gives food for thought. I can also imagine an even stricter opt-in than just NEP 13 and NEP 18 (another magic dunder?).

*) One issue I see is in PCA, which uses scipy.linalg.svd, which vaex cannot intercept. If numpy.linalg.svd were used, vaex could intercept it and pass it on to, for instance, dask.

Demo notebook screenshot


@datapythonista

Not sure if this can be relevant to dask-ml too, cc @TomAugspurger

@adrinjalali
Member

Related: #14702

@thomasjpfan
Member

thomasjpfan commented Sep 13, 2019

Between this PR and #14687, there seems to be interest from different projects in making this type of change.

@amueller
Member

Should sklearn pass through object following NEP13/NEP18?

At some point, yes. I made a slightly opposite change in #14702 to make sure things behave reasonably first and to maintain backward compatibility.
I definitely want this to work in the future. Having the behavior depend on the installed numpy version is a bit odd, though. I'm not entirely sure how to solve this, other than having a global config that turns this on.
We can't require numpy 1.17 (and won't for a long time), so we can't ensure this works.

If yes, there probably needs to be a way to explicitly say 'now i need a real in memory c/fortran-contiguous array'.

It would be great if we could avoid that in some way. Where do you think we need this?

Also: how do we tell users which estimators support this? For trees, for example, this is very unlikely to work in the foreseeable future. So passing some giant out-of-memory object to PCA might work, but a tree (or TSNE) would try to load it into memory and crash. That's not great.

One option would be that if we ask for a pass-through and the estimator can't do it, the estimator will fail and ask the user to manually convert it to a numpy array. That way the responsibility to decide whether to convert to numpy would fall entirely on the user (if they enable the flag).

@amueller
Member

This might need a SLEP.
We probably need:

  • a global flag to experimentally turn on the feature (say enable_duck_array)
  • an estimator tag to say whether an estimator supports this (say passthrough_duck_array)
  • check_array getting a flag from the estimator saying whether this should be allowed (or _validate_X knowing it from the tag, see #13603)
  • tests that check that if enable_duck_array is false, things are converted (that's #14702), and that if enable_duck_array is true, an estimator will not convert if it has passthrough_duck_array and will fail if it doesn't.

This might sound a bit complicated, but it is the only way I can think of to make sure it's actually transparent to the user what happens.

@amueller
Member

also see #11447

@maartenbreddels
Author

One option would be that if we ask for a pass-through and the estimator can't do it, the estimator will fail and ask the user to manually convert it to a numpy array.

I think this is a good approach.

So, to summarize, to make it work with, for instance, vaex:

  1. a user has to opt in via a global config
  2. an estimator has to opt in by letting check_array know (and check_array has to check for numpy 1.17+)

If a user has opted in, but passes a vaex dataframe to an estimator that does not support it, check_array will raise an exception.

I think that would be a good start!

What I plan to support in vaex, for now, is a context manager that monkey patches sklearn (and undoes the patch on exit), which is equivalent to point 1.
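The context-manager approach can be sketched generically; this is an illustrative pattern, not vaex's actual implementation:

```python
from contextlib import contextmanager

@contextmanager
def patched(module, name, replacement):
    """Swap a module attribute on entry and restore it on exit,
    even if the body raises."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        setattr(module, name, original)
```

Inside `with patched(...)`, the replacement (e.g. a pass-through check_array) is active; afterwards sklearn behaves exactly as before.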

For point 2, I can at least say these work:

  • MinMaxScaler
  • StandardScaler
  • PowerTransformer
  • PCA

@amueller
Member

amueller commented Jan 9, 2020

Do you have any benchmarks on PCA for the different solvers we offer?

@maartenbreddels
Author

No, that's actually the point where I stopped, but I plan to continue soon. It is difficult to compare because vaex will perform better with large datasets and will require less memory (with sklearn, disk-swapping times start to count), while sklearn+numpy will probably be faster with smaller datasets due to less overhead.

Also, a lot of time is spent in the SVD for the full solver; vaex does not have an SVD solver, so we outsource that to dask.array. Also, PCA calls scipy.linalg.svd, which does not support NEP-like features, so we monkeypatch sklearn's PCA module to use numpy.linalg.svd.

The simple scalers/transformers were easy to do; the PCA is where it gets interesting. Any ideas for other SVD solvers are welcome.

The other PCA solvers try to convert the dataframe explicitly to an array, and I have not yet found a way to monkeypatch scipy/numpy to avoid that.
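The SVD redirection described above might look roughly like this. The wrapper and the commented-out patch target are assumptions on my part; sklearn's internal module layout varies between versions:

```python
import numpy as np

def numpy_svd(a, full_matrices=True, **ignored):
    """Drop-in stand-in for scipy.linalg.svd backed by np.linalg.svd,
    so NEP 18 __array_function__ implementations can intercept the
    call. Signature compatibility with scipy's extra kwargs is only
    sketched here (they are silently ignored)."""
    return np.linalg.svd(a, full_matrices=full_matrices)

# Hypothetical patch (module path is an assumption about sklearn
# internals, e.g. with the patched() context manager from above):
#
#   import types
#   import sklearn.decomposition._pca as _pca
#   _pca.linalg = types.SimpleNamespace(svd=numpy_svd)
```

For plain ndarrays the wrapper gives the same factorization as before; for a duck array, np.linalg.svd would dispatch to the library's own implementation (e.g. dask).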

@adrinjalali
Member

The array API support in scikit-learn is moving forward nicely now. I had completely forgotten about this issue, but it can be closed now.
