
Pass through non-ndarray objects that support NEP 13 and NEP 18: vaex + sklearn out of core #14963


Conversation

maartenbreddels

Intro

At EuroScipy we (@ogrisel, @adrinjalali, @JovanVeljanoski and I) discussed the options for 'dataframe in, (same) dataframe out' for sklearn. This was also a popular topic at the dataframe summit (https://datapythonista.github.io/blog/dataframe-summit-at-euroscipy.html cc @datapythonista).

The idea was to experiment with some transformers first, to see if we can get them to work with a vaex dataframe. The proof of concept is in this PR: vaexio/vaex#415

The result is that we/vaex can reuse sklearn without any modification, and it works out of core (on a billion rows, for instance).

How does it work?

fit

By monkey patching (or using this PR), we pass through objects in check_array that are non-ndarray but follow NEP 13 and NEP 18. The idea is that if an object wants to behave like numpy, we should not try to convert it.

Vaex implements NEP 13 and NEP 18 in the referenced PR and will, for instance, intercept np.sum/np.std/np.var/np.isnan. During StandardScaler.fit, for example, the computations are therefore carried out by vaex.
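The pass-through idea can be sketched as follows. The helper names here are illustrative, not sklearn API; the real check_array does far more validation:

```python
import numpy as np

def _is_duck_array(x):
    # Hypothetical heuristic: treat an object as a NEP 13/18 duck array
    # if it is not an ndarray but defines both protocol hooks.
    return (
        not isinstance(x, np.ndarray)
        and hasattr(x, "__array_ufunc__")      # NEP 13
        and hasattr(x, "__array_function__")   # NEP 18
    )

def check_array_passthrough(x):
    # Sketch of the proposed behavior: pass duck arrays through
    # untouched instead of converting them with np.asarray.
    if _is_duck_array(x):
        return x
    return np.asarray(x)
```

With this in place, a vaex dataframe would flow through fit unchanged, while lists and ndarrays are converted as before.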

transform

Executing StandardScaler.transform, the dataframe gets modified by building up virtual columns (we track which operations are performed; we don't compute them). The result is a dataframe with virtual columns that are computed on the fly when needed.
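A toy illustration of the virtual-column idea (this is not vaex's actual implementation; LazyColumn is a made-up stand-in that records operations as expression strings instead of computing them):

```python
class LazyColumn:
    """Toy stand-in for a lazy expression: arithmetic builds up an
    expression string instead of touching any data."""
    def __init__(self, expr):
        self.expr = expr

    def __sub__(self, other):
        return LazyColumn(f"({self.expr} - {other})")

    def __truediv__(self, other):
        return LazyColumn(f"({self.expr} / {other})")

# StandardScaler.transform computes (X - mean) / scale; on a lazy
# column this only builds the expression, nothing is evaluated yet.
col = (LazyColumn("x") - 3.0) / 2.0
print(col.expr)  # -> ((x - 3.0) / 2.0)
```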

What do we get out of this?

During the fit, no memory copies are made (at least for the examples in the vaex PR) and we get multithreading for free. The StandardScaler is 5x faster out of the box and takes virtually no memory (the data is memory-mapped).

After the transform, we don't just have the numerical result, we have the expressions that lead to it. These expressions can be jitted using Numba or Pythran, or executed with CuPy, for extra performance or GPU acceleration. Also, since we don't compute the result, we don't take up any memory.

Also, instead of having a 'nameless' ndarray, we now have a dataframe with meaningful column names.

Conclusion / Question

Should sklearn pass through objects following NEP 13/NEP 18? If yes, there probably needs to be a way to explicitly say 'now I need a real in-memory C/Fortran-contiguous array'.
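For reference, a minimal illustration of what 'I need a real, contiguous array' looks like with plain numpy today (whether such a call should itself be interceptable by duck arrays is exactly the open question):

```python
import numpy as np

x = np.arange(6, dtype=np.float64).reshape(2, 3)[:, ::2]  # non-contiguous view
y = np.ascontiguousarray(x)  # force a real, C-contiguous in-memory array

assert not x.flags["C_CONTIGUOUS"]
assert y.flags["C_CONTIGUOUS"]
```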

I doubt that this PR will be the final step in getting, for instance, vaex and sklearn (or other dataframe/numpy-like libraries) to work together*, but I hope it at least gives food for thought. I can also imagine an even stricter opt-in than just NEP 13 and NEP 18 (another magic dunder?).

*) One issue I see is in PCA, which uses scipy.linalg.svd, which vaex cannot intercept. If numpy.linalg.svd were used, vaex could intercept it and pass it on to, for instance, dask.

Demo notebook screenshot


@datapythonista

Not sure if this can be relevant to dask-ml too, cc @TomAugspurger

@adrinjalali
Member

Related: #14702

@thomasjpfan
Member

thomasjpfan commented Sep 13, 2019

Between this PR and #14687, there seems to be interest from different projects in making this type of change.

@amueller
Member

Should sklearn pass through object following NEP13/NEP18?

At some point, yes. I made a slightly opposite change in #14702 to make sure things behave reasonably first and to maintain backward compatibility.
I definitely want this to work in the future. Having the behavior depend on the installed numpy version is a bit odd, though. I'm not entirely sure how to solve this, other than having a global config that turns this on.
We can't require numpy 1.17 (and won't for a long time), so we can't ensure this works.

If yes, there probably needs to be a way to explicitly say 'now i need a real in memory c/fortran-contiguous array'.

It would be great if we could avoid that in some way. Where do you think we need this?

Also: how do we tell users which estimators support this? For trees, for example, this is very unlikely to work in the foreseeable future. So passing some giant out-of-memory object to PCA might work, but a tree (or TSNE) would try to load it into memory and crash. That's not great.

One option would be that if we ask for a pass-through and the estimator can't do it, the estimator will fail and ask the user to manually convert it to a numpy array. That way the responsibility to decide whether to convert to numpy would fall entirely on the user (if they enable the flag).

@amueller
Member

This might need a SLEP.
We probably need:

  • a global flag to experimentally turn on the feature (say enable_duck_array)
  • an estimator tag to say whether an estimator supports this (say passthrough_duck_array)
  • check_array getting a flag from the estimator saying whether this should be allowed (or _validate_X knowing it from the tag, see #13603)
  • tests that check that if enable_duck_array is false, things are converted (that's #14702), and that if enable_duck_array is true, an estimator will not convert if it has passthrough_duck_array and will fail if it doesn't.

This might sound a bit complicated, but it is the only way I can think of to make sure it's actually transparent to the user what happens.

@amueller
Member

also see #11447

@maartenbreddels
Author

One option would be that if we ask for a pass-through and the estimator can't do it, the estimator will fail and ask the user to manually convert it to a numpy array.

I think this is a good approach.

So, to summarize, to make it work with, for instance, vaex:

  1. a user has to opt in via a global config
  2. an estimator has to opt in by letting check_array know (and check_array has to check for numpy 1.17+)

If a user has opted in, but passes a vaex dataframe to an estimator that does not support it, check_array will raise an exception.

I think that would be a good start!

What I plan to support in vaex, for now, is a context manager that monkey patches sklearn (and undoes the patch on exit), which is equivalent to point 1.
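The context-manager approach can be sketched generically; this is an illustrative pattern, not vaex's actual implementation:

```python
from contextlib import contextmanager

@contextmanager
def patched(module, name, replacement):
    """Swap a module attribute on entry and restore it on exit,
    even if the body raises."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        setattr(module, name, original)
```

Inside `with patched(...)`, the replacement (e.g. a pass-through check_array) is active; afterwards sklearn behaves exactly as before.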

For point 2, I can at least say these work:

  • MinMaxScaler
  • StandardScaler
  • PowerTransformer
  • PCA

@amueller
Member

amueller commented Jan 9, 2020

Do you have any benchmarks on PCA for the different solvers we offer?

@maartenbreddels
Author

No, that's actually the point where I stopped, but I plan to continue soon. It is difficult to compare because vaex will perform better with large datasets and will require less memory (with sklearn, disk-swapping times start to count), while sklearn+numpy will probably be faster with smaller datasets due to less overhead.

Also, a lot of time is spent in the SVD for the full solver; vaex does not have an SVD solver, so we outsource that to dask.array. Also, PCA calls scipy.linalg.svd, which does not support NEP-like features, so we monkeypatch sklearn's PCA module to use numpy.linalg.svd.

The simple scalers/transformers were easy to do; the PCA is where it gets interesting. Any ideas for other SVD solvers are welcome.

The other PCA solvers try to convert the dataframe explicitly to an array, and I have not yet found a way to monkeypatch scipy/numpy to avoid that.
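The SVD redirection described above might look roughly like this. The wrapper and the commented-out patch target are assumptions on my part; sklearn's internal module layout varies between versions:

```python
import numpy as np

def numpy_svd(a, full_matrices=True, **ignored):
    """Drop-in stand-in for scipy.linalg.svd backed by np.linalg.svd,
    so NEP 18 __array_function__ implementations can intercept the
    call. Signature compatibility with scipy's extra kwargs is only
    sketched here (they are silently ignored)."""
    return np.linalg.svd(a, full_matrices=full_matrices)

# Hypothetical patch (module path is an assumption about sklearn
# internals, e.g. with the patched() context manager from above):
#
#   import types
#   import sklearn.decomposition._pca as _pca
#   _pca.linalg = types.SimpleNamespace(svd=numpy_svd)
```

For plain ndarrays the wrapper gives the same factorization as before; for a duck array, np.linalg.svd would dispatch to the library's own implementation (e.g. dask).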

@adrinjalali
Member

The array API support in scikit-learn is moving forward nicely now. I had completely forgotten about this issue, but it can be closed now.
