Pass through non-ndarray object that support nep13 and nep18: vaex+sklearn out of core #14963
Conversation
Not sure if this can be relevant to
Related: #14702
Between this PR and #14687, there seems to be interest in making this type of change from different projects.
At some point, yes. I made a slightly opposite change in #14702 to make sure things behave reasonably first and maintain backward compatibility.
It would be great if we could avoid that in some way. Where do you think we need this? Also: how do we tell users which estimators support this? For trees, for example, this is very unlikely to work in the foreseeable future. So passing some giant out-of-memory object to PCA might work, but a tree (or TSNE) would try to load it into memory and crash. That's not great. One option would be that if the user asks for pass-through and the estimator can't do it, the estimator fails and asks the user to manually convert the input to a numpy array. That way the responsibility for deciding whether to convert to numpy would fall entirely on the user (if they enable the flag).
This might need a SLEP.
This might sound a bit complicated, but it would be the only way I can think of to make sure it's actually transparent to the user what happens.
Also see #11447.
I think this is a good approach. So, to summarize, to make it work with, for instance, vaex:
If a user has opted in but passes a vaex dataframe to an estimator that does not support it, check_array will raise an exception. I think that would be a good start! What I plan to support in vaex, for now, is a context manager that monkey patches sklearn (and undoes the patch on exit), which is equivalent to point 1. For point 2 I can at least say these work:
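The "context manager that monkey patches sklearn" could look roughly like the sketch below. This is a hypothetical illustration, not vaex's actual code: `patched` temporarily replaces an attribute on a target object and restores it on exit, demonstrated on a stand-in `Module` class instead of the real sklearn module.

```python
import contextlib

# Hypothetical helper: temporarily replace `name` on `target` with
# `replacement`, restoring the original attribute when the block exits,
# even if an exception is raised inside it.
@contextlib.contextmanager
def patched(target, name, replacement):
    original = getattr(target, name)
    setattr(target, name, replacement)
    try:
        yield
    finally:
        setattr(target, name, original)

# Stand-in for the sklearn module being patched (illustrative only):
class Module:
    @staticmethod
    def check_array(X):
        return "strict"

with patched(Module, "check_array", lambda X: "pass-through"):
    assert Module.check_array(None) == "pass-through"  # patched behavior
assert Module.check_array(None) == "strict"            # restored on exit
```

In the real scenario, `target` would be `sklearn.utils.validation` and `replacement` a pass-through version of `check_array`.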
Do you have any benchmarks on PCA for the different solvers we offer?
No, it's actually the point where I stopped, but I plan to continue soon. It is difficult to compare because vaex will perform better on large datasets and require less memory (with sklearn, disk-swapping times start to count), while sklearn+numpy will probably be faster on smaller datasets due to less overhead. Also, a lot of time is spent in the SVD for the full solver; vaex does not have an SVD solver, we outsource that to dask.array. Moreover, PCA calls scipy.linalg.svd, which does not support NEP-like features, so we monkeypatch sklearn's PCA module to use numpy.linalg.svd. The simple scalers/transformers were easy to do; PCA is where it gets interesting. Any ideas for other SVD solvers are welcome. The other PCA solvers try to convert the dataframe explicitly to an array, and I have not yet found a way to monkeypatch scipy/numpy to avoid that.
The array API support in scikit-learn is moving forward nicely now. I had completely forgotten about this, but it can be closed now.
Intro
At EuroScipy we (@ogrisel, @adrinjalali, @JovanVeljanoski and me) discussed the options for 'dataframe in, (same) dataframe out' for sklearn. This was also a popular topic at the dataframe summit (https://datapythonista.github.io/blog/dataframe-summit-at-euroscipy.html cc @datapythonista).
The idea was to experiment first with some transformers, to see if we can get them to work with a vaex dataframe. The proof of concept is in this PR: vaexio/vaex#415
The result is that we/vaex can reuse sklearn without any modification, and it works out of core (on a billion rows, for instance).
How does it work?
fit
By monkey patching (or using this PR), check_array passes through non-ndarray objects that follow NEP 13 and NEP 18. The idea being: if you want to behave like numpy, we will not try to convert you.
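The pass-through rule could be sketched like this. The helper name `check_array_passthrough` is illustrative (it is not sklearn's actual `check_array`): anything defining both the NEP 13 (`__array_ufunc__`) and NEP 18 (`__array_function__`) hooks is returned untouched, everything else is converted with `np.asarray`.

```python
import numpy as np

# Hypothetical pass-through check (not sklearn's real check_array):
# duck arrays implementing NEP 13 + NEP 18 are trusted and returned as-is.
def check_array_passthrough(X):
    if hasattr(X, "__array_ufunc__") and hasattr(X, "__array_function__"):
        return X  # behaves like numpy: do not convert
    return np.asarray(X)

class DuckArray:
    """Minimal stand-in for a NEP 13/18-aware container like a vaex frame."""
    __array_ufunc__ = None  # placeholder; a real library would dispatch here
    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented  # placeholder

duck = DuckArray()
assert check_array_passthrough(duck) is duck                     # passed through
assert isinstance(check_array_passthrough([1, 2]), np.ndarray)   # converted
```

Note that a plain ndarray also defines both hooks, so it is passed through unchanged as well.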
Vaex implements NEP 13 and NEP 18 in the referenced PR and will, for instance, intercept np.sum/np.std/np.var/np.isnan etc. During StandardScaler.fit, for example, it will use vaex for the computations.
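A toy version of that interception mechanism (illustrative only, not vaex's implementation): a wrapper class implements `__array_function__`, the NEP 18 hook, so that `np.sum` on the wrapper dispatches to the wrapper's own method instead of converting to an ndarray.

```python
import numpy as np

# Toy NEP 18 container: wraps a numpy array and intercepts np.sum via
# __array_function__, the same hook a vaex dataframe uses to redirect
# computations to its own out-of-core engine.
class Wrapped:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            # a real library would run its own multithreaded reduction here
            return self.data.sum()
        return NotImplemented

w = Wrapped([1.0, 2.0, 3.0])
print(np.sum(w))  # dispatches to Wrapped.__array_function__ -> 6.0
```

This requires NumPy >= 1.17, where the `__array_function__` protocol is enabled by default.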
transform
Executing StandardScaler.transform, the dataframe gets modified by building up virtual columns (we track which operations are being performed; we don't compute). The result is a dataframe with virtual columns that will be computed on the fly when needed.
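The virtual-column idea can be illustrated with a tiny lazy expression class (again a sketch, not vaex code): transform records an expression instead of computing a result, and evaluation only happens when the values are actually requested.

```python
# Toy lazy "virtual column": stores a function and a human-readable
# expression; nothing is computed until evaluate() is called.
class VirtualColumn:
    def __init__(self, fn, description):
        self.fn = fn
        self.description = description  # e.g. "(x - mean) / std"

    def evaluate(self, data):
        # materialize the column only now, on the fly
        return [self.fn(v) for v in data]

mean, std = 2.0, 1.0
scaled = VirtualColumn(lambda v: (v - mean) / std, "(x - 2.0) / 1.0")
# No computation has happened yet; only now do we materialize:
print(scaled.evaluate([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```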
What do we get out of this?
During the fit, no memory copies are made (at least for the examples in the vaex PR) and we get multithreading for free. The StandardScaler is 5x faster out of the box and takes virtually no memory (memory-mapped data).
After the transform, we don't just have the numerical result, we have the expressions that lead to it. These expressions can be jitted using Numba or Pythran, or run with CuPy, for extra performance or GPU acceleration. Also, since we don't compute the result, we don't take up any memory.
Also, instead of having a 'nameless' ndarray, we now have a dataframe with meaningful column names.
Conclusion / Question
Should sklearn pass through objects following NEP 13/NEP 18? If yes, there probably needs to be a way to explicitly say 'now I need a real in-memory C/Fortran-contiguous array'.
I doubt that this PR will be the final step in getting, for instance, vaex and sklearn or other dataframe/numpy-like libraries to work together*, but I hope that at least it gives food for thought. I can also imagine an even stricter opt-in than just NEP 13 and NEP 18 (another magic dunder?).
*) One issue I see is in the PCA, which uses scipy.linalg.svd, which vaex cannot intercept. If numpy.linalg.svd were used, vaex could intercept that and pass it on to, for instance, dask.
Demo notebook screenshot