I'm opening this issue to discuss some details that would allow the creme library to make the most of vaex's engine. I haven't looked too deeply into this because until now we weren't too concerned with performance.
We call "pure" online learning a scenario where a model learns from one observation at a time. We like this because it's elegant, easy on the implementation side, and simple from a user perspective. However, we have to admit that the huge downside of our approach is that it's painfully slow. Essentially, training a creme model on a large dataset requires a Python for loop, which as we all know is slow.
To be able to do online learning in a fast way, we have to support mini-batches (e.g. scikit-learn's `partial_fit`). The way we're thinking about this is that we don't want to change our design and philosophy. However, we're willing to add mini-batch support to a few selected models that 1) are performant and 2) would benefit from vectorisation.
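
For concreteness, here's the kind of mini-batch loop that scikit-learn's `partial_fit` enables; the toy data below is made up purely to show the shape of the loop:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data, just to illustrate the mini-batch pattern.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] > 0).astype(int)

model = SGDClassifier()
for start in range(0, len(X), 1_000):
    # Each call updates the model with one mini-batch of 1,000 rows.
    X_batch = X[start:start + 1_000]
    y_batch = y[start:start + 1_000]
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
```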
In terms of API, we're thinking of adding some `*_many` methods in addition to the `*_one` methods we already have. These methods would take a `pandas.DataFrame` as input and would output a `pandas.Series` or a `pandas.DataFrame`, depending on the context. We more or less have to go with `pandas` because a big design choice of ours is to have access to feature names, so `numpy` arrays don't make the cut.
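
To make this concrete, here's a minimal sketch of how a `learn_many` method could sit next to `learn_one`. The class and its internals are purely illustrative (it only tracks running means, and a real implementation would also track variances and keep `learn_one` allocation-free), but it shows how the two paths can share logic:

```python
import pandas as pd

# Purely illustrative sketch of the proposed API, using a scaler that
# only tracks running means. The point is that the pure online path and
# the mini-batch path can share the same update logic.
class StandardScaler:

    def __init__(self):
        self.n = 0
        self.means = {}  # feature name -> running mean

    def learn_many(self, X: pd.DataFrame):
        # Vectorised update of the running means from a mini-batch.
        batch_means = X.mean()
        new_n = self.n + len(X)
        for col in X.columns:
            old = self.means.get(col, 0.0)
            self.means[col] = old + (batch_means[col] - old) * len(X) / new_n
        self.n = new_n
        return self

    def learn_one(self, x: dict):
        # The pure online case is just a mini-batch of size one.
        return self.learn_many(pd.DataFrame([x]))

    def transform_many(self, X: pd.DataFrame) -> pd.DataFrame:
        # The output keeps X's column names and index.
        return X - pd.Series(self.means)
```

Note how everything stays in pandas land, so feature names survive end to end.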
I sort of know how to proceed in terms of implementation. I know how to do this in a way that wouldn't require duplicating too much logic, and that would provide the best of both worlds on our side (pure online and mini-batching). As I've said to Jovan, I would rather flesh out with you guys what an ideal world would look like on your side :). From what I understand of what is being said in this scikit-learn issue, you guys are essentially "intercepting" numpy ufunc calls and dispatching to whatever backend is available. Is that right? If so, I suppose that you would want to use numpy as much as possible?
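
To check my understanding, here's a toy version of what I mean by "intercepting" ufunc calls, via NEP 13's `__array_ufunc__` protocol. This is not vaex's actual code, just my mental model of the mechanism:

```python
import numpy as np

# A container that intercepts numpy ufunc calls. The mixin routes Python
# operators like `+` through the corresponding ufuncs, and
# __array_ufunc__ then records the expression instead of evaluating it,
# leaving a backend free to compute it later (out-of-core, multithreaded,
# and so on).
class LazyColumn(np.lib.mixins.NDArrayOperatorsMixin):

    def __init__(self, expression):
        self.expression = expression

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        args = [i.expression if isinstance(i, LazyColumn) else repr(i)
                for i in inputs]
        return LazyColumn(f"{ufunc.__name__}({', '.join(args)})")

    def __repr__(self):
        return f"LazyColumn({self.expression})"

print(np.sqrt(LazyColumn("x") + 1))  # LazyColumn(sqrt(add(x, 1)))
```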
Also, if I understand correctly, in your 1 billion rows in 20 minutes example the `StandardScaler` class is benefiting from `vaex`, but the `SGDClassifier` isn't because it's implemented in Cython rather than numpy? You'll have to excuse me if I'm completely missing the point; a lot of this is flying over my head :)