
Better integration with creme #747

@MaxHalford


I'm opening this issue to discuss some details that would allow the creme library to make the most of vaex's engine. I haven't looked too deeply into this yet because, until now, we weren't too concerned with performance.

We use "pure" online learning to mean a scenario where a model learns from one observation at a time. We like this because it's elegant, easy on the implementation side, and simple from a user's perspective. However, we have to admit that the huge downside of our approach is that it's painfully slow: training a creme model on a large dataset boils down to a Python for loop, which, as we all know, is slow.
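To make this concrete, here is a minimal sketch of that loop; the data iterator is hypothetical and the method name is only illustrative of our *_one style:

```python
from creme import linear_model

model = linear_model.LogisticRegression()

# x is a dict mapping feature names to values, y is the target.
# Every observation goes through a Python-level method call, which is
# what makes training on large datasets painfully slow.
for x, y in iter_observations():  # hypothetical stream of (dict, label) pairs
    model.fit_one(x, y)
```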

To do online learning fast, we have to support mini-batches (e.g. scikit-learn's partial_fit). Our thinking is that we don't want to change our design and philosophy. However, we're willing to add mini-batch support to a few selected models that 1) are performant and 2) would benefit from vectorisation.
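For reference, here is roughly what the scikit-learn equivalent looks like; the batch iterator and the label set are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])

# Each call updates the model from a whole mini-batch at once, so the
# per-observation work happens in vectorised numpy code rather than in
# a Python loop.
for X_batch, y_batch in iter_batches():  # hypothetical batch iterator
    # classes must be passed on the first call so the model knows
    # every label upfront.
    model.partial_fit(X_batch, y_batch, classes=classes)
```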

In terms of API, we're thinking of adding some *_many methods alongside the *_one methods we already have. These methods would take a pandas.DataFrame as input and would output a pandas.Series or a pandas.DataFrame, depending on the context. We kind of have to go with pandas because a big design choice of ours is to have access to feature names, so numpy arrays don't make the cut.
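To give an idea, here's a hedged sketch of what such a model could look like; none of this is final API, and the method bodies are deliberately elided:

```python
import pandas as pd

class LinearRegression:

    def fit_one(self, x: dict, y: float):
        """Existing pure online update from a single observation."""
        ...

    def fit_many(self, X: pd.DataFrame, y: pd.Series):
        """Vectorised update over a mini-batch.

        Feature names come from the DataFrame's columns, which is why a
        bare numpy array wouldn't work for us.
        """
        ...

    def predict_many(self, X: pd.DataFrame) -> pd.Series:
        """One prediction per row, indexed like X."""
        ...
```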

I sort of know how to proceed in terms of implementation. I know how to do this in a way that wouldn't require duplicating too much logic, and that would provide the best of both worlds on our side (pure online and mini-batching). As I've said to Jovan, I would rather flesh out with you guys what an ideal world would look like on your side :). From what I understand of this scikit-learn issue, you guys are essentially "intercepting" numpy ufunc calls and dispatching them to whatever backend is available. Is that right? If so, I suppose you would want us to use numpy as much as possible?
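For my own understanding, here is a toy illustration of what I mean by "intercepting" ufunc calls, using numpy's __array_ufunc__ protocol (NEP 13); this is only my mental model of it, and vaex's real machinery is surely far more involved:

```python
import numpy as np

class LazyColumn:
    """Records ufunc calls as expressions instead of evaluating them."""

    def __init__(self, expression):
        self.expression = expression

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # numpy hands us the ufunc instead of computing it; we build up
        # a symbolic expression that a backend could later evaluate
        # lazily, out-of-core, or on a different engine entirely.
        args = ', '.join(getattr(i, 'expression', repr(i)) for i in inputs)
        return LazyColumn(f'{ufunc.__name__}({args})')

col = LazyColumn('x')
result = np.sqrt(np.add(col, 1))
print(result.expression)  # sqrt(add(x, 1))
```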

Also, if I understand correctly, in your 1 billion rows in 20 minutes example, the StandardScaler class benefits from vaex, but the SGDClassifier doesn't because it's implemented in Cython rather than numpy? You'll have to excuse me if I'm completely missing the point; a lot of this is flying over my head :)
