I'm opening this issue to discuss some details that would allow the creme library to make the most of vaex's engine. I haven't looked too deeply into this because until now we weren't too concerned with performance.
We call "pure" online learning a scenario where a model learns from one observation at a time. We like this because it's elegant, easy on the implementation side, and simple from a user perspective. However, we have to admit that the huge downside of our approach is that it's painfully slow. Essentially, training a creme model on a large dataset requires a Python for loop, which as we all know is slow.
To be able to do online learning in a fast way, we have to support mini-batches (e.g. scikit-learn's `partial_fit`). The way we're thinking about this is that we don't want to change our design and philosophy. However, we're willing to add mini-batch support to a few selected models that 1) are performant and 2) would benefit from vectorisation.
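
For concreteness, here's the kind of mini-batch loop that scikit-learn's `partial_fit` enables; the toy data below is made up purely to show the shape of the loop:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data, just to illustrate the mini-batch pattern.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] > 0).astype(int)

model = SGDClassifier()
for start in range(0, len(X), 1_000):
    # Each call updates the model with one mini-batch of 1,000 rows.
    X_batch = X[start:start + 1_000]
    y_batch = y[start:start + 1_000]
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
```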
In terms of API, we're thinking of adding some `*_many` methods in addition to the `*_one` methods we already have. These methods would take a `pandas.DataFrame` as input and would output a `pandas.Series` or a `pandas.DataFrame`, depending on the context. We more or less have to go with `pandas` because a big design choice of ours is to have access to feature names, so `numpy` arrays don't make the cut.
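
To make this concrete, here's a minimal sketch of how a `learn_many` method could sit next to `learn_one`. The class and its internals are purely illustrative (it only tracks running means, and a real implementation would also track variances and keep `learn_one` allocation-free), but it shows how the two paths can share logic:

```python
import pandas as pd

# Purely illustrative sketch of the proposed API, using a scaler that
# only tracks running means. The point is that the pure online path and
# the mini-batch path can share the same update logic.
class StandardScaler:

    def __init__(self):
        self.n = 0
        self.means = {}  # feature name -> running mean

    def learn_many(self, X: pd.DataFrame):
        # Vectorised update of the running means from a mini-batch.
        batch_means = X.mean()
        new_n = self.n + len(X)
        for col in X.columns:
            old = self.means.get(col, 0.0)
            self.means[col] = old + (batch_means[col] - old) * len(X) / new_n
        self.n = new_n
        return self

    def learn_one(self, x: dict):
        # The pure online case is just a mini-batch of size one.
        return self.learn_many(pd.DataFrame([x]))

    def transform_many(self, X: pd.DataFrame) -> pd.DataFrame:
        # The output keeps X's column names and index.
        return X - pd.Series(self.means)
```

Note how everything stays in pandas land, so feature names survive end to end.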
I sort of know how to proceed in terms of implementation. I know how to do this in a way that wouldn't require duplicating too much logic, and that would provide the best of both worlds on our side (pure online and mini-batching). As I've said to Jovan, I would rather flesh out with you guys what an ideal world would look like on your side :). From what I understand of what is being said in this scikit-learn issue, you guys are essentially "intercepting" numpy ufunc calls and dispatching to whatever backend is available. Is that right? If so, I suppose that you would want to use numpy as much as possible?
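
To check my understanding, here's a toy version of what I mean by "intercepting" ufunc calls, via NEP 13's `__array_ufunc__` protocol. This is not vaex's actual code, just my mental model of the mechanism:

```python
import numpy as np

# A container that intercepts numpy ufunc calls. The mixin routes Python
# operators like `+` through the corresponding ufuncs, and
# __array_ufunc__ then records the expression instead of evaluating it,
# leaving a backend free to compute it later (out-of-core, multithreaded,
# and so on).
class LazyColumn(np.lib.mixins.NDArrayOperatorsMixin):

    def __init__(self, expression):
        self.expression = expression

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        args = [i.expression if isinstance(i, LazyColumn) else repr(i)
                for i in inputs]
        return LazyColumn(f"{ufunc.__name__}({', '.join(args)})")

    def __repr__(self):
        return f"LazyColumn({self.expression})"

print(np.sqrt(LazyColumn("x") + 1))  # LazyColumn(sqrt(add(x, 1)))
```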
Also, if I understand correctly, in your 1 billion rows in 20 minutes example the `StandardScaler` class is benefiting from `vaex`, but the `SGDClassifier` isn't because it's implemented in Cython rather than numpy? You'll have to excuse me if I'm completely missing the point; a lot of this is flying over my head :)