[MRG+2] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler #5104
Conversation
As an additional note, I do not support sparse inputs at the moment. It is not a trivial extension and I would like to have feedback on what I have so far.
```python
    It assumed that X is already 2d.
    """
    if X.shape[1] == 1:
        return np.transpose(np.array(X, copy=copy))
```
Doesn't this destroy sparsity?
Yes it does. I have edited and integrated a simple test in the next commit.
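For context, the problem with the snippet above can be reproduced with a minimal sketch (variable names are illustrative; a sparsity-preserving branch is shown for comparison):

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[0.0], [3.0], [0.0]])   # a sparse column vector

# Wrapping a sparse matrix in np.array does not densify it into a
# usable 2d ndarray, so the transpose-based reshaping silently breaks.
# A sparsity-aware version branches explicitly:
if sp.issparse(X):
    Xt = X.T                               # transpose stays sparse
else:
    Xt = np.transpose(np.array(X))

print(sp.issparse(Xt))                     # sparsity is preserved
```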
(force-pushed e269b0f to a189567)
```python
# This is the first partial_fit
```
I don't remember what the consensus was w.r.t. calling `fit` and then `partial_fit`. Should we forget the model? I think there are some arguments about that in #3907.
I read #3907 last week: I don't think consensus was reached. In a way, there is no contract with the final user. Last comment by @jnothman:

> The point is that the interaction between `fit` and `partial_fit` hasn't been previously addressed. The assumption is that the user will choose one or the other, but not both. Forcing a particular behaviour by way of the proposed tests will make the matter more consistent, but is also perhaps more of a framework than is actually needed, because the case is somewhat pathological.
>
> I think @ogrisel is planning to put some thought into the possibility of an API for partial fitting, which would settle the semantics for all the cases.
Do you mean that we should not allow the batches to have size 1? This would be a fairly standard application for an online scaler.

No, I mean that a batch of size one should have a shape of

Agreed. Are you suggesting to raise an error? Again, it's hard to see a common behaviour in all the

There is no currently respected behavior but we will move to raising errors if
(force-pushed b0d457a to 01f9d2f)
Possible refactoring
(force-pushed 01f9d2f to f279a16)
+1 for raising a
```python
# New array to avoid side-effects
new_scale = np.array([x for x in scale])
new_scale[new_scale == 0.0] = 1.0
new_scale[~np.isfinite(new_scale)] = 1.0
```
Why silently remove non-finite values? The `check_array`-based input checks in `StandardScaler` and the like should already raise an exception if there are non-finite values in the data. No need to deal with that case again here. If the variance computation overflows (in the event our implementation is not numerically stable) we should not silently hide it.
Agreed. The conversion was already there and I replicated it. No test is broken if that's removed.
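For background, what the zero-handling above guards against can be sketched as a simplified standalone version (not the library code):

```python
import numpy as np

scale = np.array([2.0, 0.0, 5.0])    # a constant feature yields scale 0
new_scale = scale.copy()             # copy to avoid side-effects on the input
new_scale[new_scale == 0.0] = 1.0    # map 0 -> 1 so constant features pass through

X = np.array([[4.0, 7.0, 10.0]])
print(X / new_scale)                 # no divide-by-zero warning; [[2., 7., 2.]]
```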
Can you please add a test to check that https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
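The combine step referenced by that link (Chan et al.'s online algorithm) can be sketched as a standalone helper; the function name is hypothetical, and the PR's actual version lives in `utils.extmath`:

```python
import numpy as np

def incremental_mean_var(batch, last_mean, last_var, last_n):
    # One combine step of the online mean/variance algorithm (Chan et al.).
    # Standalone sketch, not scikit-learn's implementation.
    n = batch.shape[0]
    new_n = last_n + n
    batch_mean = batch.mean(axis=0)
    delta = batch_mean - last_mean
    new_mean = last_mean + delta * n / new_n
    m_a = last_var * last_n          # sum of squared deviations seen so far
    m_b = batch.var(axis=0) * n      # same, for the incoming batch
    new_var = (m_a + m_b + delta ** 2 * last_n * n / new_n) / new_n
    return new_mean, new_var, new_n
```

A test in the spirit of the request above would compare the incremental result against `np.mean`/`np.var` computed on the concatenated data.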
```python
self.min_ = feature_range[0] - data_min * self.scale_
self.data_min = data_min
self.data_max = data_max
self.data_range = data_range
```
All attributes that are computed from the data in the fit method should be suffixed with `_` (scikit-learn convention).
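A minimal sketch of that convention (toy class, not the scikit-learn implementation): attributes estimated from data in `fit` carry a trailing underscore.

```python
import numpy as np

class TinyMinMaxScaler:
    # Toy illustration of the naming convention only.
    def fit(self, X):
        # Estimated-from-data attributes get the trailing underscore:
        self.data_min_ = X.min(axis=0)
        self.data_max_ = X.max(axis=0)
        self.data_range_ = self.data_max_ - self.data_min_
        return self

s = TinyMinMaxScaler().fit(np.array([[1.0, 10.0], [3.0, 20.0]]))
print(s.data_range_)  # [ 2. 10.]
```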
(force-pushed 9ef580c to 93fc4b3)
These tests also pass for numpy versions 1.6, 1.7 and 1.8.

@sdvillal do you refer to

I have double checked

That's it. I made a full pass. +1 for merge after comments are addressed. Thanks a lot for your work and clear tests!

Thank you again @MechCoder. I have started to work on issues related to 1d inputs. Scalers were not fully covered by #5152. I will need to fix some tests too, indeed.

By the way, I have noticed that
(force-pushed 0baeb3d to b771b4f)
Done. I have edited the tests quite a lot, to address your comments and to deal with the new conventions for 1d inputs. Another pass on

The only last thing I want to do is to test the
(force-pushed 94e462f to 78ce989)
```python
elif isinstance(scale, np.ndarray):
    scale[scale == 0.0] = 1.0
    scale[~np.isfinite(scale)] = 1.0
```
Sorry that I did not follow the conversation before, but what was the reason this was removed?
Infinite values from the input are filtered out by `check_array`. In case those are produced by our computation, we should not silently remove them anyway.
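For reference, `check_array` does reject non-finite input by default, so the scaler never sees it (quick check, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, np.inf]])
try:
    check_array(X)          # finiteness checking is on by default
    rejected = False
except ValueError:
    rejected = True
print(rejected)             # the infinite value is rejected
```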
(force-pushed 78ce989 to b157589)
@ogrisel can we merge?
Should I also update
(force-pushed b157589 to 067df52)
Done.
(force-pushed 067df52 to 6a5a2f7)
Thanks :)
Merged #5104.
Whooooot!
Hi, I am trying to find a way to do partial_fit on RobustScaler. Reading through this discussion and the merged code, it looks like RobustScaler doesn't have partial_fit. I need to use RobustScaler due to outliers but am running into memory issues. Do you have any suggestions for doing partial_fit on RobustScaler? I'd appreciate any help.
To my understanding quantiles cannot be exactly computed over a stream. I'd just suggest you fit the RobustScaler on a sample of data and transform with that fitted model.
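That suggestion can be sketched without scikit-learn: estimate the median and IQR on a sample that fits in memory, then transform the rest in chunks. This is a workaround, not a true `partial_fit`, and all names below are illustrative:

```python
import numpy as np

rng = np.random.RandomState(42)
big = rng.standard_normal((100000, 2))      # stands in for the too-big dataset

# Fit the robust statistics on a memory-sized sample only:
sample = big[rng.choice(len(big), 5000, replace=False)]
center = np.median(sample, axis=0)
iqr = np.percentile(sample, 75, axis=0) - np.percentile(sample, 25, axis=0)

def robust_transform(batch):
    # RobustScaler-style (median/IQR) transform with sample-estimated stats
    return (batch - center) / iqr

# Then stream the full data through in chunks:
out = np.vstack([robust_transform(big[i:i + 10000])
                 for i in range(0, len(big), 10000)])
```

Because the statistics come from a sample, the transform is approximate; how large a sample you need depends on how stable your quantiles are.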
OK, thanks @jnothman! I see other libraries such as msmbuilder that support partial_fit. I am not sure how they are implemented differently, but will try to find the difference.
msmbuilder implements partial_fit as fit: https://github.com/msmbuilder/msmbuilder/blob/556a93a170782f47be53f4a1e9d740fb1c8272b3/msmbuilder/preprocessing/base.py#L129-L144
I am fixing #5028. This is only about `StandardScaler`. I moved the call to `_handle_zeros_in_scale` into `transform` and `inverse_transform`. This is necessary since otherwise we lose the true value of std (and var) that is needed for incremental computation, or otherwise we would need an additional variable only for that, which is not very elegant. Incidentally, this addresses #4609. I have also slightly extended `utils.extmath._batch_mean_variance_update` and did some refactoring in `decomposition.incremental_PCA`, which calls the former.

A more conceptual change is the following: `partial_fit` must interpret a 1d array as a row matrix, i.e. it must be able to process batches of size 1. This clashes with the behaviour of `fit`, which by default assumes that 1d input arrays are column matrices. I think this is a design issue that should be tackled more generally, hence out of scope of this PR. See for example #4511. @GaelVaroquaux may want to comment here.

I have not written tests on the interplay of subsequent calls to `fit` and `partial_fit`, as the semantics do not seem defined yet. See #3896.

Related open discussions:

- `partial_fit` and a few other related fixes / enhancements #3907
- Invariance testing for `partial_fit` #3896

EDIT ~ to-do list:

- `partial_fit` for `StandardScaler`, `MaxAbsScaler` and `MinMaxScaler`
- as discussed, there is now a `row_or_2d` validator for 2d input, which interprets a 1d input as a 1-row matrix (instead of 1-column) in case of need; input is not checked any more, see below and [MRG + 1] Warn on 1D arrays, addresses #4511 #5152
- `scale_` is now a common attribute to all scalers, and it is different from the stat it is computed from, e.g. `mean_`, `max_abs` etc. The latter comes before `_handle_zeros_in_scale`, the former after it. `partial_fit` requires the true stats value for the incremental update. As we need to save it, I also expose it on the object, also in the case only `fit` is used.
- `incr_mean_variance_axis0` in `sparsefuncs_fast.pyx` for avoiding some redundant computation in `_batch_mean_variance_update`