[MRG+2] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler #5104
Conversation
As an additional note, I do not support sparse inputs at the moment. It is not a trivial extension and I would like to have feedback on what I have so far.
```python
    It assumed that X is already 2d.
    """
    if X.shape[1] == 1:
        return np.transpose(np.array(X, copy=copy))
```
Doesn't this destroy sparsity?
Yes it does. I have edited and integrated a simple test in the next commit.
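For context, the problem with the snippet above can be reproduced with a minimal sketch (variable names are illustrative; a sparsity-preserving branch is shown for comparison):

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[0.0], [3.0], [0.0]])   # a sparse column vector

# Wrapping a sparse matrix in np.array does not densify it into a
# usable 2d ndarray, so the transpose-based reshaping silently breaks.
# A sparsity-aware version branches explicitly:
if sp.issparse(X):
    Xt = X.T                               # transpose stays sparse
else:
    Xt = np.transpose(np.array(X))

print(sp.issparse(Xt))                     # sparsity is preserved
```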
(force-pushed e269b0f to a189567)
```python
# This is the first partial_fit
```
I don't remember what the consensus was w.r.t. calling `fit` and then `partial_fit`. Should we forget the model? I think there are some arguments about that in #3907.
I read #3907 last week: I don't think consensus was reached. In a way, there is no contract with the final user. Last comment by @jnothman:

> The point is that the interaction between `fit` and `partial_fit` hasn't been previously addressed. The assumption is that the user will choose one or the other, but not both. Forcing a particular behaviour by way of the proposed tests will make the matter more consistent, but is also perhaps more of a framework than is actually needed, because the case is somewhat pathological.
>
> I think @ogrisel is planning to put some thought into the possibility of an API for partial fitting, which would settle the semantics for all the cases.
Do you mean that we should not allow the batches to have size 1? This would be a fairly standard application for an online scaler.

No, I mean that a batch of size one should have a shape of

Agreed. Are you suggesting to raise an error? Again, it's hard to see a common behaviour in all the

There is no currently respected behavior but we will move to raising errors if
(force-pushed b0d457a to 01f9d2f)
Possible refactoring
(force-pushed 01f9d2f to f279a16)
+1 for raising a
```python
# New array to avoid side-effects
new_scale = np.array([x for x in scale])
new_scale[new_scale == 0.0] = 1.0
new_scale[~np.isfinite(new_scale)] = 1.0
```
Why silently remove non-finite values? The `check_array`-based input checks in `StandardScaler` and the like should already raise an exception if there are non-finite values in the data. No need to deal with that case again here. If the variance computation overflows (in the event our implementation is not numerically stable) we should not silently hide it.
Agreed. The conversion was already there and I replicated it. No test is broken if that's removed.
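For background, what the zero-handling above guards against can be sketched as a simplified standalone version (not the library code):

```python
import numpy as np

scale = np.array([2.0, 0.0, 5.0])    # a constant feature yields scale 0
new_scale = scale.copy()             # copy to avoid side-effects on the input
new_scale[new_scale == 0.0] = 1.0    # map 0 -> 1 so constant features pass through

X = np.array([[4.0, 7.0, 10.0]])
print(X / new_scale)                 # no divide-by-zero warning; [[2., 7., 2.]]
```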
Can you please add a test to check that https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
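The combine step referenced by that link (Chan et al.'s online algorithm) can be sketched as a standalone helper; the function name is hypothetical, and the PR's actual version lives in `utils.extmath`:

```python
import numpy as np

def incremental_mean_var(batch, last_mean, last_var, last_n):
    # One combine step of the online mean/variance algorithm (Chan et al.).
    # Standalone sketch, not scikit-learn's implementation.
    n = batch.shape[0]
    new_n = last_n + n
    batch_mean = batch.mean(axis=0)
    delta = batch_mean - last_mean
    new_mean = last_mean + delta * n / new_n
    m_a = last_var * last_n          # sum of squared deviations seen so far
    m_b = batch.var(axis=0) * n      # same, for the incoming batch
    new_var = (m_a + m_b + delta ** 2 * last_n * n / new_n) / new_n
    return new_mean, new_var, new_n
```

A test in the spirit of the request above would compare the incremental result against `np.mean`/`np.var` computed on the concatenated data.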
```python
self.min_ = feature_range[0] - data_min * self.scale_
self.data_min = data_min
self.data_max = data_max
self.data_range = data_range
```
All attributes that are computed from the data in the fit method should be suffixed with `_` (scikit-learn convention).
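A minimal sketch of that convention (toy class, not the scikit-learn implementation): attributes estimated from data in `fit` carry a trailing underscore.

```python
import numpy as np

class TinyMinMaxScaler:
    # Toy illustration of the naming convention only.
    def fit(self, X):
        # Estimated-from-data attributes get the trailing underscore:
        self.data_min_ = X.min(axis=0)
        self.data_max_ = X.max(axis=0)
        self.data_range_ = self.data_max_ - self.data_min_
        return self

s = TinyMinMaxScaler().fit(np.array([[1.0, 10.0], [3.0, 20.0]]))
print(s.data_range_)  # [ 2. 10.]
```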
(force-pushed 9ef580c to 93fc4b3)
These tests also pass for numpy versions 1.6, 1.7 and 1.8.

@sdvillal do you refer to

I have double checked

That's it. I made a full pass. +1 for merge after comments are addressed. Thanks a lot for your work and clear tests!

Thank you again @MechCoder. I have started to work on issues related to 1d inputs. Scalers were not fully covered by #5152. I will need to fix some tests too, indeed.

By the way, I have noticed that
(force-pushed 0baeb3d to b771b4f)
Done. I have edited the tests quite a lot, to address your comments and to deal with the new conventions for 1d inputs. Another pass on

The only last thing I want to do is to test the
(force-pushed 94e462f to 78ce989)
```python
elif isinstance(scale, np.ndarray):
    scale[scale == 0.0] = 1.0
    scale[~np.isfinite(scale)] = 1.0
```
Sorry that I did not follow the conversation before, but what was the reason this was removed?
Infinite values from the input are filtered out by `check_array`. In case those are produced by our computation, we should not silently remove them anyway.
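For reference, `check_array` does reject non-finite input by default, so the scaler never sees it (quick check, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, np.inf]])
try:
    check_array(X)          # finiteness checking is on by default
    rejected = False
except ValueError:
    rejected = True
print(rejected)             # the infinite value is rejected
```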
(force-pushed 78ce989 to b157589)
@ogrisel can we merge?
Should I also update
(force-pushed b157589 to 067df52)
Done.
(force-pushed 067df52 to 6a5a2f7)
Thanks :)
Merged #5104.
Whooooot!
Hi, I am trying to find a way to do partial_fit on RobustScaler. Reading through this discussion and the merged code, it looks like RobustScaler doesn't have partial_fit. I need to use RobustScaler due to outliers but am running into memory issues. Do you have any suggestions for doing partial_fit on RobustScaler? I'd appreciate any help.
To my understanding quantiles cannot be exactly computed over a stream. I'd just suggest you fit the RobustScaler on a sample of data and transform with that fitted model.
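That suggestion can be sketched without scikit-learn: estimate the median and IQR on a sample that fits in memory, then transform the rest in chunks. This is a workaround, not a true `partial_fit`, and all names below are illustrative:

```python
import numpy as np

rng = np.random.RandomState(42)
big = rng.standard_normal((100000, 2))      # stands in for the too-big dataset

# Fit the robust statistics on a memory-sized sample only:
sample = big[rng.choice(len(big), 5000, replace=False)]
center = np.median(sample, axis=0)
iqr = np.percentile(sample, 75, axis=0) - np.percentile(sample, 25, axis=0)

def robust_transform(batch):
    # RobustScaler-style (median/IQR) transform with sample-estimated stats
    return (batch - center) / iqr

# Then stream the full data through in chunks:
out = np.vstack([robust_transform(big[i:i + 10000])
                 for i in range(0, len(big), 10000)])
```

Because the statistics come from a sample, the transform is approximate; how large a sample you need depends on how stable your quantiles are.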
OK, thanks @jnothman! I see other libraries such as msmbuilder that support partial_fit. I am not sure how they are implemented differently, but will try to find the difference.
msmbuilder implements partial_fit as fit: https://github.com/msmbuilder/msmbuilder/blob/556a93a170782f47be53f4a1e9d740fb1c8272b3/msmbuilder/preprocessing/base.py#L129-L144
I am fixing #5028. This is only about `StandardScaler`. I moved the call to `_handle_zeros_in_scale` into `transform` and `inverse_transform`. This is necessary since otherwise we lose the true value of std (and var) that is needed for incremental computation, or otherwise we would need an additional variable only for that, which is not very elegant. Incidentally, this addresses #4609. I have also slightly extended `utils.extmath._batch_mean_variance_update` and did some refactoring in `decomposition.incremental_PCA`, which calls the former.

A more conceptual change is the following: `partial_fit` must interpret a 1d array as a row matrix, i.e. it must be able to process batches of size 1. This clashes with the behaviour of `fit`, which by default assumes that 1d input arrays are column matrices. I think this is a design issue that should be tackled more generally, hence out of scope of this PR. See for example #4511. @GaelVaroquaux may want to comment here.

I have not written tests on the interplay of subsequent calls to `fit` and `partial_fit`, as the semantics do not seem defined yet. See #3896.

Related open discussions:

- `partial_fit` and a few other related fixes / enhancements #3907
- Invariance testing for `partial_fit` #3896

EDIT ~ to-do list:

- `partial_fit` for `StandardScaler`, `MaxAbsScaler` and `MinMaxScaler`
- as discussed, there is now a `row_or_2d` validator for 2d input, which interprets a 1d input as a 1-row matrix (instead of 1-column) in case of need; input is not checked any more, see below and [MRG + 1] Warn on 1D arrays, addresses #4511 #5152
- `scale_` is now a common attribute to all scalers, and it is different from the stat it is computed from, e.g. `mean_`, `max_abs` etc. The latter comes before `_handle_zeros_in_scale`, the former after it. `partial_fit` requires the true stats value for the incremental update. As we need to save it, I also expose it on the object, also in the case only `fit` is used.
- `incr_mean_variance_axis0` in `sparsefuncs_fast.pyx` for avoiding some redundant computation in `_batch_mean_variance_update`