
[MRG+2] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler #5104


Merged
merged 1 commit into scikit-learn:master on Oct 13, 2015

Conversation

giorgiop
Contributor

I am fixing #5028. This is only about StandardScaler. I moved the call to _handle_zeros_in_scale into transform and inverse_transform. This is necessary because otherwise we lose the true value of std (and var), which is needed for the incremental computation; the alternative would be an extra variable kept only for that purpose, which is not very elegant. Incidentally, this addresses #4609. I have also slightly extended utils.extmath._batch_mean_variance_update and did some refactoring in decomposition.incremental_PCA, which calls the former.

A more conceptual change is the following. partial_fit must interpret a 1d array as a row matrix, i.e. it must be able to process batches of size 1. This clashes with the behaviour of fit, which by default assumes that 1d input arrays are column matrices. I think this is a design issue that should be tackled more generally, hence out of scope for this PR. See for example #4511. @GaelVaroquaux may want to comment here.

I have not written tests on the interplay of subsequent calls to fit and partial_fit, as the semantics do not seem to be defined yet. See #3896
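For context, the incremental computation discussed above follows the pairwise-combination formula of Chan et al. A minimal sketch of the idea, with a hypothetical name and signature (not the actual _batch_mean_variance_update API):

```python
import numpy as np

def batch_mean_variance_update(X, old_mean, old_var, old_n):
    """Hypothetical sketch of the pairwise (Chan et al.) mean/variance
    update; names and signature are illustrative only."""
    new_n = X.shape[0]
    new_mean = X.mean(axis=0)
    new_var = X.var(axis=0)          # biased (population) variance

    total_n = old_n + new_n
    delta = new_mean - old_mean
    # Combined mean: weighted average of the two batch means.
    mean = old_mean + delta * new_n / total_n
    # Combine the sums of squared deviations (M2), then renormalize:
    # M2_total = M2_old + M2_new + delta**2 * n_old * n_new / n_total
    m2 = old_var * old_n + new_var * new_n + delta ** 2 * old_n * new_n / total_n
    return mean, m2 / total_n, total_n
```

Feeding the second half of a dataset into this update, starting from the statistics of the first half, reproduces the statistics of the whole dataset.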

Related open discussions:

EDIT ~ to do list:

  • partial_fit for StandardScaler, MaxAbsScaler and MinMaxScaler
  • support for CSC and CSR inputs for all but the last
  • as discussed, there is now a validator for 2d input that, when needed, interprets a 1d array as a one-row matrix (instead of one-column): row_or_2d. Input is not checked any more; see below and [MRG + 1] Warn on 1D arrays, addresses #4511 #5152
  • scale_ is now an attribute common to all scalers, and it is distinct from the statistic it is computed from, e.g. mean_, max_abs etc. The latter is taken before _handle_zeros_in_scale, the former after it. partial_fit requires the true statistic for the incremental update; since we need to store it anyway, I also expose it on the object, even when only fit is used
  • cosmetics in docstrings, function calls and Cython indentation
  • remove silent copy inside _handle_zeros_in_scale
  • double check incr_mean_variance_axis0 in sparsefuncs_fast.pyx for avoiding some redundant computation
  • additional test for numerical stability of _batch_mean_variance_update
  • respect attribute name convention
  • test for numerical issues with NumPy < 1.9
  • cosmetics for line breaks in expressions etc.
  • remove tests with 1d inputs
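To illustrate the scale_ vs. raw-statistic split from the list above: a simplified sketch, where the helper is a stripped-down stand-in for the real _handle_zeros_in_scale (which handles more cases):

```python
import numpy as np

def handle_zeros_in_scale(scale):
    """Simplified stand-in: features with zero spread would cause a
    division by zero in transform, so their scale is replaced by 1,
    leaving those (constant) features unchanged after centering."""
    scale = scale.copy()  # explicit copy, cf. the "silent copy" item above
    scale[scale == 0.0] = 1.0
    return scale

# The raw statistic (here a per-feature std) must stay untouched, because
# partial_fit needs its true value for the incremental update ...
std_ = np.array([2.0, 0.0, 0.5])
# ... while transform divides by the patched version.
scale_ = handle_zeros_in_scale(std_)
```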

@giorgiop
Contributor Author

As an additional note, I do not support sparse inputs at the moment. It is not a trivial extension and I would like some feedback on what I have so far.

    It is assumed that X is already 2d.
    """
    if X.shape[1] == 1:
        return np.transpose(np.array(X, copy=copy))
Contributor

Doesn't this destroy sparsity?

Contributor Author

Yes it does. I have edited and integrated a simple test in the next commit.

@giorgiop giorgiop force-pushed the scaler-partial-fit branch 3 times, most recently from e269b0f to a189567 Compare August 10, 2015 17:08
@amueller
Member

partial_fit should only accept 2d X if possible.


# This is the first partial_fit
Member

I don't remember what the consensus was wrt calling fit and then partial_fit. Should we forget the model? I think there are some arguments about that in #3907.

Contributor Author

I read #3907 last week: I don't think consensus was reached. In a way, there is no contract with the final user. The last comment by @jnothman:

The point is that the interaction between fit and partial_fit hasn't been previously addressed. The assumption is that the user will choose one or the other, but not both. Forcing a particular behaviour by way of the proposed tests will make the matter more consistent, but is also perhaps more of a framework than is actually needed, because the case is somewhat pathological.

I think @ogrisel is planning to put some thought into the possibility of an API for partial fitting, which would settle the semantics for all the cases.

@giorgiop
Contributor Author

partial_fit should only accept 2d X if possible.

Do you mean that we should not allow the batches to have size 1? This would be a fairly standard application for an online scaler.

@amueller
Member

No, I mean that a batch of size one should have a shape of (1, n_features). [Also probably not a good idea for python but I would still allow it]

@giorgiop
Contributor Author

Agreed. Are you suggesting we raise an error? Again, it's hard to see a common behaviour across all the partial_fit methods in the code.

@amueller
Member

There is no currently respected behavior, but we will move to raising errors if X.ndim != 2.
In the meantime you can raise an error (as this is a new function).

@giorgiop giorgiop force-pushed the scaler-partial-fit branch 2 times, most recently from b0d457a to 01f9d2f Compare August 16, 2015 20:51
@giorgiop
Contributor Author

  • partial_fit for StandardScaler, MaxAbsScaler and MinMaxScaler
  • support for CSC and CSR inputs for all but the last
  • as discussed above, there is now a validator for 2d input that, when needed, interprets a 1d array as a one-row matrix (instead of one-column): row_or_2d
  • scale_ is now an attribute common to all scalers, and it is distinct from the statistic it is computed from, e.g. mean_, max_abs etc. The latter is taken before _handle_zeros_in_scale, the former after it. partial_fit requires the true statistic for the incremental update; since we need to store it anyway, I also expose it on the object, even when only fit is used

Possible refactoring

  • utils.extmath._batch_mean_variance_update
  • utils.sparsefuncs_fast.incr_mean_variance_axis
    implement the same algorithm; the second is cythonized to handle sparse input.

@giorgiop giorgiop force-pushed the scaler-partial-fit branch from 01f9d2f to f279a16 Compare August 16, 2015 21:12
@ogrisel
Member

ogrisel commented Aug 17, 2015

+1 for raising a ValueError with an informative error message when X.ndim != 2 in partial_fit.
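A minimal sketch of such a check (the helper name and the message wording are illustrative, not the validation scikit-learn ended up with):

```python
import numpy as np

def check_2d(X):
    """Raise a ValueError unless X has shape (n_samples, n_features)."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(
            "Expected a 2d array of shape (n_samples, n_features), got an "
            "array with ndim=%d. Reshape a single sample to (1, -1)." % X.ndim)
    return X
```

A batch of size one then passes as shape (1, n_features), while a bare 1d array is rejected with an informative message.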

# New array to avoid side-effects
new_scale = np.array([x for x in scale])
new_scale[new_scale == 0.0] = 1.0
new_scale[~np.isfinite(new_scale)] = 1.0
Member

Why silently remove non-finite values? The check_array-based input checks in StandardScaler and the like should already raise an exception if there are non-finite values in the data. No need to deal with that case again here. If the variance computation overflows (in the event our implementation is not numerically stable) we should not silently hide it.

Contributor Author

Agreed. The conversion was already there and I replicated it. No test is broken if that's removed.

@ogrisel
Member

ogrisel commented Aug 18, 2015

Can you please add a test to check that _batch_mean_variance_update is numerically stable on some synthetic data that is specifically engineered to trigger numerical problems on a naive version of the online mean / variance computation such as the first formulas presented in:

https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
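The failure mode in question is easy to trigger: when the data's mean is large relative to its spread, the naive E[x^2] - E[x]^2 formula cancels catastrophically, while a shifted two-pass computation stays accurate. A small illustration (not taken from the PR's tests):

```python
import numpy as np

# Data whose mean is huge relative to its spread: the classic trigger for
# catastrophic cancellation in the naive variance formula.
x = 1e9 + np.array([1.0, 2.0, 3.0])

# Naive "textbook" formula E[x^2] - E[x]^2 (the first form on the linked
# page): both terms are ~1e18, so the subtraction wipes out nearly all
# significant digits (in this example it collapses to 0.0 in float64).
naive_var = np.mean(x ** 2) - np.mean(x) ** 2

# Shifted two-pass computation stays accurate (true variance is 2/3).
stable_var = np.mean((x - np.mean(x)) ** 2)
```

A test for the online update should use data of this kind, so that a naively implemented incremental formula fails while a stable one passes.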

@giorgiop giorgiop changed the title [WIP] partial_fit for StandardScaler [WIP] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler Aug 20, 2015
self.min_ = feature_range[0] - data_min * self.scale_
self.data_min = data_min
self.data_max = data_max
self.data_range = data_range
Member

All attributes that are computed from the data in the fit method should be suffixed with _ (scikit-learn convention).

@giorgiop giorgiop force-pushed the scaler-partial-fit branch 3 times, most recently from 9ef580c to 93fc4b3 Compare August 21, 2015 11:03
@sdvillal
Contributor

These tests pass also for numpy versions 1.6, 1.7 and 1.8.

@giorgiop
Contributor Author

These tests pass also for numpy versions 1.6, 1.7 and 1.8.

@sdvillal do you mean the ones in test_data or the ones in this PR?

@giorgiop
Contributor Author

I have double-checked preprocessing.tests.test_data.test_standard_scaler_numerical_stability, which does raise a warning when numpy 1.8 is used, as opposed to utils.tests.test_extmath.test_incremental_variance_update_formulas, which does not.

@MechCoder
Member

That's it. I made a full pass. +1 for merge after comments are addressed.

Thanks a lot for your work and clear tests!

@MechCoder MechCoder changed the title [MRG+1] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler [MRG+2] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler Sep 16, 2015
@giorgiop
Contributor Author

Thank you again @MechCoder. I have started to work on the issues related to 1d inputs. Scalers were not fully covered by #5152. I will indeed need to fix some tests too.

@giorgiop
Contributor Author

By the way, I have noticed that copy is an argument of transform and inverse_transform only in StandardScaler. Do we want to make this uniform and add it everywhere? It might be useful, as it seems more natural to me to think about copy when we are transforming the data than when we are creating the object, where we must keep the parameter anyway to avoid silently overwriting the input through fit.

@giorgiop
Contributor Author

Done. I have edited the tests quite a lot, to address your comments and to deal with the new conventions for 1d inputs. Another pass on preprocessing.tests.test_data by reviewers would be safe before merging. Thanks again @ogrisel @amueller @MechCoder!

The only remaining thing I want to do is to test the StandardScaler with "infinite" streams of data, so as to see the effects of under/overflow. I will open another PR for that.
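With the merged API, the equivalence between a single fit and chunked partial_fit calls can be checked along these lines (a sketch using the post-merge StandardScaler attributes; the chunk size is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Reference: fit once on the full data.
full = StandardScaler().fit(X)

# Stream the same data through partial_fit in chunks.
incremental = StandardScaler()
for chunk in np.array_split(X, 10):
    incremental.partial_fit(chunk)

# mean_ and scale_ of the two scalers should now agree.
```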

@giorgiop giorgiop force-pushed the scaler-partial-fit branch 2 times, most recently from 94e462f to 78ce989 Compare September 17, 2015 12:19
elif isinstance(scale, np.ndarray):
scale[scale == 0.0] = 1.0
scale[~np.isfinite(scale)] = 1.0
Member

Sorry that I did not follow the conversation before, but what was the reason this was removed?

Contributor Author

Infinite values in the input are filtered out by check_array. In case they are produced by our own computation, we should not silently remove them anyway.

@MechCoder
Member

I shall leave the final comments and merge to @ogrisel and @amueller (though I already see a +1 from Olivier).

@MechCoder
Member

@ogrisel we can merge?

@giorgiop
Contributor Author

Should I also update whats_new in here?

@giorgiop
Contributor Author

Should I also update whats_new in here?

Done.

@amueller
Member

Thanks :)

amueller added a commit that referenced this pull request Oct 13, 2015
[MRG+2] partial_fit for StandardScaler, MinMaxScaler and MaxAbsScaler
@amueller amueller merged commit 9564b94 into scikit-learn:master Oct 13, 2015
@GaelVaroquaux
Member

GaelVaroquaux commented Oct 14, 2015 via email

@giorgiop giorgiop deleted the scaler-partial-fit branch November 3, 2015 12:30
@talipini

Hi - I am trying to find a way to do partial_fit on RobustScaler. Reading through this discussion and the merged code, it looks like RobustScaler doesn't have partial_fit. I need to use RobustScaler due to outliers but am running into memory issues. Do you have any suggestions for doing partial_fit on RobustScaler? I'd appreciate any help.

@jnothman
Member

jnothman commented Jul 31, 2019 via email

@talipini

talipini commented Aug 1, 2019

Ok, thanks @jnothman! I see other libraries, such as msmbuilder, that support partial_fit. I am not sure how they are implemented differently, but I will try to find the difference.

@jnothman
Member

jnothman commented Aug 2, 2019 via email


10 participants