
[MRG+1] BUG: reset internal state of scaler before fitting #5416

Merged: 1 commit, merged on Oct 16, 2015.

58 changes: 58 additions & 0 deletions sklearn/preprocessing/data.py
@@ -252,15 +252,36 @@ def data_range(self):
def data_min(self):
return self.data_min_

def _reset(self):
"""Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.
"""

# Checking one attribute is enough, because they are all set together
# in partial_fit
if hasattr(self, 'scale_'):
del self.scale_
del self.min_
del self.n_samples_seen_
del self.data_min_
del self.data_max_
del self.data_range_

Member commented:
Is it worth trying to write slightly more generic code, something like:

attributes = [a for a in dir(self) if a.endswith('_')]

for attr in attributes:
    delattr(self, attr)

Member replied:
could go into BaseEstimator even. but not now ;)
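
For anyone applying that generic idea literally: dir() also lists dunder names such as __init__ (which end in an underscore too) and class-level properties such as the deprecated std_, neither of which should be deleted. A minimal sketch with those guards, iterating only over attributes actually stored on the instance; the name _reset_generic is made up for the sketch, and this is purely illustrative, not part of this PR:

    def _reset_generic(self):
        # Fitted attributes follow the scikit-learn convention of a
        # trailing underscore. Iterating over vars(self) touches only
        # instance attributes, so dunders and class-level properties
        # (e.g. the deprecated std_) are never hit, and __init__
        # parameters like copy do not end with '_' and stay untouched.
        for attr in list(vars(self)):
            if attr.endswith('_') and not attr.startswith('__'):
                delattr(self, attr)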


def fit(self, X, y=None):
"""Compute the minimum and maximum to be used for later scaling.

It always resets the object's internal state first.

Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to compute the per-feature minimum and maximum
used for later scaling along the features axis.
"""

# Reset internal state before fitting
self._reset()
return self.partial_fit(X, y)

def partial_fit(self, X, y=None):
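
To make the MinMaxScaler change concrete: fit used to delegate straight to partial_fit, so refitting on data with a different number of features collided with the stale per-feature state. A small sketch of the behavior this patch establishes, written against this branch:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.RandomState(0)
    scaler = MinMaxScaler()
    scaler.fit(rng.rand(10, 4))    # stores per-feature state for 4 features
    # fit() now calls _reset() first, so this refit starts from scratch
    # instead of mixing 2-feature statistics into 4-feature state.
    scaler.fit(rng.rand(10, 2))
    print(scaler.data_min_.shape)  # (2,)
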
@@ -489,9 +510,25 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
def std_(self):
return self.scale_

def _reset(self):
"""Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.
"""

# Checking one attribute is enough, because they are all set together
# in partial_fit
if hasattr(self, 'scale_'):
del self.scale_
del self.n_samples_seen_
del self.mean_
del self.var_

def fit(self, X, y=None):
"""Compute the mean and std to be used for later scaling.

It always resets the object's internal state first.

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
@@ -500,6 +537,9 @@ def fit(self, X, y=None):

y: Passthrough for ``Pipeline`` compatibility.
"""

# Reset internal state before fitting
self._reset()
return self.partial_fit(X, y)

def partial_fit(self, X, y=None):
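
The same pattern in StandardScaler also pins down the fit/partial_fit contract: partial_fit accumulates statistics across calls, while fit now discards them first. A sketch of the expected difference, relying on the n_samples_seen_ counter this estimator maintains:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X1 = np.random.rand(60, 3)
    X2 = np.random.rand(40, 3)

    inc = StandardScaler()
    inc.partial_fit(X1)
    inc.partial_fit(X2)
    # partial_fit keeps accumulating: statistics cover all 100 samples.
    assert inc.n_samples_seen_ == 100

    cold = StandardScaler()
    cold.fit(X1)
    cold.fit(X2)
    # fit resets first, so only the last batch of 40 samples counts.
    assert cold.n_samples_seen_ == 40
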
@@ -671,15 +711,33 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):
def __init__(self, copy=True):
self.copy = copy

def _reset(self):
"""Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.
"""

# Checking one attribute is enough, because they are all set together
# in partial_fit
if hasattr(self, 'scale_'):
del self.scale_
del self.n_samples_seen_
del self.max_abs_

def fit(self, X, y=None):
"""Compute the maximum absolute value to be used for later scaling.

It always resets the object's internal state first.

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the per-feature maximum absolute value
used for later scaling along the features axis.
"""

# Reset internal state before fitting
self._reset()

Member commented:
argh I didn't pay attention to this before.

return self.partial_fit(X, y)

def partial_fit(self, X, y=None):
16 changes: 16 additions & 0 deletions sklearn/preprocessing/tests/test_data.py
@@ -1498,3 +1498,19 @@ def test_one_hot_encoder_unknown_transform():
oh = OneHotEncoder(handle_unknown='42')
oh.fit(X)
assert_raises(ValueError, oh.transform, y)


def test_fit_cold_start():
X = iris.data
X_2d = X[:, :2]

# Scalers that have a partial_fit method
scalers = [StandardScaler(with_mean=False, with_std=False),
MinMaxScaler(),
MaxAbsScaler()]

for scaler in scalers:
scaler.fit_transform(X)
# with a different shape, this may break the scaler unless the internal
# state is reset
scaler.fit_transform(X_2d)
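
Before this fix, the second fit_transform above would fold 2-feature batch statistics into state stored for 4 features, typically surfacing as a NumPy broadcasting error. On releases without the fix, refitting a fresh clone instead of reusing the fitted instance gives the same cold start; a sketch, reusing X and X_2d from the test:

    from sklearn.base import clone

    scaler = MinMaxScaler()
    scaler.fit_transform(X)
    # clone() copies the constructor parameters but none of the fitted
    # state, so the second fit starts cold even without _reset.
    scaler = clone(scaler)
    scaler.fit_transform(X_2d)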