
[MRG] ignore NaNs in PowerTransformer #11306


Merged · 34 commits · Jun 21, 2018
Commits (34)
5dba57a
EHN ignore NaN when incrementing mean and var
glemaitre Jun 5, 2018
591c18f
EHN ignore NaNs in StandardScaler for dense case
glemaitre Jun 5, 2018
43fa54b
EHN should handle sparse case
glemaitre Jun 8, 2018
faab12a
TST launch common tests
glemaitre Jun 8, 2018
fa12fe7
FIX use loop
glemaitre Jun 8, 2018
4d60bfe
FIX number of samples first iteration
glemaitre Jun 8, 2018
eba3087
FIX use proper sparse constructor
glemaitre Jun 11, 2018
1782e69
BUG wrong index and remove nan counter
glemaitre Jun 11, 2018
0b074ce
revert
glemaitre Jun 11, 2018
a6cb76d
cleanup
glemaitre Jun 11, 2018
ab2c465
FIX use sum on the boolean array
glemaitre Jun 11, 2018
76691a9
TST equivalance function and class
glemaitre Jun 11, 2018
d5ece66
Merge remote-tracking branch 'origin/master' into nan_standardscaler
glemaitre Jun 12, 2018
504323c
Merge remote-tracking branch 'origin/master' into nan_standardscaler
glemaitre Jun 15, 2018
8d13da1
backward compatibility for n_samples_seen_
glemaitre Jun 15, 2018
124742b
spelling
glemaitre Jun 15, 2018
2ffe497
TST revert some test for back compatibility
glemaitre Jun 15, 2018
424fdba
TST check NaN are ignored in incremental_mean_and_variance
glemaitre Jun 15, 2018
0fe8a3e
TST check NaNs ignore in incr_mean_variance
glemaitre Jun 15, 2018
f267a35
OPTIM cython variable typing
glemaitre Jun 15, 2018
4785fb2
DOC corrections
glemaitre Jun 15, 2018
082633d
DOC mentioned that NaNs are ignored in Notes
glemaitre Jun 16, 2018
d79b867
TST shape of n_samples_seen with missing values
glemaitre Jun 16, 2018
cb077ea
DOC fix spelling
glemaitre Jun 16, 2018
449d24a
DOC whats new entry
glemaitre Jun 16, 2018
48d70b0
tmp
glemaitre Jun 17, 2018
43cbe9a
ignore nan by slicing
glemaitre Jun 17, 2018
4019c3b
iter
glemaitre Jun 17, 2018
9cbb7e8
TST mark as allowing nan
glemaitre Jun 17, 2018
45f4dfd
COMPAT scipy lower than 0.14
glemaitre Jun 18, 2018
0072288
Update data.py
glemaitre Jun 18, 2018
3b29485
Merge remote-tracking branch 'origin/master' into nan_powertransformer
glemaitre Jun 21, 2018
08960ec
EHN: add boxcox in fixes
glemaitre Jun 21, 2018
e626a39
PEP8
glemaitre Jun 21, 2018
4 changes: 4 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -256,6 +256,10 @@ Preprocessing
ignore and pass-through NaN values.
:issue:`11206` by :user:`Guillaume Lemaitre <glemaitre>`.

- :class:`preprocessing.PowerTransformer` and
:func:`preprocessing.power_transform` ignore and pass-through NaN values.
:issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.

Model evaluation and meta-estimators

- A scorer based on :func:`metrics.brier_score_loss` is also available.
26 changes: 21 additions & 5 deletions sklearn/preprocessing/data.py
@@ -24,7 +24,7 @@
from ..utils import check_array
from ..utils.extmath import row_norms
from ..utils.extmath import _incremental_mean_and_var
from ..utils.fixes import nanpercentile
from ..utils.fixes import boxcox, nanpercentile
from ..utils.sparsefuncs_fast import (inplace_csr_row_normalize_l1,
inplace_csr_row_normalize_l2)
from ..utils.sparsefuncs import (inplace_column_scale,
@@ -836,6 +836,9 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):

Notes
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.

For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py
<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
@@ -973,6 +976,9 @@ def maxabs_scale(X, axis=0, copy=True):

Notes
-----
NaNs are treated as missing values: disregarded to compute the statistics,
and maintained during the data transformation.

For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py
<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
@@ -2429,6 +2435,9 @@ class PowerTransformer(BaseEstimator, TransformerMixin):

Notes
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.

For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py
<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
@@ -2468,7 +2477,10 @@ def fit(self, X, y=None):
transformed = []

for col in X.T:
col_trans, lmbda = stats.boxcox(col, lmbda=None)
# NaNs would bias the estimation of lambda, so exclude them
# before computing it; the transform itself lets NaN pass through.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
col_trans = boxcox(col, lmbda)
self.lambdas_.append(lmbda)
transformed.append(col_trans)

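The fit loop above can be sketched standalone: for each column, lambda is estimated on the observed values only, and the NaN-propagating `scipy.special.boxcox` is then applied to the full column. The data below is a hypothetical toy array, not from the PR.

```python
import numpy as np
from scipy import stats, special

# strictly positive toy data with one missing entry per column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

lambdas, transformed = [], []
for col in X.T:
    # estimate lambda on the observed values only
    _, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
    # special.boxcox is a ufunc, so NaNs simply propagate
    transformed.append(special.boxcox(col, lmbda))
    lambdas.append(lmbda)

X_trans = np.column_stack(transformed)
```

The key point is that the missing entries come out of the transform exactly where they went in, while the fitted lambdas are unaffected by them.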
@@ -2493,7 +2505,7 @@ def transform(self, X):
X = self._check_input(X, check_positive=True, check_shape=True)

for i, lmbda in enumerate(self.lambdas_):
X[:, i] = stats.boxcox(X[:, i], lmbda=lmbda)
X[:, i] = boxcox(X[:, i], lmbda)

if self.standardize:
X = self._scaler.transform(X)
@@ -2548,9 +2560,10 @@ def _check_input(self, X, check_positive=False, check_shape=False,
check_method : bool
If True, check that the transformation method is valid.
"""
X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES, copy=self.copy)
X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES, copy=self.copy,
force_all_finite='allow-nan')

if check_positive and self.method == 'box-cox' and np.any(X <= 0):
if check_positive and self.method == 'box-cox' and np.nanmin(X) <= 0:
raise ValueError("The Box-Cox transformation can only be applied "
"to strictly positive data")
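The switch to `np.nanmin` matters because a single NaN poisons `np.min`, which would make a positivity check based on the plain minimum meaningless. A minimal illustration on made-up data:

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [2.0, 3.0]])

plain_min = np.min(X)      # NaN: poisoned by the missing value
robust_min = np.nanmin(X)  # 1.0: NaN entries are skipped
```

With `nanmin`, the `<= 0` check still rejects non-positive data while ignoring missing values.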

@@ -2622,6 +2635,9 @@ def power_transform(X, method='box-cox', standardize=True, copy=True):

Notes
-----
NaNs are treated as missing values: disregarded to compute the statistics,
and maintained during the data transformation.

For a comparison of the different scalers, transformers, and normalizers,
see :ref:`examples/preprocessing/plot_all_scaling.py
<sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
17 changes: 11 additions & 6 deletions sklearn/preprocessing/tests/test_common.py
@@ -10,10 +10,12 @@

from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import scale
from sklearn.preprocessing import power_transform
from sklearn.preprocessing import quantile_transform

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

from sklearn.utils.testing import assert_array_equal
@@ -28,19 +30,22 @@ def _get_valid_samples_by_column(X, col):


@pytest.mark.parametrize(
"est, func, support_sparse",
[(MinMaxScaler(), minmax_scale, False),
(StandardScaler(), scale, False),
(StandardScaler(with_mean=False), scale, True),
(QuantileTransformer(n_quantiles=10), quantile_transform, True)]
"est, func, support_sparse, strictly_positive",
[(MinMaxScaler(), minmax_scale, False, False),
(StandardScaler(), scale, False, False),
(StandardScaler(with_mean=False), scale, True, False),
(PowerTransformer(), power_transform, False, True),
(QuantileTransformer(n_quantiles=10), quantile_transform, True, False)]
)
def test_missing_value_handling(est, func, support_sparse):
def test_missing_value_handling(est, func, support_sparse, strictly_positive):
# check that the preprocessing methods let NaN pass through
rng = np.random.RandomState(42)
X = iris.data.copy()
n_missing = 50
X[rng.randint(X.shape[0], size=n_missing),
rng.randint(X.shape[1], size=n_missing)] = np.nan
if strictly_positive:
X += np.nanmin(X) + 0.1
X_train, X_test = train_test_split(X, random_state=1)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
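The NaN-injection pattern used in this test can be reproduced standalone; the array below is hypothetical strictly positive data standing in for iris.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.uniform(1.0, 5.0, size=(20, 3))  # strictly positive toy data

# scatter missing values at random row/column positions
n_missing = 10
X[rng.randint(X.shape[0], size=n_missing),
  rng.randint(X.shape[1], size=n_missing)] = np.nan

# sanity check mirrored from the test: no column may be entirely NaN
all_nan_cols = np.all(np.isnan(X), axis=0)
```

Because at most 10 of the 60 entries are replaced, no column of 20 rows can ever become all-NaN, so the sanity check always holds here.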
3 changes: 2 additions & 1 deletion sklearn/utils/estimator_checks.py
@@ -78,7 +78,8 @@
'RandomForestRegressor', 'Ridge', 'RidgeCV']

ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MICEImputer',
'MinMaxScaler', 'StandardScaler', 'QuantileTransformer']
'MinMaxScaler', 'StandardScaler', 'PowerTransformer',
'QuantileTransformer']


def _yield_non_meta_checks(name, estimator):
11 changes: 11 additions & 0 deletions sklearn/utils/fixes.py
@@ -70,6 +70,17 @@ def divide(x1, x2, out=None, dtype=None):
return out


# scipy.special.boxcox, which lets NaN pass through, is only
# available in scipy >= 0.14; fall back to stats.boxcox before that
if sp_version < (0, 14):
from scipy import stats

def boxcox(x, lmbda):
with np.errstate(invalid='ignore'):
return stats.boxcox(x, lmbda)
else:
from scipy.special import boxcox # noqa
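The version-gated shim above can be sketched in isolation; `sp_version` is re-derived here (in simplified form) since `sklearn.utils.fixes` normally provides it.

```python
import numpy as np
import scipy
from scipy import stats

# simplified version tuple, e.g. "1.11.4" -> (1, 11)
sp_version = tuple(int(v) for v in scipy.__version__.split('.')[:2])

if sp_version < (0, 14):
    def boxcox(x, lmbda):
        # old scipy: fall back to stats.boxcox, silencing NaN warnings
        with np.errstate(invalid='ignore'):
            return stats.boxcox(x, lmbda)
else:
    from scipy.special import boxcox  # NaN-propagating ufunc

# lambda = 0 reduces the Box-Cox transform to the natural log
out = boxcox(np.array([1.0, np.e]), 0.0)
```

Either branch exposes the same `boxcox(x, lmbda)` signature, so the caller in `data.py` never needs to know which scipy is installed.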


if sp_version < (0, 15):
# Backport fix for scikit-learn/scikit-learn#2986 / scipy/scipy#4142
from ._scipy_sparse_lsqr_backport import lsqr as sparse_lsqr