
[MRG+1] SimpleImputer(strategy="constant") #11211

Merged: 33 commits, Jun 20, 2018

Commits (33)
74384c6
added tests for constant impute strategy in simpleImputer
jeremiedbb Jun 6, 2018
6dd6a5e
typos
jeremiedbb Jun 6, 2018
2b101fb
typos
jeremiedbb Jun 6, 2018
6e13e68
typos
jeremiedbb Jun 6, 2018
f300fe2
added constant strategy to the SimpleImputer.
jeremiedbb Jun 6, 2018
35e30ac
bug fixes on the SimpleImputer and change for default value to np.nan…
jeremiedbb Jun 7, 2018
ea4a929
object dtypes support for "most_frequent" strategy in SimpleImputer
jeremiedbb Jun 7, 2018
10f165b
minor fixes regarding the change of default missing_values="NaN" to n…
jeremiedbb Jun 11, 2018
a6c33b1
Changed the test in estimator_check to allow np.nan as default value …
jeremiedbb Jun 11, 2018
9c2a407
fix for older versions of numpy
jeremiedbb Jun 11, 2018
df8608b
.
jeremiedbb Jun 11, 2018
4517687
fix for old versions of numpy v2
jeremiedbb Jun 12, 2018
1f1c6a0
minor fixes and added doc example for categorical inputs
jeremiedbb Jun 13, 2018
2f6d0b1
DOCTEST fix printing estimator
glemaitre Jun 13, 2018
64fa1bc
Merge remote-tracking branch 'origin/master' into jeremiedbb-constant…
glemaitre Jun 13, 2018
3884d4e
EXA fix example using constant strategy
glemaitre Jun 13, 2018
72eb6b5
COSMIT
glemaitre Jun 13, 2018
cc9aa6f
COSMIT
glemaitre Jun 13, 2018
d4e5226
adressed @glemaitre remarks
jeremiedbb Jun 14, 2018
6efd122
small corrections
jeremiedbb Jun 14, 2018
fbaaa38
small corrections
jeremiedbb Jun 14, 2018
724a4a1
fixed np.nan is not np.float('nan') issue
jeremiedbb Jun 15, 2018
e5f4a1b
add tests for is_scalar_nan
jeremiedbb Jun 15, 2018
d69f855
fixed
jeremiedbb Jun 15, 2018
c3a730d
fixed v2
jeremiedbb Jun 15, 2018
94d7964
adressed @jnothman remark
jeremiedbb Jun 15, 2018
20456f4
add tests for warnings and errors catch
jeremiedbb Jun 17, 2018
e2ae626
Merge branch 'master' into constant-imputer
jeremiedbb Jun 18, 2018
7d3d1b5
dtype checks modifications + more tests
jeremiedbb Jun 18, 2018
972668b
fixed exception catching + go back to not allow any but object dtype
jeremiedbb Jun 18, 2018
f1da7b8
error message update
jeremiedbb Jun 20, 2018
fb1a4e9
with tests update is better
jeremiedbb Jun 20, 2018
c8246f2
TypeError -> ValueError
jeremiedbb Jun 20, 2018
9 changes: 9 additions & 0 deletions doc/conftest.py
@@ -62,6 +62,13 @@ def setup_compose():
raise SkipTest("Skipping compose.rst, pandas not installed")


def setup_impute():
try:
import pandas # noqa
except ImportError:
raise SkipTest("Skipping impute.rst, pandas not installed")


def pytest_runtest_setup(item):
fname = item.fspath.strpath
if fname.endswith('datasets/labeled_faces.rst'):
@@ -76,6 +83,8 @@ def pytest_runtest_setup(item):
setup_working_with_text_data()
elif fname.endswith('modules/compose.rst'):
setup_compose()
elif fname.endswith('modules/impute.rst'):
setup_impute()


def pytest_runtest_teardown(item):
32 changes: 25 additions & 7 deletions doc/modules/impute.rst
@@ -20,19 +20,20 @@ Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
-values, either using the mean, the median or the most frequent value of
-the row or column in which the missing values are located. This class
-also allows for different missing values encodings.
+values. Missing values can be imputed with a provided constant value, or using
+the statistics (mean, median or most frequent) of each column in which the
+missing values are located. This class also allows for different missing values
+encodings.

The following snippet demonstrates how to replace missing values,
encoded as ``np.nan``, using the mean value of the columns (axis 0)
that contain the missing values::

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
->>> imp = SimpleImputer(missing_values='NaN', strategy='mean')
+>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]]) # doctest: +NORMALIZE_WHITESPACE
-SimpleImputer(copy=True, missing_values='NaN', strategy='mean', verbose=0)
+SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X)) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[[4. 2. ]
@@ -45,7 +46,7 @@ The :class:`SimpleImputer` class also supports sparse matrices::
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit(X) # doctest: +NORMALIZE_WHITESPACE
-SimpleImputer(copy=True, missing_values=0, strategy='mean', verbose=0)
+SimpleImputer(copy=True, fill_value=None, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test)) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[[4. 2. ]
@@ -56,6 +57,23 @@ Note that, here, missing values are encoded by 0 and are thus implicitly stored
in the matrix. This format is thus suitable when there are many more missing
values than observed values.

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
``'constant'`` strategy::

>>> import pandas as pd
>>> df = pd.DataFrame([["a", "x"],
... [np.nan, "y"],
... ["a", np.nan],
... ["b", "y"]], dtype="category")
...
>>> imp = SimpleImputer(strategy="most_frequent")
>>> print(imp.fit_transform(df)) # doctest: +NORMALIZE_WHITESPACE
[['a' 'x']
['a' 'y']
['a' 'y']
['b' 'y']]
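The new ``'constant'`` strategy accepts string data as well. A minimal sketch, assuming scikit-learn with this PR merged (0.20 or later); the array and the ``fill_value`` token are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Object-dtype array so that strings and np.nan can coexist;
# np.nan marks the missing entries.
X = np.array([["a", "x"],
              [np.nan, "y"],
              ["b", np.nan]], dtype=object)

# Every missing cell is replaced by the same fixed token.
imp = SimpleImputer(strategy="constant", fill_value="missing")
print(imp.fit_transform(X))
```

Unlike ``'most_frequent'``, the result does not depend on the distribution of the observed values, which makes the imputed token easy to spot downstream (for instance as its own one-hot category).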

.. _mice:

Multivariate feature imputation
@@ -76,7 +94,7 @@ Here is an example snippet::
>>> imp = MICEImputer(n_imputations=10, random_state=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
MICEImputer(imputation_order='ascending', initial_strategy='mean',
-max_value=None, min_value=None, missing_values='NaN', n_burn_in=10,
+max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
n_imputations=10, n_nearest_features=None, predictor=None,
random_state=0, verbose=False)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
11 changes: 11 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -613,6 +613,17 @@ Imputer
SimpleImputer().fit_transform(X.T).T)``). :issue:`10829` by :user:`Guillaume
Lemaitre <glemaitre>` and :user:`Gilberto Olimpio <gilbertoolimpio>`.

- The :class:`impute.SimpleImputer` has a new strategy, ``'constant'``, to
complete missing values with a fixed one, given by the ``fill_value``
parameter. This strategy supports numeric and non-numeric data, and so does
the ``'most_frequent'`` strategy now. :issue:`11211` by :user:`Jeremie du
Boisberranger <jeremiedbb>`.

- The NaN marker for the missing values has been changed between the
:class:`preprocessing.Imputer` and the :class:`impute.SimpleImputer`.
``missing_values='NaN'`` should now be ``missing_values=np.nan``.
:issue:`11211` by :user:`Jeremie du Boisberranger <jeremiedbb>`.
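Combining the two entries above, a minimal sketch of the new usage on numeric data (``fill_value=0`` is spelled out here, although 0 is also the documented numeric default):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[np.nan, 2], [6, np.nan], [7, 6]]

# missing_values is now the np.nan scalar, not the old 'NaN' string marker.
imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
print(imp.fit_transform(X))
```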

Outlier Detection models

- More consistent outlier detection API:
@@ -27,14 +27,16 @@
from __future__ import print_function

import pandas as pd
import numpy as np

-from sklearn.compose import make_column_transformer
-from sklearn.pipeline import make_pipeline
+from sklearn.compose import ColumnTransformer
+from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, CategoricalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
@@ -49,36 +51,37 @@
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
-numeric_features = ['age', 'fare']
-categorical_features = ['embarked', 'sex', 'pclass']

-# Provisionally, use pd.fillna() to impute missing values for categorical
-# features; SimpleImputer will eventually support strategy="constant".
-data[categorical_features] = data[categorical_features].fillna(value='missing')

# We create the preprocessing pipelines for both numeric and categorical data.
-numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
-categorical_transformer = CategoricalEncoder('onehot-dense',
-                                             handle_unknown='ignore')
+numeric_features = ['age', 'fare']
+numeric_transformer = Pipeline(steps=[
+    ('imputer', SimpleImputer(strategy='median')),
+    ('scaler', StandardScaler())])
+
+categorical_features = ['embarked', 'sex', 'pclass']
+categorical_transformer = Pipeline(steps=[
+    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
+    ('onehot', CategoricalEncoder('onehot-dense', handle_unknown='ignore'))])

-preprocessing_pl = make_column_transformer(
-    (numeric_features, numeric_transformer),
-    (categorical_features, categorical_transformer),
-    remainder='drop'
-)
+preprocessor = ColumnTransformer(
+    transformers=[
+        ('num', numeric_transformer, numeric_features),
+        ('cat', categorical_transformer, categorical_features)],
+    remainder='drop')

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
-clf = make_pipeline(preprocessing_pl, LogisticRegression())
+clf = Pipeline(steps=[('preprocessor', preprocessor),
+                      ('classifier', LogisticRegression())])

X = data.drop('survived', axis=1)
-y = data.survived.values
+y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
shuffle=True)

clf.fit(X_train, y_train)
-print("model score: %f" % clf.score(X_test, y_test))
+print("model score: %.3f" % clf.score(X_test, y_test))


###############################################################################
@@ -93,12 +96,12 @@


param_grid = {
-    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
-    'logisticregression__C': [0.1, 1.0, 1.0],
+    'preprocessor__num__imputer__strategy': ['mean', 'median'],
+    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10, iid=False)
grid_search.fit(X_train, y_train)

-print(("best logistic regression from grid search: %f"
+print(("best logistic regression from grid search: %.3f"
% grid_search.score(X_test, y_test)))