
[MRG+1] SimpleImputer(strategy="constant") #11211


Merged
merged 33 commits on Jun 20, 2018

Conversation

jeremiedbb
Member

This is WIP to implement the constant strategy for the SimpleImputer as described in #11208.

@@ -94,6 +104,13 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
each column.
- If "most_frequent", then replace missing using the most frequent
value along each column.
- If "constant", then replace missing values with fill_value
Member

style: end with "."

if self.missing_values == "NaN" or np.isnan(self.missing_values):
force_all_finite = "allow-nan"
else:
force_all_finite = True
@ogrisel ogrisel Jun 6, 2018

Actually this should be:

if isinstance(self.missing_values, numbers.Real) and np.isnan(self.missing_values):
    force_all_finite = "allow-nan"
else:
    force_all_finite = True

or even:

force_all_finite = True if self.missing_values is not np.nan else 'allow-nan'
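A quick numpy sketch of why the isinstance guard matters (the helper name is illustrative, not from the PR): np.isnan raises a TypeError on strings, and the shorter `is np.nan` identity test misses other NaN objects such as float('nan').

```python
import numbers
import numpy as np

def needs_allow_nan(missing_values):
    # Guard with isinstance: np.isnan("NaN") raises a TypeError,
    # so only call it on real numbers.
    return bool(isinstance(missing_values, numbers.Real)
                and np.isnan(missing_values))

print(needs_allow_nan(np.nan))        # True
print(needs_allow_nan(0))             # False
print(needs_allow_nan("NaN"))         # False, and no TypeError
print(needs_allow_nan(float("nan")))  # True: catches any NaN, not just np.nan
```

Note that `float("nan") is np.nan` is False, which is why the identity-based one-liner is weaker than the isinstance/isnan check.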

@@ -115,16 +132,41 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
Notes
-----
Columns which only contained missing values at `fit` are discarded upon
`transform`.
`transform` is strategy is not "constant"
Member

is -> if


@jorisvandenbossche jorisvandenbossche left a comment

minor comments

if value_to_mask == "NaN" or np.isnan(value_to_mask):
return np.isnan(X)
if value_to_mask is np.nan:
# nan values are never equal to themselves
Member

I would add that this way it also works for object dtypes (in case somebody wonders why np.isnan is not used).
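The self-inequality trick can be sketched as a standalone illustration (the helper name mirrors the private masking code being discussed, but this is not the PR's actual implementation):

```python
import numpy as np

def get_mask(X, value_to_mask):
    if value_to_mask is np.nan:
        # NaN is never equal to itself, so X != X marks exactly the
        # NaN entries; unlike np.isnan, this also works on object
        # dtype arrays (e.g. strings mixed with np.nan).
        return X != X
    return X == value_to_mask

X = np.array(["a", np.nan, "b"], dtype=object)
print(get_mask(X, np.nan).tolist())  # [False, True, False]
print(get_mask(X, "a").tolist())     # [True, False, False]
```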

@@ -80,10 +82,10 @@ class SimpleImputer(BaseEstimator, TransformerMixin):

Parameters
----------
missing_values : integer or "NaN", optional (default="NaN")
missing_values : real number, string, np.nan or None,
optional (default=np.nan).
Member

Sphinx does not really like this way of wrapping the line...
If you want it to render nicely we can do

    missing_values : real number, string, np.nan or None, \
optional (default=np.nan).
         The placeholder for ...


@ogrisel ogrisel left a comment

This is starting to look really good. As discussed in person, the main thing missing is a narrative doc update.

Here are some further comments in the code:


def test_imputation_constant_object():
# Test imputation using the constant strategy
# on objects
Member

nitpick: comment fits on one line.

Member

This comment has been copied several times.

else:
dtype = object

return np.full(X.shape[1], fill_value, dtype=dtype)
Member

Maybe this should always be:

return np.full(X.shape[1], fill_value, dtype=X.dtype)

independently of fill_value, no? If fill_value is inconsistent with X.dtype after validation, fit should have already raised an informative error message above.

if self.strategy == "constant":
if (X.dtype.kind in ("i", "f")
and not isinstance(fill_value, numbers.Real)):
raise ValueError(
Member

This should probably be a TypeError because we check isinstance in the condition.

and not isinstance(fill_value, numbers.Real)):
raise ValueError(
"fill_value={0} is invalid. Expected a numerical value "
"to numerical data".format(fill_value))
Member

We should also issue informative error messages in the following cases:

elif X.dtype.kind == "O" and not isinstance(fill_value, six.string_types):
    raise TypeError("fill_value={0} is invalid. Expected an str instance"
                    " when imputing categorical data.".format(fill_value))
else:
    raise ValueError("SimpleImputer cannot work on data with dtype={0}:"
                     " expecting numerical or categorical data with"
                     " dtype=object".format(X.dtype))

We should also add tests to check that those exceptions and their message are raised on invalid inputs.

git grep "with pytest.raises" to find examples of how to test exceptions in the scikit-learn test suite using pytest.
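A minimal sketch of such a test, assuming pytest is available; the validation function here is an illustrative stand-in, not the PR's actual code:

```python
import numbers
import numpy as np
import pytest

def validate_fill_value(X, fill_value):
    # Illustrative stand-in for the fit-time validation discussed above.
    if (X.dtype.kind in ("i", "u", "f")
            and not isinstance(fill_value, numbers.Real)):
        raise ValueError(
            "fill_value={0} is invalid. Expected a numerical value "
            "when imputing numerical data".format(fill_value))

def test_constant_invalid_fill_value():
    X = np.array([[1.0, 2.0]])
    # match= takes a regex that is searched against str(exc)
    with pytest.raises(ValueError, match="fill_value=x is invalid"):
        validate_fill_value(X, "x")

test_constant_invalid_fill_value()
print("ok")
```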

def _dense_fit(self, X, strategy, missing_values):
# Constant
elif strategy == "constant":

Member

style consistency: no need for a blank line here.

force_all_finite="allow-nan"
if self.missing_values is np.nan else True)
Member

style: please define a local variable to improve readability:

force_all_finite = "allow-nan" if self.missing_values is np.nan else True

@ogrisel ogrisel left a comment

One more comment.

if X.dtype.kind == "O":
most_frequent = np.empty(X.shape[0], dtype=object)
else:
most_frequent = np.empty(X.shape[0])
Member

Maybe this should be:

most_frequent = np.empty(X.shape[0], dtype=X.dtype)

Member Author

@ogrisel actually, we don't always want most_frequent to have the same dtype as X. For example, if there is a column full of NaNs in an integer array, X has integer dtype, whereas most_frequent (which stores the statistics column-wise) will have a np.nan for the column of NaNs.

Member

As discussed here: if you have an integer column of all -1 and specify missing_values=-1, then most_frequent will be NaN (due to implementation details, to signal this case), and X.dtype would then clash with NaN.
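The clash is easy to reproduce with numpy: a float statistics array can hold the NaN sentinel, while an integer array cannot.

```python
import numpy as np

# The statistics array defaults to float64, which can hold NaN
# as the "column had only missing values" sentinel:
stats = np.empty(2)
stats[0] = np.nan  # fine

# Reusing X.dtype would break for integer input: NaN cannot be
# stored in an integer array.
int_stats = np.empty(2, dtype=np.int64)
try:
    int_stats[0] = np.nan
except ValueError as exc:
    print("ValueError:", exc)
```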

Member

ok

@ogrisel
Member

ogrisel commented Jun 8, 2018

BTW there are still some tests broken because of the switch from "NaN" to np.nan as the default missing value marker.

We decided not to use a string as the default marker to avoid weird behavior in case a user has the "NaN" string in a CSV file, for instance. I think it's less surprising not to have a "magic" string in SimpleImputer, and now is the time to change this because the new SimpleImputer class is still unreleased.

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018
@jeremiedbb
Member Author

The switch from "NaN" to np.nan for the default value of missing_values does not seem to be a good idea after all. Setting a parameter default to np.nan is not possible right now: there is a general test for all estimators that checks the default parameters.

    # We get the default parameters from init and then
    # compare these against the actual values of the attributes.

(from estimator_checks.py in check_parameters_default_constructible)

It results in a nan != nan error.

I see 3 solutions:

  • Modify check_parameters_default_constructible to allow np.nan. I don't think that one is a good idea.
  • Go back to "NaN" for the default value.
  • Set the default value to None and call fit with missing_values=np.nan if X is numerical, and leave None otherwise.

Which solution should we apply?
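The nan != nan failure described above is easy to reproduce; the robust comparison for such a check is identity or an explicit NaN test rather than ==:

```python
import math
import numpy as np

default = np.nan  # the default in the signature
stored = np.nan   # what __init__ stored on the estimator

# Naive equality flags a mismatch even though both are NaN:
print(default == stored)  # False: NaN never compares equal to itself
# Identity works here because both names refer to the np.nan object:
print(default is stored)  # True
# An explicit NaN check is the robust fix for a default-parameter check:
print(default == stored or (math.isnan(default) and math.isnan(stored)))  # True
```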

@ogrisel
Member

ogrisel commented Jun 11, 2018

Hmm, pandas.read_csv uses object dtype columns with np.nan for missing values when parsing CSV files with string contents in a column. So using None as the default missing value for non-numerical columns will not work out of the box for this kind of pipeline.

Maybe we should go back to the "NaN" special magic string. But this means that we will not be able to properly handle columns that have this token as a legitimate value and missing values in the same column.

Another option would be to introduce a module level MISSING singleton marker:

class MissingValueMarker:
    pass


MISSING = MissingValueMarker()


class SimpleImputer(strategy='...', missing_values=MISSING, ...):
    ...

This marker could by default match np.nan both for floating point and object dtype data. For integer-dtyped data it would not match anything, though, and the user would be required to pass a specific integer value explicitly.
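A sketch of how such a sentinel could resolve at fit time (hypothetical helper; as the following comments show, the idea was ultimately dropped in favor of np.nan):

```python
import numpy as np

class MissingValueMarker:
    """Module-level sentinel: 'pick a dtype-appropriate default'."""
    pass

MISSING = MissingValueMarker()

def resolve_missing_values(missing_values, X):
    # Sketch of the proposed resolution: the sentinel defaults to
    # np.nan for float and object data, while integer data would
    # require an explicit marker from the user.
    if missing_values is not MISSING:
        return missing_values
    if X.dtype.kind in ("f", "O"):
        return np.nan
    raise ValueError("integer data requires an explicit missing_values marker")

print(resolve_missing_values(MISSING, np.array([1.0, np.nan])))  # nan
print(resolve_missing_values(-1, np.array([1, -1])))             # -1
```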

@jeremiedbb
Member Author

jeremiedbb commented Jun 11, 2018

I said "and leave None otherwise", but we could call fit with missing_values=np.nan for all dtypes.
So instead of

 if sparse.issparse(X):
      self.statistics_ = self._sparse_fit(X,
                                          self.strategy,
                                          self.missing_values,
                                          fill_value)

we'd have something like

missing_values = np.nan if self.missing_values is None else self.missing_values

 if sparse.issparse(X):
     self.statistics_ = self._sparse_fit(X,
                                         self.strategy,
                                         missing_values,
                                         fill_value)

is it fine?

@ogrisel
Member

ogrisel commented Jun 11, 2018

The problem with this approach is that nobody could use None as the missing value marker in that case. There might be pipelines where None is used upstream as the missing value marker for object dtyped data even if it's not the default behavior of pandas dataframes.

@jeremiedbb
Member Author

That's a problem indeed.

"Another option would be to introduce a module level MISSING singleton marker"

It seems that's the only way to avoid the problem of having a default value different from the np.nan missing_values marker. And it's more elegant.
I'm going with this solution if it's OK with you.

@jnothman
Member

jnothman commented Jun 11, 2018 via email

@jeremiedbb
Member Author

Yes, it's the only remaining issue regarding np.nan as the default value. It only happens when checking the constructor. Afterwards, the imputer works fine with NaNs.

@jorisvandenbossche
Member

Yes, agree with @jnothman. np.nan is the logical default, so it seems a bit stupid to introduce some singleton marker that will be replaced with np.nan afterwards anyhow, just to satisfy the current estimator checks.


if not isinstance(fill_value, six.string_types):
raise TypeError(
"fill_value={0} is invalid. Expected an str instance "
"when imputing categorical data.".format(fill_value))
Member

I am not sure it is good / needed to be that strict here. For example, you can have Categorical data in pandas with integer categories (or other non-string values) where it can make sense to fill with another value (and column.dtype.kind will be "O" for categorical columns)
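A numpy-only illustration of the point: object dtype does not imply strings, so a strict str check on fill_value can reject reasonable fills (the pandas Categorical scenario above is the motivating case; here a plain object array stands in for it).

```python
import numpy as np

# An object-dtype column does not have to contain strings, e.g. a
# categorical column with integer categories can surface as object
# dtype when converted to a numpy array:
col = np.array([1, 2, None], dtype=object)
print(col.dtype.kind)  # O

# A strict "fill_value must be a string" check would reject an
# integer fill for such a column, even though it can make sense:
fill_value = 3
print(isinstance(fill_value, str))  # False -> the strict check raises
```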

Member

We should have a test for this specific use case: imputing a categorical column with a pandas categorical value.

Member Author

maybe

if not (isinstance(fill_value, six.string_types) or isinstance(fill_value, numbers.Real) or fill_value is None)

is enough?

Member

We talked IRL with @jorisvandenbossche: if the categorical dataframe has integer values with missing values, then the check_array conversion will yield a floating point array, so it's already covered by the existing tests.

Let's stay strict in these checks for now. If a user reports a valid use case for being laxer, we can always change that later.

@jnothman jnothman left a comment

Please also avoid @ referencing me in commit messages... if it gets merged, it sends me spam notifications every time someone does something silly (like rebasing) with public forks of scikit-learn.

def _validate_input(self, X):
allowed_strategies = ["mean", "median", "most_frequent", "constant"]
if self.strategy not in allowed_strategies:
raise ValueError("Can only use these strategies: {0} "
Member

No test coverage

if invalid_mask.any():
missing = np.arange(X.shape[1])[invalid_mask]
if self.verbose:
warnings.warn("Deleting features without "
Member

Not tested

@sklearn-lgtm

This pull request introduces 3 alerts when merging 20456f4 into bb38539 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com

@jnothman
Member

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors, if applicable) with :user:.

Or perhaps modify the existing entry introducing SimpleImputer. Note the differences:

  • strategy='constant'
  • strategy='most_frequent' with string columns
  • missing_values="NaN" should now be missing_values=np.nan
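The differences listed above, sketched as a usage example against the merged API (assumes scikit-learn >= 0.20, where SimpleImputer lives in sklearn.impute):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# missing_values now defaults to np.nan instead of the "NaN" string.
num = SimpleImputer(strategy="constant", fill_value=0)
X_num = np.array([[1.0, np.nan], [np.nan, 3.0]])
print(num.fit_transform(X_num))

# strategy="constant" (and "most_frequent") also work on
# object-dtype string columns.
cat = SimpleImputer(strategy="constant", fill_value="missing")
X_cat = np.array([["a", np.nan], [np.nan, "b"]], dtype=object)
print(cat.fit_transform(X_cat))
```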

@@ -80,10 +95,10 @@ class SimpleImputer(BaseEstimator, TransformerMixin):

Parameters
----------
missing_values : integer or "NaN", optional (default="NaN")
missing_values : real number, string, np.nan or None, \
optional (default=np.nan)
Member

Please replace these two lines with:

    missing_values : number, string, np.nan (default) or None

what we currently have here is unnecessarily verbose.

" value when imputing numerical"
" data".format(fill_value))

elif X.dtype.kind == "O":
Member

Hmmm... object dtype does not necessarily mean string data. And string data does not necessarily mean object dtype (numpy.asarray with strings will not result in object data; we need to test this case). So I'm not sure this validation or its error message is quite right. I don't think we need to support numeric data in object arrays necessarily, but we do need to support string arrays if we're going to support object arrays, IMO, and we need to make sure the error messages are appropriate to the various cases.

Member

+1 for adding support and tests for accepting string arrays, even if in most cases that is not a recommended data structure for representing categorical variables.

As for the input validation on object dtype array, we can just drop the elif X.dtype.kind == "O" case and not raise any TypeError and accept any fill_value when X.dtype.kind == "O".

Member

Note that there are some complexities to deal with in that case, e.g. the fact that filling with new values might truncate silently:

In [5]: a = np.array(['a', 'b', 'a'])

In [6]: mask = a == 'b'

In [7]: a[mask] = 'missing'

In [8]: a
Out[8]: 
array(['a', 'm', 'a'],
      dtype='<U1')

But that's then up to the user's responsibility?

Member

object dtype does not necessarily mean string data

On this part (so not about supporting string dtype), I think we briefly discussed this and our feeling was: "let's be more restrictive than theoretically needed for now, if there is request for it we can always relax later".

For example, a situation where you can get the case of array of object dtype consisting of mixed types (and not only strings), is if you pass a subset of columns of a dataframe with both string and numerical columns to the SimpleImputer. But, normally you will want a different fill value for both types of columns, and you will need a ColumnTransformer anyhow to have a specific SimpleImputer for each type of columns.

Member Author

Can't we force dtype=object as soon as the input dtype is not numeric?
Something like:

if X.dtype.kind not in ("i", "f"):
    X = X.astype(object)

right after check_array?

Member

Thanks @jorisvandenbossche, indeed imputing a string array is not simple to implement and probably useless in practice.

I think I prefer to raise an informative error message asking the user to provide either numerical data with integer or floating point data types, or categorical data represented with integer or object datatypes.

If we discover a common pipeline where the implicit conversion from string dtype to object dtype suggested above by @jeremiedbb is helpful we might want to consider it later, but to me it sounds like a YAGNI.

@@ -94,6 +113,13 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
each column.
- If "most_frequent", then replace missing using the most frequent
value along each column.
- If "constant", then replace missing values with fill_value.

Member

Please note here (or in a Notes section) that most_frequent and constant work with strings or numeric data, while the others only work with numeric data.

@jeremiedbb
Member Author

jeremiedbb commented Jun 18, 2018

outdated



raise TypeError("The SimpleImputer does not support this datatype"
" ({0}). Please provide either numeric data or"
" categorical data represented by integer or "
"object datatypes.".format(X.dtype))
Member

I am wondering: do we usually raise TypeError or ValueError when the object type is valid, but the dtype is not?

Also, the message is a bit confusing: one could get the impression that float data is not supported.

"""
SimpleImputer does not work on data with dtype {0}. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
"""

Better to be explicit in the error message; verbosity is not a problem here.

Member

numpy raises TypeError for bad dtypes. E.g. np.log(np.asarray(['hello']))

Member

numpy is not very consistent:

>>> np.array(['a', 'b'], dtype=object).astype(np.uint8)
Traceback (most recent call last):
  File "<ipython-input-6-3200a6709d46>", line 1, in <module>
    np.array(['a', 'b'], dtype=object).astype(np.uint8)
ValueError: invalid literal for int() with base 10: 'a'

It's true that if you consider the individual scalar operation the TypeError raised by a ufunc might make sense. If you consider that the array as a whole is invalid, then ValueError makes more sense. I think both are fine in this case.


not isinstance(fill_value, numbers.Real)):
raise TypeError("'fill_value'={0} is invalid. Expected a numerical"
" value when imputing numerical"
" data".format(fill_value))
@ogrisel ogrisel Jun 20, 2018

As we switched to ValueError above, we should probably be consistent and do it here as well.

I think ValueError is fine because this is already what is raised by most models that use the standard check_X_y validation:

>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit([['invalid'], ['invalid']], [0, 1])
Traceback (most recent call last):
  File "<ipython-input-6-91496e8d832f>", line 1, in <module>
    LogisticRegression().fit([['invalid', 'invalid']], [0, 1])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1217, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'


@ogrisel ogrisel merged commit 0d8a04b into scikit-learn:master Jun 20, 2018
@ogrisel
Member

ogrisel commented Jun 20, 2018

Thank you very much @jeremiedbb!

@amueller
Member

I like the new example, but I feel it could be simplified. Can we not remove the fill_value and handle_unknown parameters?
Also, I forgot why remainder='passthrough' is the default. What was the reason for that again?
What was the reason for going from make_pipeline to Pipeline? It seems it makes stuff quite a bit longer...

@glemaitre
Member

What was the reason for going from make_pipeline to Pipeline? It seems it makes stuff quite a bit longer...

It makes the example a bit longer, but the parameter names to be set are more user-friendly for an example:

  • Initial name: 'columntransformer__pipeline-0__simpleimputer__strategy'
  • Now: 'preprocessor__num__imputer__strategy'

@jorisvandenbossche
Member

Also, I forgot why remainder='passthrough' by default. What was the reason for that again?

Should we reconsider this? If we want to have 'drop' in most of our examples, that might be an indication.
I also had a colleague who had not dropped 'y' from their data, and because it was passed through in the ColumnTransformer, noticed very good results ...

@jnothman
Member

jnothman commented Jul 3, 2018 via email
