[MRG+1] SimpleImputer(strategy="constant") #11211
Conversation
sklearn/impute.py
Outdated
@@ -94,6 +104,13 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
        each column.
    - If "most_frequent", then replace missing using the most frequent
      value along each column.
    - If "constant", then replace missing values with fill_value
style: end with "."
sklearn/impute.py
Outdated
if self.missing_values == "NaN" or np.isnan(self.missing_values):
    force_all_finite = "allow-nan"
else:
    force_all_finite = True
Actually this should be:
if isinstance(self.missing_value, numbers.Real) and np.isnan(self.missing_values):
force_all_finite = "allow-nan"
else:
force_all_finite = True
or even:
force_all_finite = True if self.missing_values is not np.nan else 'allow-nan'
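To illustrate why the isinstance guard matters (a minimal standalone sketch with a hypothetical helper name, not the actual scikit-learn code): np.isnan raises a TypeError on non-numeric markers such as a string, so the marker's type has to be checked first.

```python
import numbers
import numpy as np

def choose_force_all_finite(missing_values):
    # Hypothetical helper: np.isnan() raises TypeError for non-numeric
    # markers (e.g. the string "missing"), so guard with isinstance first.
    if isinstance(missing_values, numbers.Real) and np.isnan(missing_values):
        return "allow-nan"
    return True
```

With this helper, np.nan yields "allow-nan", while both a numeric marker like 0 and a string marker like "missing" yield True.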
sklearn/impute.py
Outdated
@@ -115,16 +132,41 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
Notes
-----
Columns which only contained missing values at `fit` are discarded upon
`transform`.
`transform` is strategy is not "constant"
is -> if
minor comments
sklearn/impute.py
Outdated
if value_to_mask == "NaN" or np.isnan(value_to_mask):
    return np.isnan(X)
if value_to_mask is np.nan:
    # nan values are never equal to themselves
I would add that this way it also works for object dtypes (in case somebody would wonder why not np.isnan is used)
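A quick sketch of that point (plain NumPy, outside the actual impute code): np.isnan cannot be applied to object arrays, while the self-inequality of nan works for any dtype.

```python
import numpy as np

X = np.array(["a", np.nan, "b"], dtype=object)

try:
    np.isnan(X)        # fails: isnan is not defined for object dtype
except TypeError:
    pass

mask = X != X          # nan is the only value not equal to itself
```

Here mask flags exactly the nan entry, for numeric and object arrays alike.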
sklearn/impute.py
Outdated
@@ -80,10 +82,10 @@ class SimpleImputer(BaseEstimator, TransformerMixin):

Parameters
----------
missing_values : integer or "NaN", optional (default="NaN")
missing_values : real number, string, np.nan or None,
    optional (default=np.nan).
Sphinx does not really like this way of wrapping the line. If you want it to render nicely we can do:
missing_values : real number, string, np.nan or None, \
optional (default=np.nan).
The placeholder for ...
This is starting to look really good. As told in real life the main thing missing is a narrative doc update.
Here are some further comments in the code:
sklearn/tests/test_impute.py
Outdated
def test_imputation_constant_object():
    # Test imputation using the constant strategy
    # on objects
nitpick: comment fits on one line.
This comment has been copied several times.
sklearn/impute.py
Outdated
else:
    dtype = object

return np.full(X.shape[1], fill_value, dtype=dtype)
Maybe this should always be:

return np.full(X.shape[1], fill_value, dtype=X.dtype)

independently of fill_value, no? If fill_value is inconsistent with X.dtype after validation, fit should have already raised an informative error message above.
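A small sketch of that trade-off (plain NumPy, not the PR code): np.full only succeeds when fill_value can be cast to the requested dtype, which is why fill_value has to be validated before the statistics array is built.

```python
import numpy as np

stats_num = np.full(3, 0.0, dtype=np.float64)    # numeric fill, numeric dtype
stats_obj = np.full(3, "missing", dtype=object)  # string fill, object dtype

try:
    np.full(3, "missing", dtype=np.float64)      # inconsistent combination
except ValueError:
    pass  # fails before any statistics are produced
```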
sklearn/impute.py
Outdated
if self.strategy == "constant":
    if (X.dtype.kind in ("i", "f")
            and not isinstance(fill_value, numbers.Real)):
        raise ValueError(
This should probably be a TypeError because we check isinstance in the condition.
sklearn/impute.py
Outdated
        and not isinstance(fill_value, numbers.Real)):
    raise ValueError(
        "fill_value={0} is invalid. Expected a numerical value "
        "to numerical data".format(fill_value))
We should also issue informative error messages in the following cases:

elif X.dtype.kind == "O" and not isinstance(fill_value, six.string_types):
    raise TypeError("fill_value={0} is invalid. Expected an str instance"
                    " when imputing categorical data.".format(fill_value))
else:
    raise ValueError("SimpleImputer cannot work on data with dtype={0}:"
                     " expecting numerical or categorical data with dtype=object"
                     "".format(X.dtype))

We should also add tests to check that those exceptions and their messages are raised on invalid inputs. git grep "with pytest.raises" to find examples on how to test exceptions in the scikit-learn test suite using pytest.
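For reference, the pattern looks like this (a standalone sketch with a hypothetical validator, not the actual test file):

```python
import numbers
import pytest

def validate_fill_value(fill_value, dtype_kind):
    # Hypothetical validator mirroring the checks discussed above.
    if dtype_kind in ("i", "f") and not isinstance(fill_value, numbers.Real):
        raise ValueError(
            "fill_value={0} is invalid. Expected a numerical value "
            "when imputing numerical data".format(fill_value))

def test_invalid_fill_value():
    # match= is a regular expression searched in the exception message
    with pytest.raises(ValueError, match="fill_value=x is invalid"):
        validate_fill_value("x", "f")
```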
sklearn/impute.py
Outdated
def _dense_fit(self, X, strategy, missing_values):
    # Constant
    elif strategy == "constant":
style consistency: no need for a blank line here.
sklearn/impute.py
Outdated
if self.missing_values == 'NaN'
    or np.isnan(self.missing_values) else True)
force_all_finite="allow-nan"
    if self.missing_values is np.nan else True)
style: please define a local variable to improve readability:
force_all_finite = "allow-nan" if self.missing_values is np.nan else True
One more comment.
if X.dtype.kind == "O":
    most_frequent = np.empty(X.shape[0], dtype=object)
else:
    most_frequent = np.empty(X.shape[0])
Maybe this should be:
most_frequent = np.empty(X.shape[0], dtype=X.dtype)
@ogrisel actually, we don't always want most_frequent to have the same dtype as X. For example, if there is a column full of NaNs in an integer array, X has integer dtype, whereas most_frequent (which stores the statistics column-wise) will have a np.nan for the column of NaNs.
so discussed here: if you have an integer column that is all -1 and specify a missing_value of -1, then most_frequent will be NaN (due to implementation details, to signal this case), and then X.dtype would clash with NaN.
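A sketch of that corner case (assuming a -1 missing marker on integer data; the variable names are illustrative):

```python
import numpy as np

X = np.array([[1, -1],
              [2, -1],
              [2, -1]])            # second column is entirely "missing"
mask = (X == -1)

most_frequent = np.empty(X.shape[1])   # float64, NOT X.dtype (int)
for i in range(X.shape[1]):
    col = X[:, i][~mask[:, i]]
    most_frequent[i] = np.bincount(col).argmax() if col.size else np.nan
```

The statistics end up as [2.0, nan]; an integer dtype could not store the nan used to flag the all-missing column.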
ok
BTW there are still some tests broken because of the switch from "NaN" to np.nan.

We decided to not use a string as the default marker to avoid having weird behavior in case a user has the "NaN" string in a CSV file for instance. I think it's less surprising to not have a "magic" string in
The switch from "NaN" to np.nan for the default value of missing_values does not seem to be a good idea eventually. Setting a parameter to np.nan as default is not possible right now. There is a general test for all estimators that checks the default parameters (from estimator_checks.py in check_parameters_default_constructible). It results in a nan != nan error. I see 3 solutions:
Which solution should we apply?
Hum. Maybe we should go back to the "NaN" special magic string. But this means that we will not be able to properly handle columns that have this token as a legit value and missing values in the same column. Another option would be to introduce a module level

class MissingValueMarker:
    pass

MISSING = MissingValueMarker()

class SimpleImputer(strategy='...', missing_value=MISSING, ...):
    ...

This marker could by default match
I said "and leave None otherwise" but we could call the fit with missing_values = np.nan for all dtypes. Instead of

if sparse.issparse(X):
    self.statistics_ = self._sparse_fit(X,
                                        self.strategy,
                                        self.missing_values,
                                        fill_value)

we'd have something like

missing_values = np.nan if self.missing_values is None else self.missing_values
if sparse.issparse(X):
    self.statistics_ = self._sparse_fit(X,
                                        self.strategy,
                                        missing_values,
                                        fill_value)

is it fine?
The problem with this approach is that nobody could use
That's a problem indeed.
Is the only issue with np.nan as the default value an issue with
estimator_checks? I'd be in favour of loosening that check to allow for
(old is new or old == new) instead of just (old == new)
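The proposed loosening can be checked in a few lines: np.nan is a module-level float singleton, so identity holds even though equality does not.

```python
import numpy as np

default = np.nan   # default stored on the estimator
param = np.nan     # value on a fresh clone: the same singleton object

assert not (default == param)                    # nan != nan: strict check fails
assert (default is param) or (default == param)  # loosened check passes

# Caveat: a nan produced elsewhere, e.g. float("nan"), is a distinct
# object, so the identity test only helps for the np.nan default itself.
```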
…On 11 June 2018 at 20:03, jeremiedbb wrote:
That's a problem indeed.
Another option would be to introduce a module level MISSING singleton
marker:
it seems that's the only way to avoid the problem of having a default
value different of the np.nan missing_values. And it's more elegant.
I'm going with this solution if it's ok for you
Yes it's the only remaining issue regarding np.nan as the default value. It only happens when checking the constructor. Afterward, the imputer works fine with nans.
Yes, agree with @jnothman.
sklearn/impute.py
Outdated
if not isinstance(fill_value, six.string_types):
    raise TypeError(
        "fill_value={0} is invalid. Expected an str instance "
        "when imputing categorical data.".format(fill_value))
I am not sure it is good / needed to be that strict here. For example, you can have Categorical data in pandas with integer categories (or other non-string values) where it can make sense to fill with another value (and column.dtype.kind will be "O" for categorical columns).
We should have a test for this specific use case: imputing a categorical column with a pandas categorical value.
maybe

if not (isinstance(fill_value, six.string_types)
        or isinstance(fill_value, numbers.Real)
        or fill_value is None)

is enough?
We talked IRL with @jorisvandenbossche and if the categorical dataframe has integer values with missing values, then the check_array conversion will yield a floating point array, so it's already covered by the existing tests.
Let's stay strict in these checks for now. If a user reports a valid use case for being laxer, we can always change that later.
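A minimal illustration of that conversion (plain NumPy standing in for what check_array does; not the actual pipeline):

```python
import numpy as np

# An object column of integers with a missing entry, as a dataframe
# column subset might yield:
col = np.array([1, 2, np.nan], dtype=object)

as_float = col.astype(np.float64)   # conversion performed during validation
```

as_float has a floating point dtype with the missing entry as nan, so the numeric validation path applies rather than the object one.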
Please also avoid @ referencing me in commit messages... if it gets merged, it sends me spam notifications every time someone does something silly (like rebasing) with public forks of scikit-learn.
def _validate_input(self, X):
    allowed_strategies = ["mean", "median", "most_frequent", "constant"]
    if self.strategy not in allowed_strategies:
        raise ValueError("Can only use these strategies: {0} "
No test coverage
if invalid_mask.any():
    missing = np.arange(X.shape[1])[invalid_mask]
    if self.verbose:
        warnings.warn("Deleting features without "
Not tested
This pull request introduces 3 alerts when merging 20456f4 into bb38539 - view on LGTM.com.
Comment posted by LGTM.com
Please add an entry to the change log. Or perhaps modify the existing entry introducing SimpleImputer. Note the differences:
sklearn/impute.py
Outdated
@@ -80,10 +95,10 @@ class SimpleImputer(BaseEstimator, TransformerMixin):

Parameters
----------
missing_values : integer or "NaN", optional (default="NaN")
missing_values : real number, string, np.nan or None, \
    optional (default=np.nan)
Please replace these two lines with:
missing_values : number, string, np.nan (default) or None
what we currently have here is unnecessarily verbose.
sklearn/impute.py
Outdated
            " value when imputing numerical"
            " data".format(fill_value))

elif X.dtype.kind == "O":
Hmmm... object dtype does not necessarily mean string data. And string data does not necessarily mean object dtype (numpy.asarray with strings will not result in object data; we need to test this case). So I'm not sure this validation or its error message is quite right. I don't think we need to support numeric data in object arrays necessarily, but we do need to support string arrays if we're going to support object arrays, IMO, and we need to make sure the error messages are appropriate to the various cases.
+1 for adding support and tests for accepting string arrays even if in most cases that are not a recommended data structure for representing categorical variables.
As for the input validation on object dtype array, we can just drop the elif X.dtype.kind == "O"
case and not raise any TypeError
and accept any fill_value
when X.dtype.kind == "O"
.
Note that there are some complexities to deal with in that case, eg the fact that filling with new values might truncate silently:
In [5]: a = np.array(['a', 'b', 'a'])
In [6]: mask = a == 'b'
In [7]: a[mask] = 'missing'
In [8]: a
Out[8]:
array(['a', 'm', 'a'],
dtype='<U1')
But that's then up to the user's responsibility?
object dtype does not necessarily mean string data
On this part (so not about supporting string dtype), I think we briefly discussed this and our feeling was: "let's be more restrictive than theoretically needed for now, if there is request for it we can always relax later".
For example, a situation where you can get the case of array of object dtype consisting of mixed types (and not only strings), is if you pass a subset of columns of a dataframe with both string and numerical columns to the SimpleImputer. But, normally you will want a different fill value for both types of columns, and you will need a ColumnTransformer anyhow to have a specific SimpleImputer for each type of columns.
Can't we force dtype = object as soon as the input dtype is not numeric? Something like:

if X.dtype.kind not in ("i", "f"):
    X = X.astype(object)

right after check_array?
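That suggestion sidesteps the truncation shown above, since object arrays hold Python strings of any length (a minimal sketch):

```python
import numpy as np

a = np.array(["a", "b", "a"])   # fixed-width dtype '<U1'
a = a.astype(object)            # widen to object before filling
a[a == "b"] = "missing"         # no silent truncation now
```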
Thanks @jorisvandenbossche, indeed imputing a string array is not simple to implement and probably useless in practice.
I think I prefer to raise an informative error message asking the user to provide either numerical data with integer or floating point data types or categorical represented with integer or object datatypes.
If we discover a common pipeline where the implicit conversion from string dtype to object dtype suggested above by @jeremiedbb is helpful we might want to consider it later, but to me it sounds like a YAGNI.
sklearn/impute.py
Outdated
@@ -94,6 +113,13 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
        each column.
    - If "most_frequent", then replace missing using the most frequent
      value along each column.
    - If "constant", then replace missing values with fill_value.
Please note here (or in a Notes section) that most_frequent and constant work with strings or numeric data, while the others only work with numeric data.
sklearn/impute.py
Outdated
raise TypeError("The SimpleImputer does not support this datatype"
                " ({0}). Please provide either numeric data or"
                " categorical data represented by integer or "
                "object datatypes.".format(X.dtype))
I am wondering: do we usually raise TypeError or ValueError when the object type is valid, but the dtype is not?
Also, the message is a bit confusing: one could get the impression that float data is not supported.
"""
SimpleImputer does not work on data with dtype {0}. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
"""
Better be explicit in error message. Verbosity is not a problem here.
numpy raises TypeError for bad dtypes. E.g. np.log(np.asarray(['hello']))
numpy is not very consistent:
>>> np.array(['a', 'b'], dtype=object).astype(np.uint8)
Traceback (most recent call last):
File "<ipython-input-6-3200a6709d46>", line 1, in <module>
np.array(['a', 'b'], dtype=object).astype(np.uint8)
ValueError: invalid literal for int() with base 10: 'a'
It's true that if you consider the individual scalar operation the TypeError raised by a ufunc might make sense. If you consider that the array as a whole is invalid, then ValueError makes more sense. I think both are fine in this case.
sklearn/impute.py
Outdated
        not isinstance(fill_value, numbers.Real)):
    raise TypeError("'fill_value'={0} is invalid. Expected a numerical"
                    " value when imputing numerical"
                    " data".format(fill_value))
As we switched to ValueError above we should probably be consistent and do it here as well.
I think ValueError is fine because this is already what is raised by most models who use the standard check_X_y validation:
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit([['invalid'], ['invalid']], [0, 1])
Traceback (most recent call last):
File "<ipython-input-6-91496e8d832f>", line 1, in <module>
LogisticRegression().fit([['invalid', 'invalid']], [0, 1])
File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1217, in fit
order="C")
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'
Thank you very much @jeremiedbb!
I like the new example, but I feel the example could be simplified. Can we not remove the
It makes the example a bit longer but the parameter names to be set are more user friendly for an example:
Should we reconsider this? If we want to have 'drop' in most of our examples, that might be an indication.
hahahaha. yeah, I suppose that's a risk.
I'm okay with changing it again...
This is WIP to implement the constant strategy for the SimpleImputer as described in #11208.