-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
SimpleImputer's fill_value validation seems too strict #29381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
From
In any case, both return In [12]: np.can_cast(type(1), np.uint8, casting="same_kind")
Out[12]: False
In [13]: np.can_cast(type(1), np.uint8, casting="safe")
Out[13]: False |
Ah, yes, sorry, my suggestion was using the actual In [3]: np.can_cast(type(0), "uint8", casting="same_kind")
Out[3]: False
In [4]: np.can_cast(0, "uint8", casting="same_kind")
Out[4]: True
In [6]: np.can_cast(-1, "uint8", casting="same_kind")
Out[6]: False The |
Hm, but just saw this warning in the docs:
So probably a no go. One workaround is using a |
It seems like currently this is up to the user to fix it by using import pandas as pd
from sklearn import impute
df = pd.Series([0, 1, 2], dtype="uint8").to_frame()
impute.SimpleImputer(strategy="constant", fill_value=np.uint8(0)).fit_transform(df) I am wondering whether things are good enough or whether there is room for improvement, maybe in the error message when For completeness, I noticed too that trying In [1]: import numpy as np
In [2]: np.can_cast(1, np.uint8)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 np.can_cast(1, np.uint8)
TypeError: can_cast() does not support Python ints, floats, and complex because the result used to depend on the value.
This change was part of adopting NEP 50, we may explicitly allow them again in the future. |
It feels like long-term Numpy may potentially support casting from Python scalars as mentioned in the error message above and that doing nothing in the mean time (even if this takes a bit of time) is a reasonable option. Two potential alternatives as a stop-gap improvement:
|
was brainstorming potential solutions for this issue, was trying to use @lesteve 's second option but i think there is no clean way to convert python scalars to a dtype which is not known in advance, here is a relevant stack overflow thread. perhaps i am missing something? in any case i think it would make for a nicer user experience if we took either of the two options presented above, and would be interested in helping to write something. edit: will probably not look at this. for anyone wants to take this issue, feel free |
@daustria great if you are interested in contributing, I would start with my suggestion and see where that goes. There may be some tricky edge cases with my current idea, I am guessing this is why Numpy 2 np.array(1000).astype(np.int8, casting='same_kind')
# output is array(-24, dtype=int8) Some pieces of advice:
|
I will try to work on this. |
I believe this is the best way to solve this problem. Since NumPy does not handle Python scalars within its own library, I think trying to solve the problem on our end would be an unfeasible option. What do you think about adding to the error message the information that the fill_value needs to be of a NumPy type? |
@ojoaomorais feel free to have a go at this and open a PR! I think something that would be nice if we chose the "better error message" route (in contrary to the "automatic conversion" route):
As a reminder, the current error message is like this:
I am proposing something like (improvements more than welcome of course!):
|
Hey @lesteve any update on this? If there's any issue in the PR made can I work on this? |
Describe the bug
The
SimpleImputer
checks whether the type of thefill_value
can be cast with numpy to the dtype of the input data (X
) usingnp.can_cast(fill_value_dtype, X.dtype, casting="same_kind")
:scikit-learn/sklearn/impute/_base.py
Line 397 in 0ad90d5
This seems too strict to me, and means one cannot impute a uint8 array with fill_value=0 (a python int). Replacing the validation with something like
np.can_cast(fill_value, X.dtype, casting="safe")
would be more permissible, and without knowing all the details looks safe enough, but perhaps I'm missing something.Steps/Code to Reproduce
Though the example in isolation doesn't make a lot of sense (imputing data that doesn't contain missing values), this case might occur with generically defined transformation pipelines applied to unknown datasets.
Expected Results
Actual Results
Versions
The text was updated successfully, but these errors were encountered: