Use `np.uint8` as default dtype for `OneHotEncoder` instead of `np.float64` #26063
Comments
I would like to take this issue. Thanks!

Edit: I changed the […]
For backwards compatibility, we would need a deprecation cycle before changing the default. There are two use cases I see: […]

For case 1, changing the default to int will create a memory copy, but I think it is not very common to only have a […]. For case 2, changing the default to int will use less memory and the number of memory copies is the same.

In summary, I think changing the default to an integer is okay and worth going through a deprecation.
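The copy behaviour in case 1 can be illustrated with a small NumPy sketch (plain NumPy, not the encoder itself; the shapes and values below are made up for illustration):

```python
import numpy as np

# A small one-hot block built by indexing an identity matrix
# (a stand-in for OneHotEncoder output; values are illustrative).
one_hot_u8 = np.eye(3, dtype=np.uint8)[[0, 2, 1, 0]]

# Case 1: a downstream float operation upcasts the integer matrix,
# allocating a brand-new float64 copy.
scaled = one_hot_u8 * 0.5
print(one_hot_u8.dtype, scaled.dtype)  # uint8 float64

# Case 2: if the consumer accepts integers, uint8 stores the same
# 0/1 information in 1/8 of the memory of float64.
one_hot_f64 = one_hot_u8.astype(np.float64)
print(one_hot_f64.nbytes // one_hot_u8.nbytes)  # 8
```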
@thomasjpfan I am not convinced this is worth the annoyance of the deprecation, where everyone must set the […] explicitly. If we go this path, then we should also consider modifying the default of […].
The […]
Just to make it explicit: it is currently possible to […]. If it is possible to stick with a sparse matrix, the difference between float and int should be smaller, but I'm not sure what happens in Thomas' case (2). Has someone done a benchmark looking at how much memory would actually be saved for some (reasonably sized) datasets?

To me it feels like we have a good option for the rare(?) users that have very large outputs from […].
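As a rough back-of-the-envelope answer to the benchmark question, the output sizes can be estimated without running the encoder at all. The row and category counts below are made-up assumptions, and the CSR overhead estimate assumes 4-byte (int32) indices:

```python
import numpy as np

# Assumed dataset size: 1M rows, one categorical feature with 100 levels.
n_rows, n_categories = 1_000_000, 100

# Dense output stores one value per (row, category) cell.
dense_f64 = n_rows * n_categories * np.dtype(np.float64).itemsize
dense_u8 = n_rows * n_categories * np.dtype(np.uint8).itemsize

# CSR sparse output stores one nonzero per row: a data value plus a
# 4-byte column index, and n_rows + 1 row pointers.
nnz = n_rows
sparse_f64 = nnz * (8 + 4) + (n_rows + 1) * 4
sparse_u8 = nnz * (1 + 4) + (n_rows + 1) * 4

for name, nbytes in [
    ("dense float64", dense_f64),
    ("dense uint8", dense_u8),
    ("sparse float64", sparse_f64),
    ("sparse uint8", sparse_u8),
]:
    print(f"{name:>14}: {nbytes / 1e6:8.1f} MB")
```

Under these assumptions, dense float64 costs 800 MB versus 100 MB for uint8, but once the output is sparse the dtype change only shrinks roughly 16 MB to 9 MB, supporting the point that sparse output already covers the large-data case.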
I have found in my use case that making the single change (changing […]
If we decide that we should standardize on a […]
If not, then we could change the default value as a lightweight breaking change (without an annoying deprecation cycle) targeting a still-hypothetical scikit-learn 2.0.
Describe the workflow you want to enable

`sklearn.preprocessing.OneHotEncoder` should use `np.uint8` as the default `dtype` instead of `np.float64`, as it does currently. I don't see the reasoning behind using `np.float64`; if anything, it only causes memory explosions and potentially (?) even slows down the computation.

Describe your proposed solution

Make `np.uint8` the default for the `dtype` parameter of `sklearn.preprocessing.OneHotEncoder`.

Describe alternatives you've considered, if relevant

No response

Additional context

No response