Use np.uint8 as default dtype for OneHotEncoder instead of np.float64 #26063

Open

smith558 opened this issue Apr 3, 2023 · 8 comments

@smith558
Contributor

smith558 commented Apr 3, 2023

Describe the workflow you want to enable

sklearn.preprocessing.OneHotEncoder should use np.uint8 as its default dtype instead of the current np.float64. I don't see the reasoning behind using np.float64; if anything, it only causes memory explosions and potentially (?) even slows down the computation.

Describe your proposed solution

Make np.uint8 the default value of the dtype parameter of sklearn.preprocessing.OneHotEncoder.
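
For illustration, a minimal sketch of the proposed change from the user's side (the data is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"]])

# Current default: the encoded matrix is float64 (8 bytes per stored value).
enc_default = OneHotEncoder(sparse_output=False)
print(enc_default.fit_transform(X).dtype)  # float64

# Proposed default: uint8 (1 byte per stored value), already available
# today by passing dtype explicitly.
enc_uint8 = OneHotEncoder(dtype=np.uint8, sparse_output=False)
print(enc_uint8.fit_transform(X).dtype)  # uint8
```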

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@smith558 smith558 added Needs Triage Issue requires triage New Feature labels Apr 3, 2023
@betatim betatim changed the title sklearn.preprocessing.OneHotEncoder to use np.uint8 by default instead of np.float64 Use np.uint8 as default dtype for preprocessing.OneHotEncoder instead of np.float64 Apr 3, 2023
@betatim betatim changed the title Use np.uint8 as default dtype for preprocessing.OneHotEncoder instead of np.float64 Use np.uint8 as default dtype for OneHotEncoder instead of np.float64 Apr 3, 2023
@BabaYaga1221
Contributor

BabaYaga1221 commented Apr 3, 2023

I would like to take this issue. Thanks

Edit: I changed the dtype from np.float64 to np.uint8 for sklearn.preprocessing.OneHotEncoder. Can you review my PR #26068?

@thomasjpfan
Member

For backwards compatibility, we would need a deprecation cycle before changing the default. There are two use cases I see:

  1. I suspect the output is float because most of our estimators will convert their input to float64. If OneHotEncoder outputs an integer array and is part of a Pipeline, the integer array will be passed to an estimator, which will likely convert it into a floating array, and that conversion is a memory copy. With OneHotEncoder's default float output, no memory copy takes place.
  2. That being said, OneHotEncoder usually sits in a ColumnTransformer, and the ColumnTransformer will make a copy anyway because it has to run np.hstack or pd.concat on each transformer's output. That copy happens regardless of the output dtype of OneHotEncoder.

For case 1, changing the default to int will create a memory copy, but I think it is not very common to have only a OneHotEncoder in a pipeline.

For case 2, changing the default to int will use less memory and the number of memory copies is the same.

In summary, I think changing the default to an integer is okay and worth going through a deprecation cycle.
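
As a small illustration of the conversion copy in case 1: check_array is the validation helper many estimators run on their input. Which dtype a given estimator requests varies, so this is a sketch of the mechanism rather than of any specific estimator:

```python
import numpy as np
from sklearn.utils import check_array

# Stand-in for an integer one-hot output (hypothetical data).
X_int = np.eye(4, dtype=np.uint8)

# Validating with dtype=np.float64 converts the integer array,
# allocating a new float buffer: a memory copy.
X_float = check_array(X_int, dtype=np.float64)
print(X_float.dtype)                     # float64
print(np.shares_memory(X_int, X_float))  # False: a copy was made

# Input that is already float64 passes through without a copy.
X_f64 = np.eye(4, dtype=np.float64)
print(np.shares_memory(X_f64, check_array(X_f64, dtype=np.float64)))  # True
```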

@glemaitre
Member

@thomasjpfan I am not convinced this is worth the annoyance of a deprecation where everyone must set the dtype for two versions. I agree that integers would be more efficient, but if you get an out-of-memory error, you can always set the dtype to work around it.

If we go down this path, then we should also consider modifying the default of OrdinalEncoder.

@smith558
Contributor Author

smith558 commented Apr 4, 2023

> @thomasjpfan I am not convinced this is worth the annoyance of a deprecation where everyone must set the dtype for two versions. I agree that integers would be more efficient, but if you get an out-of-memory error, you can always set the dtype to work around it.
>
> If we go down this path, then we should also consider modifying the default of OrdinalEncoder.

The OrdinalEncoder does not present much of an issue, though, as it is not a binary encoding, and you could forgive it the memory wastage (if any) since np.int32 or even np.int64 would be used there anyway. This is not the case with OneHotEncoder, however, where the memory explosion (and wastage) when working with large datasets is significant and completely unnecessary.
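
A quick sketch of why the two encoders differ in memory impact (toy data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = np.array([["a"], ["b"], ["c"], ["a"]])

# OrdinalEncoder keeps one column regardless of cardinality, so the
# float64 default wastes comparatively little memory.
Xo = OrdinalEncoder().fit_transform(X)
print(Xo.shape, Xo.dtype)  # (4, 1) float64

# OneHotEncoder emits one column per category, so the per-value dtype
# cost is multiplied by the number of categories.
Xh = OneHotEncoder(sparse_output=False).fit_transform(X)
print(Xh.shape, Xh.dtype)  # (4, 3) float64
```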

@betatim
Member

betatim commented Apr 5, 2023

Just to make it explicit: it is currently possible to use OneHotEncoder(dtype=np.uint8, ...) for users who run into memory size problems. Another parameter relevant to the amount of memory used is sparse_output, which is set to True by default.

If it is possible to stick with a sparse matrix, the difference between float and int should be smaller, but I'm not sure what happens in Thomas' case (2).

Has someone done a benchmark looking at how much memory would actually be saved for some (reasonably sized) datasets?

To me it feels like we have a good option for the rare(?) users who have very large outputs from OneHotEncoder, while for the more common(?) user, who is far from memory limits and probably in case (2), it doesn't really matter. This makes me lean towards "no change needed".
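
A minimal sketch of such a benchmark on synthetic data (the sizes and cardinality are hypothetical; actual savings depend on the dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(1_000_000, 1))  # one synthetic categorical column

for dtype in (np.float64, np.uint8):
    Xt = OneHotEncoder(dtype=dtype, sparse_output=True).fit_transform(X)
    # CSR memory = data buffer + indices + indptr; only the data buffer
    # shrinks with a smaller dtype, so sparse savings are real but bounded.
    nbytes = Xt.data.nbytes + Xt.indices.nbytes + Xt.indptr.nbytes
    print(f"{np.dtype(dtype).name}: {nbytes / 1e6:.1f} MB")
```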

@smith558
Contributor Author

smith558 commented Apr 5, 2023

> Just to make it explicit: it is currently possible to use OneHotEncoder(dtype=np.uint8, ...) for users who run into memory size problems. Another parameter relevant to the amount of memory used is sparse_output, which is set to True by default.
>
> If it is possible to stick with a sparse matrix, the difference between float and int should be smaller, but I'm not sure what happens in Thomas' case (2).
>
> Has someone done a benchmark looking at how much memory would actually be saved for some (reasonably sized) datasets?
>
> To me it feels like we have a good option for the rare(?) users who have very large outputs from OneHotEncoder, while for the more common(?) user, who is far from memory limits and probably in case (2), it doesn't really matter. This makes me lean towards "no change needed".

I have found in my use case that this single change (setting dtype to np.uint8) cut my RAM usage by about 40 GB (about 40% of my total memory usage). Granted, my dataset is large and I use a lot of multiprocessing, but it is important to remember (I found) that even with something like GridSearchCV with n_jobs greater than one, this memory difference is further multiplied by each process, ultimately resulting in a hugely significant total memory requirement compared to when dtype=np.uint8 is used.

@thomasjpfan thomasjpfan added this to the 2.0 milestone Apr 6, 2023
@ogrisel
Member

ogrisel commented Apr 6, 2023

If we decide that we should standardize on an output_dtype constructor parameter for several preprocessing transformers (as is being discussed in #25845), then we could use a deprecation cycle to do both changes (the renaming and the change of default value to np.uint8) at the same time.
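
For illustration, one common shape such a deprecation cycle could take (entirely hypothetical code, not scikit-learn's actual implementation): the old default becomes a sentinel, and users who rely on it see a FutureWarning before the new default lands.

```python
import warnings
import numpy as np

class OneHotEncoderSketch:
    """Hypothetical sketch of staging a default-dtype change."""

    def __init__(self, dtype="warn"):  # "warn" is a sentinel, not a real dtype
        self.dtype = dtype

    def _resolve_dtype(self):
        if self.dtype == "warn":
            warnings.warn(
                "The default dtype of OneHotEncoder will change from "
                "np.float64 to np.uint8 in a future release. Pass dtype "
                "explicitly to silence this warning.",
                FutureWarning,
            )
            return np.float64  # keep the old behavior during the cycle
        return self.dtype
```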

@ogrisel
Member

ogrisel commented Apr 6, 2023

If not, then we could do the change of default value as a lightweight breaking change (without an annoying deprecation cycle) targeting a still hypothetical scikit-learn 2.0.
