Use np.uint8 as default dtype for OneHotEncoder instead of np.float64 #26063

Open

smith558 opened this issue Apr 3, 2023 · 8 comments

@smith558
Contributor

smith558 commented Apr 3, 2023

Describe the workflow you want to enable

sklearn.preprocessing.OneHotEncoder should use np.uint8 as its default dtype instead of the current np.float64. I don't see the reasoning behind using np.float64; if anything, it only causes memory explosions and potentially (?) even slows down the computation.

Describe your proposed solution

Make np.uint8 the default value of the dtype parameter of sklearn.preprocessing.OneHotEncoder.
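
For illustration, a minimal sketch of the proposed change from the user's side (the data is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"]])

# Current default: the encoded matrix is float64 (8 bytes per stored value).
enc_default = OneHotEncoder(sparse_output=False)
print(enc_default.fit_transform(X).dtype)  # float64

# Proposed default: uint8 (1 byte per stored value), already available
# today by passing dtype explicitly.
enc_uint8 = OneHotEncoder(dtype=np.uint8, sparse_output=False)
print(enc_uint8.fit_transform(X).dtype)  # uint8
```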

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@smith558 smith558 added Needs Triage Issue requires triage New Feature labels Apr 3, 2023
@betatim betatim changed the title sklearn.preprocessing.OneHotEncoder to use np.uint8 by default instead of np.float64 Use np.uint8 as default dtype for preprocessing.OneHotEncoder instead of np.float64 Apr 3, 2023
@betatim betatim changed the title Use np.uint8 as default dtype for preprocessing.OneHotEncoder instead of np.float64 Use np.uint8 as default dtype for OneHotEncoder instead of np.float64 Apr 3, 2023
@BabaYaga1221
Contributor

BabaYaga1221 commented Apr 3, 2023

I would like to take this issue. Thanks

Edit: I changed the dtype from np.float64 to np.uint8 for sklearn.preprocessing.OneHotEncoder. Can you review my PR #26068?

@thomasjpfan
Member

For backwards compatibility, we would need a deprecation cycle before changing the default. There are two use cases I see:

  1. I suspect the output is float because most of our estimators will convert their input to float64. If OneHotEncoder outputs an integer array and is part of a Pipeline, the integer array will be passed to an estimator, which will likely convert it into a floating array, and that conversion is a memory copy. With OneHotEncoder's default float output, no memory copy takes place.
  2. That being said, OneHotEncoder usually sits in a ColumnTransformer, and the ColumnTransformer will make a copy anyway because it has to run np.hstack or pd.concat on each transformer's output. That copy happens regardless of the output dtype of OneHotEncoder.

For case 1, changing the default to int will create a memory copy, but I think it is not very common to have only a OneHotEncoder in a pipeline.

For case 2, changing the default to int will use less memory and the number of memory copies is the same.

In summary, I think changing the default to an integer is okay and worth going through a deprecation cycle.
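
As a small illustration of the conversion copy in case 1: check_array is the validation helper many estimators run on their input. Which dtype a given estimator requests varies, so this is a sketch of the mechanism rather than of any specific estimator:

```python
import numpy as np
from sklearn.utils import check_array

# Stand-in for an integer one-hot output (hypothetical data).
X_int = np.eye(4, dtype=np.uint8)

# Validating with dtype=np.float64 converts the integer array,
# allocating a new float buffer: a memory copy.
X_float = check_array(X_int, dtype=np.float64)
print(X_float.dtype)                     # float64
print(np.shares_memory(X_int, X_float))  # False: a copy was made

# Input that is already float64 passes through without a copy.
X_f64 = np.eye(4, dtype=np.float64)
print(np.shares_memory(X_f64, check_array(X_f64, dtype=np.float64)))  # True
```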

@glemaitre
Member

@thomasjpfan I am not convinced this is worth the annoyance of a deprecation where everyone must set the dtype for two versions. I agree that integers would be more efficient, but if you get an out-of-memory error, you can always set the dtype to work around it.

If we go down this path, then we should also consider modifying the default of OrdinalEncoder.

@smith558
Contributor Author

smith558 commented Apr 4, 2023

> @thomasjpfan I am not convinced this is worth the annoyance of a deprecation where everyone must set the dtype for two versions. I agree that integers would be more efficient, but if you get an out-of-memory error, you can always set the dtype to work around it.
>
> If we go down this path, then we should also consider modifying the default of OrdinalEncoder.

The OrdinalEncoder does not present much of an issue, though, as it is not a binary encoding, and you could forgive it the memory wastage (if any) since np.int32 or even np.int64 would be used there anyway. This is not the case with OneHotEncoder, however, where the memory explosion (and wastage) when working with large datasets is significant and completely unnecessary.
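
A quick sketch of why the two encoders differ in memory impact (toy data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = np.array([["a"], ["b"], ["c"], ["a"]])

# OrdinalEncoder keeps one column regardless of cardinality, so the
# float64 default wastes comparatively little memory.
Xo = OrdinalEncoder().fit_transform(X)
print(Xo.shape, Xo.dtype)  # (4, 1) float64

# OneHotEncoder emits one column per category, so the per-value dtype
# cost is multiplied by the number of categories.
Xh = OneHotEncoder(sparse_output=False).fit_transform(X)
print(Xh.shape, Xh.dtype)  # (4, 3) float64
```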

@betatim
Member

betatim commented Apr 5, 2023

Just to make it explicit: it is currently possible to use OneHotEncoder(dtype=np.uint8, ...) for users who run into memory size problems. Another parameter relevant to the amount of memory used is sparse_output, which is set to True by default.

If it is possible to stick with a sparse matrix, the difference between float and int should be smaller, but I'm not sure what happens in Thomas' case (2).

Has someone done a benchmark looking at how much memory would actually be saved for some (reasonably sized) datasets?

To me it feels like we have a good option for the rare(?) users who have very large outputs from OneHotEncoder, while for the more common(?) user, who is far from memory limits and probably in case (2), it doesn't really matter. This makes me lean towards "no change needed".
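
A minimal sketch of such a benchmark on synthetic data (the sizes and cardinality are hypothetical; actual savings depend on the dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(1_000_000, 1))  # one synthetic categorical column

for dtype in (np.float64, np.uint8):
    Xt = OneHotEncoder(dtype=dtype, sparse_output=True).fit_transform(X)
    # CSR memory = data buffer + indices + indptr; only the data buffer
    # shrinks with a smaller dtype, so sparse savings are real but bounded.
    nbytes = Xt.data.nbytes + Xt.indices.nbytes + Xt.indptr.nbytes
    print(f"{np.dtype(dtype).name}: {nbytes / 1e6:.1f} MB")
```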

@smith558
Contributor Author

smith558 commented Apr 5, 2023

> Just to make it explicit: it is currently possible to use OneHotEncoder(dtype=np.uint8, ...) for users who run into memory size problems. Another parameter relevant to the amount of memory used is sparse_output, which is set to True by default.
>
> If it is possible to stick with a sparse matrix, the difference between float and int should be smaller, but I'm not sure what happens in Thomas' case (2).
>
> Has someone done a benchmark looking at how much memory would actually be saved for some (reasonably sized) datasets?
>
> To me it feels like we have a good option for the rare(?) users who have very large outputs from OneHotEncoder, while for the more common(?) user, who is far from memory limits and probably in case (2), it doesn't really matter. This makes me lean towards "no change needed".

I have found in my use case that this single change (setting dtype to np.uint8) cut my RAM usage by about 40 GB (about 40% of my total memory usage). Granted, my dataset is large and I use a lot of multiprocessing, but it is important to remember (I found) that even with something like GridSearchCV with n_jobs greater than one, this memory difference is further multiplied by each process, ultimately resulting in a hugely significant total memory requirement compared to when dtype=np.uint8 is used.

@thomasjpfan thomasjpfan added this to the 2.0 milestone Apr 6, 2023
@ogrisel
Member

ogrisel commented Apr 6, 2023

If we decide that we should standardize on an output_dtype constructor parameter for several preprocessing transformers (as is being discussed in #25845), then we could use a deprecation cycle to do both changes (the renaming and the change of default value to np.uint8) at the same time.
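
For illustration, one common shape such a deprecation cycle could take (entirely hypothetical code, not scikit-learn's actual implementation): the old default becomes a sentinel, and users who rely on it see a FutureWarning before the new default lands.

```python
import warnings
import numpy as np

class OneHotEncoderSketch:
    """Hypothetical sketch of staging a default-dtype change."""

    def __init__(self, dtype="warn"):  # "warn" is a sentinel, not a real dtype
        self.dtype = dtype

    def _resolve_dtype(self):
        if self.dtype == "warn":
            warnings.warn(
                "The default dtype of OneHotEncoder will change from "
                "np.float64 to np.uint8 in a future release. Pass dtype "
                "explicitly to silence this warning.",
                FutureWarning,
            )
            return np.float64  # keep the old behavior during the cycle
        return self.dtype
```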

@ogrisel
Member

ogrisel commented Apr 6, 2023

If not, then we could do the change of default value as a lightweight breaking change (without an annoying deprecation cycle) targeting a still hypothetical scikit-learn 2.0.
