Description
I am working with ColumnTransformer and, to reduce memory usage, am trying to produce a float32 sparse matrix. Unfortunately, regardless of the pandas input dtype, the output is always float64.
I've identified one of the scalers in my Pipeline, MaxAbsScaler, as the culprit. Other preprocessing classes, such as OneHotEncoder, accept an optional dtype argument; no such argument exists in MaxAbsScaler (among others). The upcasting appears to happen when check_array is executed, as the sketch below suggests.
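A minimal sketch of what I believe is happening (FLOAT_DTYPES is an internal scikit-learn constant, and the inline explanation is my guess at the code path, not something I've confirmed):

import pandas as pd
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES  # internal: (float64, float32, float16)

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]}, dtype='float32')

# A DataFrame exposes .dtypes (plural) but no single .dtype attribute, so
# check_array presumably cannot detect the original dtype and falls back
# to the first entry of FLOAT_DTYPES, i.e. float64.
print(check_array(df, dtype=FLOAT_DTYPES).dtype)         # float64
print(check_array(df.values, dtype=FLOAT_DTYPES).dtype)  # float32 is preserved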
Is it possible to specify a dtype? Or is there a commonly accepted way to control it from the ColumnTransformer?
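For now, the only workaround I've found is to cast the scaler output back down to float32 with an extra FunctionTransformer step; a minimal sketch, where the step names and column list are placeholders for my real setup:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

# Extra step that downcasts the (dense or sparse) scaler output to float32;
# validate=False keeps FunctionTransformer from running check_array again.
to_float32 = FunctionTransformer(lambda X: X.astype(np.float32), validate=False)

ct = ColumnTransformer([
    ('scaled', Pipeline([
        ('scale', MaxAbsScaler()),
        ('cast', to_float32),
    ]), ['DOW', 'Month', 'Value']),
])

ct.fit_transform(df) should then come out as float32, but this feels like a band-aid rather than the right fix.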
Thank you!
Steps/Code to Reproduce
Example:
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler
df = pd.DataFrame({
    'DOW': [0, 1, 2, 3, 4, 5, 6],
    'Month': [3, 2, 4, 3, 2, 6, 7],
    'Value': [3.4, 4.0, 8.0, 5.0, 3.0, 6.0, 4.0],
})
df = df.astype('float32')
print(df.dtypes)
scaler = MaxAbsScaler()
scaled = scaler.fit_transform(df)  # passing df.values instead preserves float32
print('Transformed Type: ', scaled.dtype)
Expected Results
DOW float32
Month float32
Value float32
dtype: object
Transformed Type: float32
Actual Results
DOW float32
Month float32
Value float32
dtype: object
Transformed Type: float64
Versions
Darwin-18.7.0-x86_64-i386-64bit
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.20.3
Pandas 0.25.1