Describe the bug
The drop parameter of OneHotEncoder, when set to 'if_binary', drops one column from every categorical variable, not only from binary variables.
I need this option in #15706, so I would like to propose a PR unless @rushabh-v wants to take care of this.
Steps/Code to Reproduce
import numpy as np
import scipy as sp
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
y = survey.target.values.ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR',
                       'MARR', 'UNION', 'SEX', 'SOUTH']
preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), categorical_columns),
    remainder='passthrough'
)
model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),
        func=np.log10,
        inverse_func=sp.special.exp10
    )
)
# Fit the model only on categorical variables
model.fit(X_train[categorical_columns], y_train)
print("Input feature names")
print(model.named_steps['columntransformer']
      .named_transformers_['onehotencoder'].categories_)
print("Number of modeled input features")
print(len(model.named_steps['transformedtargetregressor'].regressor_.coef_))
print(model.named_steps['columntransformer']
      .named_transformers_['onehotencoder'].drop_idx_)
feature_names = (model.named_steps['columntransformer']
                 .named_transformers_['onehotencoder']
                 .get_feature_names(input_features=categorical_columns))
print("Output feature names")
print(feature_names)
print("Number of output feature names")
print(len(feature_names))
Expected Results
The number of modeled input features and the number of output feature names should be the same.
Actual Results
Input feature names
[array(['Hispanic', 'Other', 'White'], dtype=object), array(['Clerical', 'Management', 'Other', 'Professional', 'Sales',
'Service'], dtype=object), array(['Construction', 'Manufacturing', 'Other'], dtype=object), array(['Married', 'Unmarried'], dtype=object), array(['member', 'not_member'], dtype=object), array(['female', 'male'], dtype=object), array(['no', 'yes'], dtype=object)]
Number of modeled input features
16
Output feature names
['RACE_Hispanic' 'RACE_Other' 'OCCUPATION_Clerical'
'OCCUPATION_Management' 'OCCUPATION_Other' 'OCCUPATION_Professional'
'OCCUPATION_Sales' 'SECTOR_Construction' 'SECTOR_Manufacturing'
'MARR_Unmarried' 'UNION_not_member' 'SEX_male' 'SOUTH_yes']
Number of output feature names
13
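To make the mismatch easier to see, here is a small pure-Python sketch that applies the intended 'if_binary' rule by hand to the categories printed above and compares the result with the feature names actually returned. It does not call scikit-learn, and the helper `expected_if_binary_names` is a hypothetical name used only for illustration. It assumes the first category is the one dropped for binary features, which matches the binary names in the output above.

```python
# Categories copied from the "Input feature names" output above.
categories = {
    "RACE": ["Hispanic", "Other", "White"],
    "OCCUPATION": ["Clerical", "Management", "Other",
                   "Professional", "Sales", "Service"],
    "SECTOR": ["Construction", "Manufacturing", "Other"],
    "MARR": ["Married", "Unmarried"],
    "UNION": ["member", "not_member"],
    "SEX": ["female", "male"],
    "SOUTH": ["no", "yes"],
}

# Feature names copied from the "Output feature names" output above.
observed_names = [
    "RACE_Hispanic", "RACE_Other", "OCCUPATION_Clerical",
    "OCCUPATION_Management", "OCCUPATION_Other", "OCCUPATION_Professional",
    "OCCUPATION_Sales", "SECTOR_Construction", "SECTOR_Manufacturing",
    "MARR_Unmarried", "UNION_not_member", "SEX_male", "SOUTH_yes",
]


def expected_if_binary_names(categories):
    """Hypothetical helper: drop the first category of binary features only."""
    names = []
    for feature, cats in categories.items():
        # Assumption: only features with exactly two categories lose a column.
        kept = cats[1:] if len(cats) == 2 else cats
        names.extend(f"{feature}_{cat}" for cat in kept)
    return names


expected_names = expected_if_binary_names(categories)
missing = [name for name in expected_names if name not in observed_names]

print(len(expected_names))  # 16, the number of fitted coefficients
print(len(observed_names))  # 13
print(missing)  # ['RACE_White', 'OCCUPATION_Service', 'SECTOR_Other']
```

Under these assumptions the hand-computed list has 16 entries, matching the 16 coefficients, and the three names missing from the reported output all belong to non-binary features.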
Versions
>>> import sklearn; sklearn.show_versions()
System:
python: 3.7.5 (default, Dec 15 2019, 17:54:26) [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)]
executable: /home/cmarmo/.skldevenv/bin/python
machine: Linux-5.3.16-300.fc31.x86_64-x86_64-with-fedora-31-Thirty_One
Python dependencies:
pip: 20.0.2
setuptools: 40.8.0
sklearn: 0.23.dev0
numpy: 1.17.2
scipy: 1.3.1
Cython: 0.29.13
pandas: 0.25.1
matplotlib: 3.1.1
joblib: 0.13.2
Built with OpenMP: True