OneHotEncoder drop='if_binary' drops one column from all categorical variables #16552

@cmarmo

Describe the bug

When the drop parameter of OneHotEncoder is set to 'if_binary', one column is dropped from every categorical variable, not only from the binary ones.
I need this option for #15706, so I would like to propose a PR unless @rushabh-v wants to take care of this.
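
For reference, the behaviour I would expect from 'if_binary' can be sketched in plain Python. This is only an illustration of the intended rule, not scikit-learn code: `expected_drop_idx` and `expected_feature_names` are hypothetical helpers, and the toy categories are a subset of the ones in the reproduction below.

```python
def expected_drop_idx(categories):
    """Index of the category to drop per feature (None means keep all).

    Assumed rule: only features with exactly two categories lose a column.
    """
    return [0 if len(cats) == 2 else None for cats in categories]


def expected_feature_names(input_features, categories):
    """Output column names that remain after applying the rule above."""
    names = []
    drops = expected_drop_idx(categories)
    for feature, cats, drop in zip(input_features, categories, drops):
        for i, cat in enumerate(cats):
            if i != drop:  # always True when drop is None
                names.append(f"{feature}_{cat}")
    return names


toy_features = ["RACE", "SEX"]
toy_categories = [["Hispanic", "Other", "White"], ["female", "male"]]

print(expected_drop_idx(toy_categories))
print(expected_feature_names(toy_features, toy_categories))
```

With this rule the three-category RACE feature keeps all of its columns and only the binary SEX feature is reduced to a single column.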

Steps/Code to Reproduce

import numpy as np
import scipy as sp
import scipy.special  # load the submodule explicitly so sp.special.exp10 resolves
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split

survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
y = survey.target.values.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR',
                       'MARR', 'UNION', 'SEX', 'SOUTH']

preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), categorical_columns),
    remainder='passthrough'
)

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),
        func=np.log10,
        inverse_func=sp.special.exp10
    )
)

# Fit the model only on categorical variables
model.fit(X_train[categorical_columns], y_train)

print("Input feature names")
print(model.named_steps['columntransformer']
                      .named_transformers_['onehotencoder'].categories_)

print("Number of modeled input features")
print(len(model.named_steps['transformedtargetregressor'].regressor_.coef_))
print(model.named_steps['columntransformer'].named_transformers_['onehotencoder'].drop_idx_)

feature_names = (model.named_steps['columntransformer']
                      .named_transformers_['onehotencoder']
                      .get_feature_names(input_features=categorical_columns))

print("Output feature names")
print(feature_names)
print("Number of output feature names")
print(len(feature_names))

Expected Results

The number of output feature names returned by get_feature_names is the same as the number of input features seen by the regressor (the length of coef_).

Actual Results

Input feature names
[array(['Hispanic', 'Other', 'White'], dtype=object), array(['Clerical', 'Management', 'Other', 'Professional', 'Sales',
       'Service'], dtype=object), array(['Construction', 'Manufacturing', 'Other'], dtype=object), array(['Married', 'Unmarried'], dtype=object), array(['member', 'not_member'], dtype=object), array(['female', 'male'], dtype=object), array(['no', 'yes'], dtype=object)]
Number of modeled input features
16
Output feature names
['RACE_Hispanic' 'RACE_Other' 'OCCUPATION_Clerical'
 'OCCUPATION_Management' 'OCCUPATION_Other' 'OCCUPATION_Professional'
 'OCCUPATION_Sales' 'SECTOR_Construction' 'SECTOR_Manufacturing'
 'MARR_Unmarried' 'UNION_not_member' 'SEX_male' 'SOUTH_yes']
Number of output feature names
13

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.5 (default, Dec 15 2019, 17:54:26)  [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)]
executable: /home/cmarmo/.skldevenv/bin/python
   machine: Linux-5.3.16-300.fc31.x86_64-x86_64-with-fedora-31-Thirty_One

Python dependencies:
       pip: 20.0.2
setuptools: 40.8.0
   sklearn: 0.23.dev0
     numpy: 1.17.2
     scipy: 1.3.1
    Cython: 0.29.13
    pandas: 0.25.1
matplotlib: 3.1.1
    joblib: 0.13.2

Built with OpenMP: True
