Closed
Describe the bug
I'm trying to run a grid search with n_jobs=-1 while working with pandas output, and it fails despite set_config(transform_output = "pandas"). I have to manually call .set_output(transform='pandas') on the ColumnTransformer for it to work.
Steps/Code to Reproduce
Preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn import set_config
set_config(transform_output = "pandas")
# Toy dataframe
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X, y = df.drop(columns='D'), df['D']>0
# Custom transformer that needs dataframe
class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Fit failed"
        self.cols = X.columns
        return self

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Transform failed"
        return X.copy()

    def get_feature_names_out(self):
        return self.cols

prepro = CustomTransformer()
model = LogisticRegression()
param_grid = {'logisticregression__C': np.logspace(3,-3, num=50)}
This WORKS (n_jobs=1):
drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro, model)
gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=1)
gs.fit(X,y)
This FAILS (n_jobs=-1):
drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro, model)
gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)
This WORKS again (n_jobs=-1 and forced output):
drop = make_column_transformer(('drop', [0]), remainder='passthrough').set_output(transform='pandas')
pipe = make_pipeline(drop, prepro, model)
gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)
Actual Results
AssertionError: Fit failed
Versions
System:
    python: 3.11.1 (main, Dec 7 2022, 08:49:13) [GCC 12.2.0]
    executable: /home/susensio/.local/share/venv/bin/python3
    machine: Linux-6.0.0-6-amd64-x86_64-with-glibc2.36

Python dependencies:
    sklearn: 1.2.0
    pip: 22.3.1
    setuptools: 65.6.3
    numpy: 1.24.1
    scipy: 1.9.3
    Cython: None
    pandas: 1.5.2
    matplotlib: 3.6.2
    joblib: 1.2.0
    threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
    user_api: openmp
    internal_api: openmp
    prefix: libgomp
    filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    version: None
    num_threads: 8

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
    version: 0.3.21
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 8

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
    version: 0.3.18
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 8