ColumnTransformers don't honor set_config(transform_output="pandas") when multiprocessing with n_jobs>1 #25239

Closed
@Susensio

Describe the bug

I'm running a grid search with n_jobs=-1 on a pipeline that relies on pandas output, and it fails even though I called set_config(transform_output="pandas").

To make it work, I have to manually call .set_output(transform='pandas') on the ColumnTransformer.
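One possible reading of this behaviour, offered as an assumption and not as a confirmed diagnosis: if a global option is kept in per-thread (or per-process) state, a value set in the main thread is not automatically visible to workers. The stdlib-only sketch below illustrates that general mechanism. The names `_config`, `set_option` and `get_option` are hypothetical stand-ins and are not scikit-learn's internals.

```python
import threading

# Hypothetical stand-in for a library's global configuration store.
_config = threading.local()


def set_option(value):
    # Stores the option on the calling thread only.
    _config.transform_output = value


def get_option():
    # Falls back to "default" when the calling thread never set the option.
    return getattr(_config, "transform_output", "default")


set_option("pandas")  # configured in the main thread

seen = {}


def worker():
    # A fresh thread has its own empty thread-local state.
    seen["worker"] = get_option()


t = threading.Thread(target=worker)
t.start()
t.join()
seen["main"] = get_option()

print(seen)
```

If something similar happens with n_jobs>1, the worker would fall back to the default (NumPy) output, which would be consistent with the assertion failing only in the parallel case.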

Steps/Code to Reproduce

Preparation

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config
set_config(transform_output = "pandas")

# Toy dataframe
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X, y = df.drop(columns='D'), df['D']>0

# Custom transformer that needs dataframe
class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Fit failed"
        self.cols = X.columns
        return self
    
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Transform failed"
        return X.copy()
    
    def get_feature_names_out(self, input_features=None):
        # Accept input_features to follow the scikit-learn convention
        return self.cols


prepro = CustomTransformer()
model = LogisticRegression()

param_grid = {'logisticregression__C': np.logspace(3,-3, num=50)}

This WORKS (n_jobs=1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=1)
gs.fit(X,y)

This FAILS (n_jobs=-1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

This WORKS again (n_jobs=-1 and force output):

drop = make_column_transformer(('drop', [0]), remainder='passthrough').set_output(transform='pandas')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

Actual Results

AssertionError: Fit failed

Versions

System:
    python: 3.11.1 (main, Dec  7 2022, 08:49:13) [GCC 12.2.0]
executable: /home/susensio/.local/share/venv/bin/python3
   machine: Linux-6.0.0-6-amd64-x86_64-with-glibc2.36

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.24.1
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
