DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397

@ArturoAmorQ

Describe the bug

Evaluating a pipeline that encodes categorical data takes around 8 times longer with v1.1 than with v1.0.2.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

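# Draw everything from the seeded generator so that runs are reproducible.
rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 2

# Two continuous columns (0 and 1).
X = pd.DataFrame(rng.randn(n_samples, n_features))

# Two low-cardinality categorical columns (2 and 3).
X[2] = rng.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
X[3] = rng.choice(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    size=n_samples,
)

# Imbalanced three-class target, independent of the features.
y = rng.choice(
    [0, 1, 2], size=n_samples, p=[0.01, 0.49, 0.5]
)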

preprocessor = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
X_transformed = preprocessor.fit_transform(X)

# Time only the tree fit on the already-encoded data.
t0 = time()
DecisionTreeClassifier().fit(X_transformed, y)
duration = time() - t0
print(f"fit time: {duration:.3f} s")
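
To check whether the slowdown comes from the encoded columns specifically, a rough per-column-group timing along the following lines might help. This is only a sketch: `best_of` and `fit_time_by_column_group` are hypothetical helpers, not part of scikit-learn, and the column indices in the usage note assume that the `ColumnTransformer` output places the ordinal-encoded columns first, followed by the passthrough ones.

```python
from time import perf_counter


def best_of(fn, repeats=5):
    """Return the minimum wall-clock time in seconds over `repeats` calls to fn."""
    timings = []
    for _ in range(repeats):
        t0 = perf_counter()
        fn()
        timings.append(perf_counter() - t0)
    return min(timings)


def fit_time_by_column_group(X, y, groups, repeats=3):
    """Time DecisionTreeClassifier.fit on each named subset of columns.

    `X` is a 2-D NumPy array and `groups` maps a label to a list of
    column indices of `X` (hypothetical helper for this issue).
    """
    # Imported lazily so that best_of stays usable without scikit-learn.
    from sklearn.tree import DecisionTreeClassifier

    return {
        label: best_of(
            # Bind cols as a default argument so each lambda keeps its own subset.
            lambda cols=cols: DecisionTreeClassifier(random_state=0).fit(X[:, cols], y),
            repeats,
        )
        for label, cols in groups.items()
    }
```

With the data from the script above, `fit_time_by_column_group(X_transformed, y, {"encoded": [0, 1], "continuous": [2, 3]})` would give one timing per group. Running it under both v1.0.2 and v1.1 should show which kind of column accounts for the difference.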

Expected Results

~450 ms (fit time with v1.0.2)

Actual Results

~3 s (fit time with v1.1.0)

Versions

System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32)  [GCC 9.3.0]
executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
   machine: Linux-5.14.0-1036-oem-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.1.3
   setuptools: 49.6.0.post20210108
        numpy: 1.21.0
        scipy: 1.7.0
       Cython: None
       pandas: 1.3.0
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/libopenblasp-r0.3.15.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.15
    num_threads: 8
threading_layer: pthreads
