Closed
Labels
Enhancement, Moderate, Performance
Description
Describe the bug
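SimpleImputer is extremely slow on string/object columns when using strategy="most_frequent": in the example below, fitting one string column of 1,000,000 rows takes about 29.7 s, versus 43 ms for a numeric column of the same length.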
Steps/Code to Reproduce
import sys; print('python', sys.version)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='most_frequent')
n = 1_000_000
df = pd.DataFrame(
    dict(a_string_column=np.random.choice(['a', 'b'], n),
         a_numeric_column=np.random.choice([1, 2], n))
)
print('np.unique:')
%time np.unique(df.a_string_column, return_counts=True)
print('\nsimple imputer numeric:')
%time si.fit(df[['a_numeric_column']])
print('\nsimple imputer string/object:')
%time si.fit(df[['a_string_column']])
Output
np.unique:
Wall time: 766 ms
simple imputer numeric:
Wall time: 43 ms
simple imputer string/object:
Wall time: 29.7 s
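Possible workaround (a sketch only, not a description of what SimpleImputer does internally): the most frequent value of an object column can be found in a single counting pass with the standard library. The helper name most_frequent_value is hypothetical, and breaking ties by taking the smallest value is an assumption, not confirmed behaviour.

```python
from collections import Counter
import math


def most_frequent_value(values):
    """Hypothetical helper: most frequent non-missing value, found in one pass."""

    def is_missing(v):
        # Treat None and float NaN as missing; other markers are not handled.
        return v is None or (isinstance(v, float) and math.isnan(v))

    counts = Counter(v for v in values if not is_missing(v))
    if not counts:
        return None  # every entry was missing
    top = max(counts.values())
    # Assumption: ties are broken by taking the smallest value.
    return min(v for v, c in counts.items() if c == top)


column = ["a", "b", "b", None, "a", "b", float("nan")]
print(most_frequent_value(column))  # prints: b
```

The value found this way could then be passed to SimpleImputer(strategy="constant", fill_value=...), which avoids the slow "most_frequent" path on object columns. I have not timed this on the 1,000,000-row example above.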
Versions
System:
python: 3.7.9
machine: Windows-10
Python dependencies:
pip: 20.2.4
setuptools: 50.3.0.post20201006
sklearn: 0.23.2
numpy: 1.19.4
scipy: 1.4.1
Cython: None
pandas: 1.1.3
matplotlib: 3.3.1
joblib: 0.17.0
threadpoolctl: 2.1.0