SimpleImputer performance on strings  #18978

@mayer79

Describe the bug

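SimpleImputer is extremely slow on string/object columns when fitted with strategy="most_frequent". On the same data, np.unique with return_counts=True finishes in under a second, and the numeric column is fitted in milliseconds.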

Steps/Code to Reproduce

import sys; print('python', sys.version)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy='most_frequent')

n = 1_000_000
df = pd.DataFrame(
    dict(a_string_column = np.random.choice(['a', 'b'], n),
         a_numeric_column = np.random.choice([1, 2], n))
)

print('np.unique:')
%time np.unique(df.a_string_column, return_counts=True)

print('\nsimple imputer numeric:')
%time si.fit(df[['a_numeric_column']])

print('\nsimple imputer string/object:')
%time si.fit(df[['a_string_column']])

Output

np.unique:
Wall time: 766 ms

simple imputer numeric:
Wall time: 43 ms

simple imputer string/object:
Wall time: 29.7 s
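
Since the timings above suggest that counting values directly is much cheaper than the string/object path of SimpleImputer, one possible workaround is to compute the most frequent value per column with pandas and fill missing entries outside the imputer. The sketch below is not the library's fix: `most_frequent_values` is a hypothetical helper introduced here for illustration, and breaking ties by taking the smallest value is an assumption made for determinism.

```python
import numpy as np
import pandas as pd


def most_frequent_values(df):
    """Return {column: most frequent non-missing value} for each column.

    Hypothetical helper, not part of scikit-learn. Ties are broken by
    taking the smallest of the tied values (an assumption of this sketch).
    Columns that contain only missing values are skipped.
    """
    fill = {}
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)
        if counts.empty:
            continue
        tied = counts[counts == counts.max()].index
        fill[col] = min(tied)
    return fill


# Small illustrative frame reusing the column names from the reproducer.
demo = pd.DataFrame(
    {
        "a_string_column": ["a", "b", "b", None, "b"],
        "a_numeric_column": [1.0, 2.0, np.nan, 1.0, 1.0],
    }
)

fill_values = most_frequent_values(demo)
imputed = demo.fillna(value=fill_values)
```

For a single column, the computed value could presumably also be passed as `fill_value` to `SimpleImputer(strategy="constant")` to stay inside a scikit-learn pipeline; whether that avoids the slowdown on a given version would need to be timed.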

Versions

System:
    python: 3.7.9
   machine: Windows-10

Python dependencies:
          pip: 20.2.4
   setuptools: 50.3.0.post20201006
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.4.1
       Cython: None
       pandas: 1.1.3
   matplotlib: 3.3.1
       joblib: 0.17.0
threadpoolctl: 2.1.0

Labels

Enhancement, Moderate, Performance
