Mixed-type imputation for IterativeImputer #17087

@adriangb

Description

I am opening this issue to more clearly document scattered comments I have made about this idea in several issues.

Workflow
For even the simplest dataset (say the Titanic dataset), dealing with missing values and encoding data are essential pre-processing steps. Scikit-learn has a lot of great tools for this, but I am going to focus on IterativeImputer for missing data and ColumnTransformer (and by extension, any other transformers that can be fit inside of it) for encoding.

Please correct me if I am mistaken, but I believe there is currently no easy way to use these two tools together. Most transformers do not accept missing values, and IterativeImputer only works with continuous numerical data. Of course one could use SimpleImputer -> ColumnTransformer, but then you cannot take advantage of more advanced imputation strategies.
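
For reference, here is a minimal sketch of that workaround (SimpleImputer inside a ColumnTransformer); the column indices and strategies are placeholders for illustration:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Each column type gets only a basic (non-iterative) imputation strategy
preprocess = ColumnTransformer(transformers=[
    ('num', make_pipeline(SimpleImputer(strategy='mean'),
                          StandardScaler()), [0]),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder()), [1, 2]),
])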

I would like the ability to feed IterativeImputer a ColumnTransformer object (possibly one that is already fit) along with a list of estimators (so basically, I am specifying which estimator to use for each column and how to transform the data for those estimators) and have it give me back my data, in its original format, with imputed values. Then of course you can manually pass it through that same ColumnTransformer again and move it down the pipeline.

Proposed solution
Changes to IterativeImputer:

  1. A new parameter called transformer that defaults to None.
  2. Making the estimator parameter accept an iterable in addition to the single estimator it currently supports.
  3. Introduce a new step where ColumnTransformer gets applied. This would be between the initial imputation step (using SimpleImputer) and the estimator steps.
  4. Some internal changes to avoid errors from trying to do numerical operations on object dtype data. I tested implementing these fixes; this part is trivial (a short illustration follows after this list).

I think these changes should be backwards compatible.
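
As a quick illustration of point 4, this is the kind of failure those internal changes need to handle (toy data, runnable as-is):

import numpy as np

# Mixed-type data stored as object dtype, as pandas would hand it over
X = np.array([[16.0, 'M'], [56.0, None]], dtype=object)

np.nanmean(X[:, 0].astype(float))  # fine: the numeric column converts
# np.nanmean(X[:, 1])  # TypeError: numeric reductions fail on strings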

In terms of ColumnTransformer, the two needed features for this to work are:

  1. inverse_transform: work on this seems to be underway in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639, but it seems to have stalled.
  2. Ability to select which columns are present when using transform (sketched below): a similar issue was brought up in Using ColumnTransformer when not all specified columns present in data #15781. I think this could even be achieved with the private method _calculate_inverse_indices proposed in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639 if it were made public.
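
To make these two requirements concrete, here is a hypothetical sketch of how IterativeImputer would use them, assuming a fitted ColumnTransformer named transformer and an imputed matrix X_filled. Neither call exists in scikit-learn today, and the columns keyword is invented purely for illustration:

# Hypothetical: both calls below are proposed, not existing, APIs
Xt = transformer.transform(X_filled)            # encode all columns
X_restored = transformer.inverse_transform(Xt)  # feature 1: round-trip

# Feature 2: encode only a subset of the columns, e.g. every feature
# except the one currently being imputed ('columns' is a made-up name)
Xt_subset = transformer.transform(X_filled, columns=[0, 2])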

Problems with this approach
I can think of two main problems:

  1. Convergence: there is already concern regarding the convergence of IterativeImputer, and I am fairly certain that introducing classifiers would make it worse. A simple fix would be to not support tolerance/convergence-based early termination if one or more of the estimators is a classifier. It is easy to check whether any of the estimators are classifiers (see the sketch after this list). Several parameters related to convergence (e.g. tol) would need to raise errors if they are not set to None (and maybe the default should be changed?).
  2. Initial imputation: we would have to restrict the initial_strategy parameter to constant or most_frequent when classification tasks are present.
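
The classifier check itself is a one-liner with the existing sklearn.base.is_classifier helper (this API exists today; the estimators list is just an example):

from sklearn.base import is_classifier
from sklearn.linear_model import BayesianRidge, LogisticRegression

estimators = [BayesianRidge(), LogisticRegression()]

# If any per-column estimator is a classifier, disable the
# tolerance-based stopping rule and always run max_iter rounds
use_tolerance = not any(is_classifier(est) for est in estimators)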

Existing work
I tried implementing this proposal and got pretty far: I resolved the errors from applying numerical NumPy operations to object dtype data and got a list of estimators working. Where I ran into issues was with the ColumnTransformer limitations mentioned above.

Example of desired usage
Given a dataset that looks something like this (this is X only):

Age   Sex   Cabin
16    M     NaN
56    NaN   C19
NaN   F     XYZ

We have missing continuous numerical data (Age), two-level categorical data (Sex) and multi-level categorical data (Cabin). We want to impute these missing values. The idea would be something as follows:

from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = ....  # the data described above

# One estimator per column, matched to each column's type
estimators = [
    BayesianRidge(),  # for Age (regression)
    LogisticRegression(),  # for Sex (binary classification)
    DecisionTreeClassifier(),  # for Cabin (multiclass classification)
]

transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
])

imputer = IterativeImputer(
    estimator=estimators,      # proposed: iterable of estimators
    transformer=transformer,   # proposed new parameter
)

imputer.fit_transform(X)

And we'd get back something like:

Age   Sex   Cabin
16    M     F32
56    F     C19
55    F     XYZ
