I am opening this issue to more clearly document scattered comments I have made about this idea in several issues.
Workflow
For even the simplest dataset (say the Titanic dataset), dealing with missing values and encoding data are essential pre-processing steps. Scikit-learn has a lot of great tools for this, but I am going to focus on `IterativeImputer` for missing data and `ColumnTransformer` (and by extension, any other transformers that can be fit inside of it) for encoding.
Please correct me if I am mistaken, but I believe there is currently no easy way to use these two tools together. Transformers do not work with missing values, and `IterativeImputer` only works with continuous numerical data. Of course, one could use `SimpleImputer` -> `ColumnTransformer`, but then you cannot take advantage of more advanced imputation strategies.
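To make the workaround concrete, here is a minimal sketch of the `SimpleImputer` -> `ColumnTransformer` approach on toy data. The data, column choices, and imputation strategy are illustrative only; note that `most_frequent` is used because it is one of the few `SimpleImputer` strategies that works on mixed-dtype input.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column (Age) and one categorical column (Sex).
X = np.array([[16.0, "M"],
              [56.0, np.nan],
              [np.nan, "F"]], dtype=object)

# Impute everything first, then encode per column type.
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    ColumnTransformer([
        ("num", StandardScaler(), [0]),
        ("cat", OneHotEncoder(), [1]),
    ]),
)
Xt = pipe.fit_transform(X)
print(Xt.shape)  # (3, 3): one scaled column + two one-hot columns
```

This works, but the categorical-aware `most_frequent` strategy is exactly the kind of simple imputation the proposal wants to improve on.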
I would like to have the ability to feed `IterativeImputer` a `ColumnTransformer` object (which may even already be fit) along with a list of estimators (so basically, I am telling it which estimator to use for each column and how to transform the data for those estimators) and have it give me back my data, in its original format, with imputed values. Then of course you can manually pass it through that same `ColumnTransformer` again and move it down the pipeline.
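The "give me back my data in its original format" step relies on an encode/decode round trip. As a sketch, here is that round trip with a bare `OneHotEncoder`, which already supports `inverse_transform`; the proposal would need the same round trip at the `ColumnTransformer` level (the data below is illustrative).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["M"], ["F"], ["F"]], dtype=object)

enc = OneHotEncoder()
Xt = enc.fit_transform(X)           # encoded form, usable by an estimator
X_back = enc.inverse_transform(Xt)  # back to the original representation

print((X_back == X).all())  # True
```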
Proposed solution
Changes to `IterativeImputer`:
- A new parameter called `transformer` that defaults to `None`.
- Making the `estimator` parameter accept an iterable in addition to the single estimator it currently supports.
- Introducing a new step where the `ColumnTransformer` gets applied. This would be between the initial imputation step (using `SimpleImputer`) and the estimator steps.
- Some internal changes to avoid errors from trying to do numerical operations on `object` dtype data. I tested implementing these fixes; this part is trivial.
I think these changes should be backwards compatible.
In terms of `ColumnTransformer`, the two features needed for this to work are:
- `inverse_transform`: work on this seems to be underway in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639, but it seems a bit stalled.
- The ability to select which columns are present when using `transform`: a similar issue was brought up in Using ColumnTransformer when not all specified columns present in data #15781. I think this could even be achieved with the private method `_calculate_inverse_indices` proposed in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639 if it were made public.
Problems with this approach
I can think of two main problems:
- Convergence: there is already concern regarding the convergence of `IterativeImputer`, and I am fairly certain that introducing classifiers would make it worse. A simple fix would be to not support tolerance/convergence-based early termination if one or more classifiers are used as estimators. It is easy to check whether any of the estimators are classifiers. Several parameters related to convergence (e.g. `tol`) would need to raise errors if they are not set to `None` (and maybe the default should be changed?).
- Initial imputation: we would have to restrict the `initial_strategy` parameter to `constant` and `most_frequent` when classification tasks are present.
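The "easy to check" part can be done with `sklearn.base.is_classifier`. Below is a sketch of what the validation might look like; `validate_convergence` and its error message are hypothetical, not part of any existing API.

```python
from sklearn.base import is_classifier
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def validate_convergence(estimators, tol):
    """Hypothetical check: forbid tolerance-based early stopping
    when any per-column estimator is a classifier."""
    if any(is_classifier(est) for est in estimators) and tol is not None:
        raise ValueError(
            "tol must be None when one or more estimators are classifiers"
        )


estimators = [BayesianRidge(), LogisticRegression(), DecisionTreeClassifier()]
print(any(is_classifier(e) for e in estimators))  # True

validate_convergence(estimators, tol=None)  # OK
try:
    validate_convergence(estimators, tol=1e-3)
except ValueError as exc:
    print("rejected:", exc)
```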
Existing work
I tried implementing this proposal. In some sense, I got pretty far: I resolved the errors from trying to apply numerical numpy operations to `object` dtype data, and I got a list of estimators to work. Where I ran into issues was the `ColumnTransformer` problems mentioned above.
Example of desired usage
Given a dataset that looks something like this (this is `X` only):

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | NaN   |
| 56  | NaN | C19   |
| NaN | F   | XYZ   |
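For reference, one way to construct this toy dataset as a pandas DataFrame (column names taken from the table; this is just setup, not part of the proposal):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "Age": [16.0, 56.0, np.nan],
    "Sex": ["M", np.nan, "F"],
    "Cabin": [np.nan, "C19", "XYZ"],
})
print(X.isna().sum().sum())  # 3 missing values, one per row
```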
We have missing continuous numerical data (Age), two-level categorical data (Sex) and multi-level categorical data (Cabin). We want to impute these missing values. The idea would be something as follows:
```python
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = ....  # the data described above

estimators = [
    BayesianRidge(),           # for Age
    LogisticRegression(),      # for Sex
    DecisionTreeClassifier(),  # for Cabin
]
transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
])
imputer = IterativeImputer(
    estimator=estimators,
    transformer=transformer,  # proposed new parameter
)
imputer.fit_transform(X)
```
And we'd get back something like:

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | F32   |
| 56  | F   | C19   |
| 55  | F   | XYZ   |