I am opening this issue to more clearly document scattered comments I have made about this idea in several issues.
Workflow
For even the simplest dataset (say the Titanic dataset), dealing with missing values and encoding data are essential pre-processing steps. Scikit-learn has a lot of great tools for this, but I am going to focus on `IterativeImputer` for missing data and `ColumnTransformer` (and by extension, any other transformers that can be fit inside of it) for encoding.
Please correct me if I am mistaken, but I believe there is currently no easy way to use these two tools together. Transformers do not work with missing values, and `IterativeImputer` only works with continuous numerical data. Of course, one could use `SimpleImputer` -> `ColumnTransformer`, but then you cannot take advantage of more advanced imputation strategies.
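To make the workaround concrete, here is a minimal sketch of the `SimpleImputer` -> `ColumnTransformer` approach on toy data. The data, column choices, and imputation strategy are illustrative only; note that `most_frequent` is used because it is one of the few `SimpleImputer` strategies that works on mixed-dtype input.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column (Age) and one categorical column (Sex).
X = np.array([[16.0, "M"],
              [56.0, np.nan],
              [np.nan, "F"]], dtype=object)

# Impute everything first, then encode per column type.
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    ColumnTransformer([
        ("num", StandardScaler(), [0]),
        ("cat", OneHotEncoder(), [1]),
    ]),
)
Xt = pipe.fit_transform(X)
print(Xt.shape)  # (3, 3): one scaled column + two one-hot columns
```

This works, but the categorical-aware `most_frequent` strategy is exactly the kind of simple imputation the proposal wants to improve on.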
I would like to have the ability to feed `IterativeImputer` a `ColumnTransformer` object (which may even already be fit) along with a list of estimators (so basically, I am telling it which estimator to use for each column and how to transform the data for those estimators) and have it give me back my data, in its original format, with imputed values. Then of course you can manually pass it through that same `ColumnTransformer` again and move it down the pipeline.
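The "give me back my data in its original format" step relies on an encode/decode round trip. As a sketch, here is that round trip with a bare `OneHotEncoder`, which already supports `inverse_transform`; the proposal would need the same round trip at the `ColumnTransformer` level (the data below is illustrative).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["M"], ["F"], ["F"]], dtype=object)

enc = OneHotEncoder()
Xt = enc.fit_transform(X)           # encoded form, usable by an estimator
X_back = enc.inverse_transform(Xt)  # back to the original representation

print((X_back == X).all())  # True
```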
Proposed solution
Changes to `IterativeImputer`:
- A new parameter called `transformer` that defaults to `None`.
- Making the `estimator` parameter accept an iterable in addition to the single estimator it currently supports.
- Introducing a new step where the `ColumnTransformer` gets applied. This would be between the initial imputation step (using `SimpleImputer`) and the estimator steps.
- Some internal changes to avoid errors from trying to do numerical operations on `object` dtype data. I tested implementing these fixes; this part is trivial.
I think these changes should be backwards compatible.
In terms of `ColumnTransformer`, the two features needed for this to work are:
- `inverse_transform`: work on this seems to be underway in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639, but it seems a bit stalled.
- The ability to select which columns are present when using `transform`: a similar issue was brought up in Using ColumnTransformer when not all specified columns present in data #15781. I think this could even be achieved with the private method `_calculate_inverse_indices` proposed in [MRG] ENH: Adds inverse_transform to ColumnTransformer #11639 if it were made public.
Problems with this approach
I can think of two main problems:
- Convergence: there is already concern regarding the convergence of `IterativeImputer`, and I am fairly certain that introducing classifiers would make it worse. A simple fix would be to not support tolerance/convergence-based early termination if one or more classifiers are used as estimators. It is easy to check whether any of the estimators are classifiers. Several parameters related to convergence (e.g. `tol`) would need to raise errors if they are not set to `None` (and maybe the default should be changed?).
- Initial imputation: we would have to restrict the `initial_strategy` parameter to `constant` and `most_frequent` when classification tasks are present.
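The "easy to check" part can be done with `sklearn.base.is_classifier`. Below is a sketch of what the validation might look like; `validate_convergence` and its error message are hypothetical, not part of any existing API.

```python
from sklearn.base import is_classifier
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def validate_convergence(estimators, tol):
    """Hypothetical check: forbid tolerance-based early stopping
    when any per-column estimator is a classifier."""
    if any(is_classifier(est) for est in estimators) and tol is not None:
        raise ValueError(
            "tol must be None when one or more estimators are classifiers"
        )


estimators = [BayesianRidge(), LogisticRegression(), DecisionTreeClassifier()]
print(any(is_classifier(e) for e in estimators))  # True

validate_convergence(estimators, tol=None)  # OK
try:
    validate_convergence(estimators, tol=1e-3)
except ValueError as exc:
    print("rejected:", exc)
```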
Existing work
I tried implementing this proposal. In some sense, I got pretty far: I resolved the errors from trying to apply numerical numpy operations to `object` dtype data, and I got a list of estimators to work. Where I ran into issues was the `ColumnTransformer` problems mentioned above.
Example of desired usage
Given a dataset that looks something like this (this is `X` only):

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | NaN   |
| 56  | NaN | C19   |
| NaN | F   | XYZ   |
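For reference, one way to construct this toy dataset as a pandas DataFrame (column names taken from the table; this is just setup, not part of the proposal):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "Age": [16.0, 56.0, np.nan],
    "Sex": ["M", np.nan, "F"],
    "Cabin": [np.nan, "C19", "XYZ"],
})
print(X.isna().sum().sum())  # 3 missing values, one per row
```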
We have missing continuous numerical data (Age), two-level categorical data (Sex) and multi-level categorical data (Cabin). We want to impute these missing values. The idea would be something as follows:
```python
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = ....  # the data described above

estimators = [
    BayesianRidge(),           # for Age
    LogisticRegression(),      # for Sex
    DecisionTreeClassifier(),  # for Cabin
]
transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
])
imputer = IterativeImputer(
    estimator=estimators,
    transformer=transformer,  # proposed new parameter
)
imputer.fit_transform(X)
```
And we'd get back something like:

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | F32   |
| 56  | F   | C19   |
| 55  | F   | XYZ   |