Description
I wonder if we could make outlier removal available in pipelines. I tried implementing it, for example using `IsolationForest`, but so far I couldn't get it to work, and I know why. The problem boils down to `fit_transform` only returning a transformed `X`. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). For outlier removal in a pipeline, however, we need to throw away rows of `X` *and* `y` during training, and do nothing during testing. This is not supported so far. Essentially, we would need to turn the `predict` function into some kind of `transform` function during training.
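To make the failure mode concrete, here is a minimal reproduction (the data is made up): `IsolationForest` implements `fit` and `predict` but no `transform`, so the pipeline's step validation rejects it as an intermediate step.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = X.sum(axis=1)

# IsolationForest has no transform method, so Pipeline raises a
# TypeError at fit time: "All intermediate steps should be
# transformers ..."
failed = False
try:
    pipe = Pipeline([
        ("outliers", IsolationForest(random_state=0)),
        ("model", LinearRegression()),
    ])
    pipe.fit(X, y)
except TypeError:
    failed = True

print("pipeline fit failed:", failed)
```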
Investigating the pipeline implementation shows that `fit_transform` is called, if present, during the fitting part of the pipeline, rather than `fit(X, y).transform(X)`. In particular, during cross-validation `fit_transform` is only called on the training data. This would be perfect for outlier removal. It only remains to do nothing in the test step, and for that we can simply implement a "do-nothing" `transform` function.
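A sketch of what such a transformer could look like (the class name `OutlierRemover` and its parameters are hypothetical; note that returning the correspondingly shrunken `y` from `fit_transform` is exactly the part the current API does not support, so this sketch can only drop rows of `X`):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import IsolationForest

class OutlierRemover(BaseEstimator):
    """Hypothetical sketch: drop outlier rows during fitting,
    pass data through unchanged at test time."""

    def __init__(self, contamination=0.01, random_state=None):
        self.contamination = contamination
        self.random_state = random_state

    def fit(self, X, y=None):
        self.detector_ = IsolationForest(
            contamination=self.contamination,
            random_state=self.random_state,
        ).fit(X)
        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        mask = self.detector_.predict(X) == 1  # 1 = inlier, -1 = outlier
        # y[mask] would also be needed here -- that is the
        # unsupported part of the current API.
        return X[mask]

    def transform(self, X):
        # "do-nothing" transform for the test step
        return X

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)),
               [[10.0, 10.0]]])  # one obvious outlier
remover = OutlierRemover(contamination=0.01, random_state=0)
X_clean = remover.fit_transform(X)
print("rows before/after:", X.shape[0], X_clean.shape[0])
```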
Unfortunately, the most direct way to implement this would be an API change to the `TransformerMixin` class.
So my questions are:
Would it be interesting to support outlier removal in pipelines?
Are there other, more suitable ways of implementing this feature in a pipeline?
If the content of this question is somehow inappropriate (e.g. since I'm only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.