Feature Request: Pipelining Outlier Removal #9630

Open
@datajanko

I wonder if we could make outlier removal available in pipelines.

I tried implementing it, for example using IsolationForest, but so far I couldn't solve it, and I know why.

The problem boils down to fit_transform returning only a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of PCA). With outlier removal in a pipeline, however, we need to throw away rows of both X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.
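To make the requirement concrete, here is a minimal sketch of the row filtering done by hand, outside any pipeline. The toy data and the contamination value are assumptions chosen purely for illustration; the point is that X and y have to be filtered with the same mask.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (made up for illustration): 200 samples, the first 5 shifted far away.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
X[:5] += 10.0
y = rng.randint(0, 2, size=200)

# fit_predict labels inliers as +1 and outliers as -1.
detector = IsolationForest(contamination=0.05, random_state=0)
inlier_mask = detector.fit_predict(X) == 1

# Rows must be dropped from X and y together; a transformer's
# fit_transform can only hand back the filtered X, not the filtered y.
X_clean, y_clean = X[inlier_mask], y[inlier_mask]
print(X.shape, "->", X_clean.shape, "| y:", y.shape, "->", y_clean.shape)
```

Done manually like this, the filtering works, but it has to be repeated for every training fold, which is exactly what a pipeline step would take care of.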

Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, during cross-validation fit_transform is only called on the training folds. This would be perfect for outlier removal. What remains is to do nothing in the test step, and to this end we can simply implement a "do-nothing" transform function.
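The idea can be sketched roughly as follows. OutlierDroppingTransformer is a hypothetical class written for this illustration, not part of scikit-learn, and the toy data and contamination value are assumptions. It drops rows in fit_transform and passes data through unchanged in transform. The last part shows where the approach runs into the limitation described above: the next step receives a shortened X but the original y.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class OutlierDroppingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops outlier rows only in fit_transform."""

    def __init__(self, contamination=0.05, random_state=0):
        self.contamination = contamination
        self.random_state = random_state

    def fit(self, X, y=None):
        self.detector_ = IsolationForest(
            contamination=self.contamination, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        # "Do-nothing" transform used at test time.
        return X

    def fit_transform(self, X, y=None, **fit_params):
        # Used by Pipeline.fit; only X can be returned, so y stays untouched.
        self.fit(X, y)
        return X[self.detector_.predict(X) == 1]


# Toy data (made up for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
X[:5] += 10.0
y = rng.randint(0, 2, size=200)

transformer = OutlierDroppingTransformer()
Xt_train = transformer.fit_transform(X, y)  # fewer rows than X
Xt_test = transformer.transform(X)          # same rows as X

pipe = Pipeline([("drop", OutlierDroppingTransformer()),
                 ("clf", LogisticRegression())])
fit_failed = False
try:
    pipe.fit(X, y)
except ValueError as exc:
    # The classifier sees a shortened X but the full-length y.
    fit_failed = True
    print("Pipeline.fit failed:", exc)
```

The transformer half behaves as intended, but the pipeline as a whole cannot be fitted because there is no channel for passing the filtered y on to the next step.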

Unfortunately, the most direct way to implement this would be an API change to the TransformerMixin class.

So my questions are:

Would it be interesting to support outlier removal in pipelines?
Are there other, more suitable ways of implementing this feature in a pipeline?

If the content of this question is somehow inappropriate (e.g. since I'm only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.
