Pipeline: apply all transformations except the last classifier #8414

@mratsim

Description

Pipeline should provide a method to apply its transformations to an arbitrary dataset without applying the final classifier step.

Use case:

Boosted tree models like XGBoost and LightGBM use a validation set for early stopping.
We can trivially apply the pipeline to the train and test sets via fit and predict, but not to the validation set.


After raising the issue and proposing two ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe it should be handled at the Scikit-learn level.

Idea 1, have a dummy transform method in XGBClassifier and LGBMClassifier

The transform method for pipeline/classifier is already extremely inconsistent.

Idea 2, Implement a validation_split parameter for early stopping

Early stopping in KerasClassifier is controlled by a validation_split parameter.
At first I thought the same approach could be used in XGBClassifier, LGBMClassifier, and any other estimator that needs a validation set for early stopping.
The issue is that it gives no control over the validation set or how it is split. Furthermore, if there is a need to dig deeper into validation issues, I suppose it would be non-trivial to extract the validation set from the classifier or to provide an API for it.


Hence I think Scikit-learn needs a method, or a parameter in transform, to ignore the last step or the last n steps.
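
To make the request concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It is not Scikit-learn code: SketchPipeline, fit_transformers, transform_except_last, the n_last parameter and the toy steps are all hypothetical names invented for illustration, and ThresholdClassifier only stands in for an estimator such as XGBClassifier or LGBMClassifier that accepts an eval_set when fitting.

```python
class MeanCenter:
    """Toy transformer: subtracts the per-column mean learned in fit."""
    def fit(self, X, y=None):
        self.means_ = [sum(col) / len(X) for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class Scale:
    """Toy transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v * self.factor for v in row] for row in X]


class ThresholdClassifier:
    """Toy final step (hypothetical stand-in for a boosted-tree classifier)."""
    def fit(self, X, y, eval_set=None):
        # A real booster would monitor eval_set for early stopping;
        # here it is only stored so the example can inspect it.
        self.eval_set_ = eval_set
        return self

    def predict(self, X):
        return [int(row[0] > 0) for row in X]


class SketchPipeline:
    """Hypothetical pipeline exposing the requested method."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transformers(self, X, y=None):
        # Fit every step except the final estimator, in order,
        # and return the transformed training data.
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        return X

    def transform_except_last(self, X, n_last=1):
        # Apply the already-fitted steps, skipping the last n_last ones.
        for step in self.steps[:len(self.steps) - n_last]:
            X = step.transform(X)
        return X


X_train, y_train = [[1.0, 10.0], [3.0, 30.0]], [0, 1]
X_val, y_val = [[2.0, 20.0], [4.0, 40.0]], [0, 1]

pipe = SketchPipeline([MeanCenter(), Scale(2.0), ThresholdClassifier()])
X_train_t = pipe.fit_transformers(X_train, y_train)

# The validation set goes through the same fitted preprocessing,
# but never reaches the classifier step.
X_val_t = pipe.transform_except_last(X_val)

clf = pipe.steps[-1]
clf.fit(X_train_t, y_train, eval_set=[(X_val_t, y_val)])
```

With something like this, the validation data passed for early stopping is guaranteed to go through exactly the same preprocessing as the training data, and n_last covers the "last n steps" variant.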

If needed, I can raise a related issue on having a consistent transform method for classifiers, and keep this one focused on applying the transformations, without classification, to arbitrary data.
