Description
Pipeline should provide a method to apply its transformations to an arbitrary dataset without applying the final classifier step.
Use case:
Boosted tree models like XGBoost and LightGBM use a validation set for early stopping. We can trivially apply the pipeline to the train and test sets via fit and predict, but not to the validation set.
After raising the issue and proposing two ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe it should be handled at the Scikit-learn level.
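For reference, a minimal sketch of the workaround available today: iterate over Pipeline.steps and apply every step except the final estimator to the validation set by hand. The pipeline and data below are illustrative, not from the issue.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# Apply every step except the last (the classifier) to the validation set.
X_val_t = X_val
for name, step in pipe.steps[:-1]:
    X_val_t = step.transform(X_val_t)
```

This works, but it duplicates Pipeline internals in user code, which is exactly what a built-in method would avoid.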
Idea 1: have a dummy transform method in XGBClassifier and LGBMClassifier
The transform method for pipelines/classifiers is already extremely inconsistent:
- Failure because the classifier step does not implement transform
- Deprecated feature-importance extraction for tree ensembles
- Proposal for NN features: transform method in MLPClassifier #8291
- Proposal for decision paths: transform method of tree ensembles should return the decision_path #7907
Furthermore, the issue will pop up again if the last classifier is an ensemble of multiple models.
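As a sketch of Idea 1, a hypothetical wrapper (IdentityTransformClassifier is an illustrative name, not an existing scikit-learn, XGBoost, or LightGBM API) that gives any classifier a no-op transform, so Pipeline.transform applies only the preprocessing steps:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


class IdentityTransformClassifier(BaseEstimator, ClassifierMixin):
    """Wrap any classifier and add a dummy, identity transform."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def transform(self, X):
        # Dummy transform: return the input untouched, so that
        # Pipeline.transform effectively stops before the classifier.
        return X


X = np.random.RandomState(0).randn(20, 3)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", IdentityTransformClassifier(LogisticRegression())),
])
pipe.fit(X, y)
X_t = pipe.transform(X)  # scaled features; the classifier step is a no-op
```

This makes the validation-set use case work, but it also illustrates the inconsistency above: transform on a classifier now means "do nothing" instead of one of the competing proposals.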
Idea 2: implement a validation_split parameter for early stopping
Early stopping in KerasClassifier is controlled by a validation_split parameter.
At first I thought that could be used in XGBClassifier, LGBMClassifier, and everything else that needs a validation set for early stopping.
The issue here is that there is no control over the validation set and split. Furthermore, if there is a need to inspect validation issues more deeply, I suppose it would be non-trivial to extract the validation set from the classifier or to provide an API for it.
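A hypothetical sketch of Idea 2: a Keras-style validation_split carved out inside fit. Note that the caller has no say in which rows become the validation set, which is exactly the drawback described above. All names here are illustrative, not an existing API.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier


class ValidationSplitClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical estimator that splits off a validation set in fit."""

    def __init__(self, estimator, validation_split=0.1):
        self.estimator = estimator
        self.validation_split = validation_split

    def fit(self, X, y):
        # The split happens inside fit: the caller cannot choose or
        # inspect the validation rows -- the drawback noted above.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.validation_split,
            stratify=y, random_state=0,
        )
        self.estimator.fit(X_tr, y_tr)
        # Record validation accuracy; a real early-stopping implementation
        # would evaluate after each boosting round instead.
        self.validation_score_ = self.estimator.score(X_val, y_val)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


X = np.random.RandomState(0).randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = ValidationSplitClassifier(SGDClassifier(random_state=0),
                                validation_split=0.2)
clf.fit(X, y)
```

The validation set exists only transiently inside fit, so exposing it for deeper inspection would indeed require extra API surface.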
Hence I think Scikit-learn needs a method, or a parameter in transform, to ignore the last step or the last n steps.
If needed, I can raise a separate issue on having a consistent transform method for classifiers and keep this one focused on applying transform, without classification, to arbitrary data.