Description
My team and I are working on an application of regression by classification, a technique described in this article.
In a nutshell, regression by classification means approaching a regression problem with multi-class classification algorithms. The key step in this technique is to perform discretization, or binning, of the (continuous) target prior to classification. The article mentions three approaches to target discretization, all of which are supported by sklearn's KBinsDiscretizer (see the short example after the list):
- Equally probable intervals (the quantile strategy of KBinsDiscretizer)
- Equal width intervals (the uniform strategy of KBinsDiscretizer)
- K-means clustering (the kmeans strategy of KBinsDiscretizer)
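For illustration, here is a minimal sketch of the three strategies applied to a made-up continuous target (the reshape is needed because KBinsDiscretizer expects a 2-D array):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
y = rng.exponential(size=1000)  # a skewed, continuous target

for strategy in ("quantile", "uniform", "kmeans"):
    binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    # KBinsDiscretizer expects 2-D input, so bin y as a single column.
    y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)
    print(strategy, np.bincount(y_binned))
```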
In regression by classification, the choice of the number of classes, i.e. the n_bins parameter, is critical. One straightforward way to tune this parameter, and to choose the binning strategy, is cross-validation. But because transformations on y (see #4143) are currently forbidden in scikit-learn, this is not "natively" supported.
We found a way around this by creating our own meta-estimator, as suggested by @jnothman elsewhere (sketched below). But one problem remained: how can we tell scikit-learn to compute evaluation metrics on the BINNED targets, and not the original CONTINUOUS targets?
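For context, a rough sketch of what such a meta-estimator might look like (the names BinnedRegressor and get_transformed_targets are our own, not scikit-learn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.preprocessing import KBinsDiscretizer

class BinnedRegressor(BaseEstimator, RegressorMixin):
    """Sketch of a meta-estimator: bin y, then fit a classifier."""

    def __init__(self, classifier=None, n_bins=5, strategy="quantile"):
        self.classifier = classifier
        self.n_bins = n_bins
        self.strategy = strategy

    def fit(self, X, y):
        # Discretize the continuous target into ordinal class labels.
        self.binner_ = KBinsDiscretizer(n_bins=self.n_bins,
                                        encode="ordinal",
                                        strategy=self.strategy)
        y_binned = self.binner_.fit_transform(
            np.asarray(y).reshape(-1, 1)).ravel().astype(int)
        self.classifier_ = clone(self.classifier).fit(X, y_binned)
        return self

    def predict(self, X):
        return self.classifier_.predict(X)

    def get_transformed_targets(self, X, y):
        # The hook our scorer hack (below) looks for: map the original
        # continuous targets to their bins.
        return self.binner_.transform(
            np.asarray(y).reshape(-1, 1)).ravel().astype(int)
```

Because n_bins and strategy are plain constructor parameters, they are exposed through get_params/set_params and can be searched over with the usual model-selection tools.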
We achieved this by hacking the _PredictScorer class on our scikit-learn fork. The hack looks for a custom method called get_transformed_targets on our home-brewed meta-estimator; if it is present, the score is computed on the transformed (binned) targets. Here is the hack:
```python
class _PredictScorer(_BaseScorer):
    def _score(self, method_caller, estimator, X, y_true, sample_weight=None):
        """[... docstring ...]"""
        # Here starts the hack: if the estimator exposes
        # get_transformed_targets, score against the binned targets.
        if hasattr(estimator, 'get_transformed_targets'):
            y_true = estimator.get_transformed_targets(X, y_true)
        # Here ends the hack
        y_pred = method_caller(estimator, "predict", X)
        if sample_weight is not None:
            return self._sign * self._score_func(y_true, y_pred,
                                                 sample_weight=sample_weight,
                                                 **self._kwargs)
        else:
            return self._sign * self._score_func(y_true, y_pred,
                                                 **self._kwargs)
```
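With this hack in place on our fork, a plain classification scorer can then drive the tuning described above. Hypothetical usage with the BinnedRegressor sketch (on stock scikit-learn this would score against the continuous y and fail its purpose):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 'accuracy' resolves to a _PredictScorer, so on our fork the score is
# computed against the binned targets from get_transformed_targets.
search = GridSearchCV(
    BinnedRegressor(classifier=DecisionTreeClassifier()),
    param_grid={"n_bins": [3, 5, 10],
                "strategy": ["quantile", "uniform", "kmeans"]},
    scoring="accuracy",
    cv=5)
search.fit(X, y)  # X, y: the original features and continuous target
```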
Another problem we encountered is using the KBinsDiscretizer class on targets, since it is designed for 2-D feature matrices rather than a 1-D y. We plan on handling this with a custom meta-transformer (sketched below).
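Something along these lines, perhaps (a sketch only; the adapter simply reshapes y to a single column, and note the non-standard fit signature that takes y instead of X):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import KBinsDiscretizer

class TargetBinner(BaseEstimator, TransformerMixin):
    """Sketch of a meta-transformer that bins a 1-D target."""

    def __init__(self, n_bins=5, strategy="quantile"):
        self.n_bins = n_bins
        self.strategy = strategy

    def fit(self, y):
        # Fit the discretizer on the target reshaped to a column.
        self.binner_ = KBinsDiscretizer(n_bins=self.n_bins,
                                        encode="ordinal",
                                        strategy=self.strategy)
        self.binner_.fit(np.asarray(y).reshape(-1, 1))
        return self

    def transform(self, y):
        # Return the binned target as a 1-D integer array.
        return self.binner_.transform(
            np.asarray(y).reshape(-1, 1)).ravel().astype(int)
```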
It would be nice if regression by classification were supported by scikit-learn out of the box. Perhaps the resampling options coming soon will make this possible, but that will have to be tested.