[WIP] Add quantile regression (Continuation) #16343


Closed · wants to merge 37 commits

Commits
7988049
Added basic Quantile regression
Oct 4, 2017
fd04aed
Added L1 penalty to loss function
Oct 4, 2017
6284c83
Add tests and plot example for Quantile Regression
Oct 16, 2017
8ed4899
Enable gtol and maxiter
Oct 17, 2017
285b5d7
light refactor
Oct 17, 2017
bb5ce51
Code for approximate smooth quantile loss
Oct 17, 2017
fb07f0e
Fix error in the smooth version
Oct 17, 2017
95492fe
Solve a sequence of smooth problems instead of one non-smooth.
Oct 18, 2017
bdb48ac
Tuned convergence
Oct 18, 2017
8ee950d
Rename approximation threshold, write docstrings
Oct 18, 2017
00a72fb
Enforce zero coefficients
Oct 18, 2017
4a25fad
Fix zero enforcement
Oct 18, 2017
b3ccff6
Add description in user guide
Oct 22, 2017
cb87e63
pep8 line widths
Oct 22, 2017
0ea2317
Merge remote-tracking branch 'upstream/master' into quantile-regression
Oct 22, 2017
b567c82
Mention GradientBoostingRegressor in the docs.
Oct 22, 2017
0686e66
Mention robustness in doc.
Oct 22, 2017
d2a59d8
Remove magic constants for quantile regression
Mar 18, 2018
4dc4eb2
Small edits to .rst on quantile regression
Mar 18, 2018
eca1165
Improve formatting of quantile regression tests
Mar 18, 2018
2524c2b
Change the solver for quantile regression
avidale Mar 18, 2018
04e6165
small refactor of tests of the quantile regression
Mar 18, 2018
4b715d3
Merge branch 'quantile-regression' of https://github.com/avidale/scik…
Mar 18, 2018
52e1aa1
Enable normalization of inputs
Mar 18, 2018
8fa745b
Change convergence warning and max_iter in some tests
Mar 19, 2018
90bc013
A small refactor to the quantile regression example and docs
Mar 19, 2018
06c24c3
Comparison of QuantileRegressor and OLS
Mar 19, 2018
9b0d42c
update test_quantile_warm_start
Mar 19, 2018
5e453aa
Improved the warm_start example, added a toy example for quantile reg…
Mar 19, 2018
463c633
fig the toy test case for quantile regression
Mar 19, 2018
80c8182
Merge branch 'master' into quantile-regression
Oct 7, 2018
50a5105
respond to comments and fix some style
Oct 7, 2018
d96197a
fix the gradient calculation
Oct 7, 2018
08355a4
replace the warnings module with pytest; fix the plotting errors
Oct 7, 2018
0d3081a
Merge remote-tracking branch 'upstream/master' into qreg
DatenBiene Jan 31, 2020
a4259a7
Merge branch 'master' into qreg
DatenBiene Apr 21, 2020
2fca6c6
Merge remote-tracking branch 'upstream/master' into qreg
DatenBiene Apr 21, 2020
3 changes: 2 additions & 1 deletion doc/modules/classes.rst
@@ -796,6 +796,7 @@ or :class:`~sklearn.linear_model.SGDClassifier` with an appropriate penalty.
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV


Bayesian regressors
-------------------

@@ -836,6 +837,7 @@ Any estimator using the Huber loss would also be robust to outliers, e.g.
linear_model.HuberRegressor
linear_model.RANSACRegressor
linear_model.TheilSenRegressor
linear_model.QuantileRegressor

Generalized linear models (GLM) for regression
----------------------------------------------
@@ -851,7 +853,6 @@ than a normal distribution:
linear_model.TweedieRegressor
linear_model.GammaRegressor


Miscellaneous
-------------

68 changes: 68 additions & 0 deletions doc/modules/linear_model.rst
@@ -1411,6 +1411,74 @@ Note that this estimator is different from the R implementation of Robust Regression
squares implementation with weights given to each sample on the basis of how much the residual is
greater than a certain threshold.

.. _quantile_regression:

Quantile Regression
===================

Quantile regression estimates the median or other quantiles of :math:`y`
conditional on :math:`X`, while ordinary least squares (OLS) estimates the
conditional mean.

The :class:`QuantileRegressor` applies a linear loss to all samples. It is
thus more radical than :class:`HuberRegressor`, which applies a linear penalty
only to a small fraction of outlying samples and a quadratic loss to the rest
of the observations. Like :class:`ElasticNet`, :class:`QuantileRegressor` also
supports L1 and L2 regularization. It solves

.. math::
    \underset{w}{\min\,} \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} L_q(y_i - X_i w) + \alpha \rho ||w||_1 + \alpha (1 - \rho) ||w||_2 ^ 2

where

.. math::
    L_q(t) =
    \begin{cases}
        q t, & t > 0, \\
        0, & t = 0, \\
        (q - 1) t, & t < 0
    \end{cases}

and :math:`q \in (0, 1)` is the quantile to be estimated.
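
:math:`L_q` is often called the "pinball" loss. As an illustrative sketch
(not part of this PR), it is straightforward to express in NumPy::

    import numpy as np

    def pinball_loss(residuals, q):
        """Quantile (pinball) loss: q * t for t > 0, (q - 1) * t for t < 0."""
        t = np.asarray(residuals, dtype=float)
        return np.where(t > 0, q * t, (q - 1) * t)

    # Over- and under-prediction are weighted asymmetrically:
    print(pinball_loss([-1.0, 0.0, 1.0], q=0.9))  # [0.1 0.  0.9]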

Quantile regression may be useful if one is interested in predicting an
interval instead of a point prediction. A prediction interval is sometimes
calculated under the assumption that the prediction error is normally
distributed with zero mean and constant variance. Quantile regression
provides sensible prediction intervals even when the errors have non-constant
(but predictable) variance or a non-normal distribution.
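
For instance, a 90% prediction interval can be obtained by fitting two
regressors at the 5th and 95th percentiles (a minimal sketch; the constructor
arguments mirror the example script in this PR)::

    import numpy as np
    from sklearn.linear_model import QuantileRegressor

    rng = np.random.RandomState(0)
    x = rng.uniform(0, 10, size=200)
    X = x[:, np.newaxis]
    y = 2 * x + rng.normal(scale=1 + 0.5 * x)

    # One regressor per interval bound; together they cover ~90% of y given x.
    lower = QuantileRegressor(quantile=0.05, alpha=0, max_iter=10000).fit(X, y)
    upper = QuantileRegressor(quantile=0.95, alpha=0, max_iter=10000).fit(X, y)
    interval = np.column_stack([lower.predict(X), upper.predict(X)])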

.. figure:: /auto_examples/linear_model/images/sphx_glr_plot_quantile_regression_001.png
   :target: ../auto_examples/linear_model/plot_quantile_regression.html
   :align: center
   :scale: 50%

Another possible advantage of quantile regression over OLS is its robustness
to outliers: only the sign of an error influences the estimated coefficients,
not its magnitude.

The quantile loss function can also be used with non-linear models. For
example, :class:`GradientBoostingRegressor` can predict conditional quantiles
if its parameter ``loss`` is set to ``"quantile"`` and its parameter ``alpha``
is set to the quantile that should be predicted. See the example in
:ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`.
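
For illustration, predicting the 90th percentile with gradient boosting (the
dataset and hyperparameters below are arbitrary)::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=200, n_features=4, noise=10.0,
                           random_state=0)
    # loss="quantile" replaces the default squared error with the pinball
    # loss; alpha is the target quantile, here the 90th percentile.
    gbr = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0)
    q90 = gbr.fit(X, y).predict(X)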

Most implementations of quantile regression are based on a linear programming
formulation. Adding L2 regularization makes the problem nonlinear, while the
non-differentiable absolute values make it ill-suited for plain gradient
descent. Instead, the current implementation solves a sequence of smooth
approximate problems, similar to Huber regression, as proposed by Chen and
Wei. Each step uses a finer approximation than the previous one, as sketched
below. Optimization stops when the solutions of two consecutive steps are
almost identical or when the maximal number of iterations is exceeded.
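
The idea of the smoothing can be illustrated as follows (a sketch of the
approximation only, not the exact code in this PR; the threshold ``gamma`` is
an assumed name for the smoothing parameter). The kink of the loss at zero is
replaced by a quadratic segment, and shrinking ``gamma`` recovers the exact
loss::

    import numpy as np

    def smooth_pinball(t, q, gamma):
        """Huber-like smoothing of the quantile (pinball) loss.

        Quadratic for (q - 1) * gamma <= t <= q * gamma, linear outside;
        converges to the exact loss as gamma -> 0.
        """
        t = np.asarray(t, dtype=float)
        return np.where(
            t >= q * gamma,
            q * t - 0.5 * q ** 2 * gamma,
            np.where(
                t <= (q - 1) * gamma,
                (q - 1) * t - 0.5 * (q - 1) ** 2 * gamma,
                t ** 2 / (2 * gamma),
            ),
        )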

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_linear_model_plot_quantile_regression.py`

.. topic:: References:

  * Koenker, R., & Bassett Jr, G. (1978). `Regression quantiles. <http://web.stanford.edu/~doubleh/otherpapers/koenker.pdf>`_
    Econometrica: Journal of the Econometric Society, 33-50.

  * Chen, C., & Wei, Y. (2005). `Computational issues for quantile regression. <http://pdfs.semanticscholar.org/5cf3/f9fe77c423dc394c8766cbdcfb41ea44b7d4.pdf>`_
    Sankhya: The Indian Journal of Statistics, 399-417.


.. _polynomial_regression:

Polynomial regression: extending linear models with basis functions
78 changes: 78 additions & 0 deletions examples/linear_model/plot_quantile_regression.py
@@ -0,0 +1,78 @@
"""
===================
Quantile regression
===================

Plot the prediction of different conditional quantiles.

The left figure shows the case when the error distribution is normal,
but its variance is not constant.

The right figure shows an example of an asymmetric error distribution
(namely, Pareto).
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import QuantileRegressor, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

rng = np.random.RandomState(42)
x = np.linspace(0, 10, 100)
X = x[:, np.newaxis]
y = 20 + x * 2 + rng.normal(loc=0, scale=0.5 + 0.5 * x, size=x.shape[0])
ax1.scatter(x, y)

quantiles = [0.05, 0.5, 0.95]
for quantile in quantiles:
    qr = QuantileRegressor(quantile=quantile, max_iter=10000, alpha=0)
    qr.fit(X, y)
    ax1.plot([0, 10], qr.predict([[0], [10]]))
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Quantiles of normal residuals with non-constant variance')
ax1.legend(quantiles)

y = 20 + x * 0.5 + rng.pareto(10, size=x.shape[0]) * 10
ax2.scatter(x, y)

for quantile in quantiles:
    qr = QuantileRegressor(quantile=quantile, max_iter=10000, alpha=0)
    qr.fit(X, y)
    ax2.plot([0, 10], qr.predict([[0], [10]]))
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Quantiles of asymmetrically distributed residuals')
ax2.legend(quantiles)

plt.show()

#########################################################################
#
# The second part of the code shows that LinearRegression minimizes RMSE,
# while QuantileRegressor minimizes MAE; each performs well on its own
# criterion.

models = [LinearRegression(), QuantileRegressor(alpha=0, max_iter=10000)]
names = ['OLS', 'Quantile']

print('# In-sample performance')
for model_name, model in zip(names, models):
    print(model_name + ':')
    model.fit(X, y)
    mae = mean_absolute_error(model.predict(X), y)
    rmse = np.sqrt(mean_squared_error(model.predict(X), y))
    print('MAE={:.4} RMSE={:.4}'.format(mae, rmse))

print('\n# Cross-validated performance')
for model_name, model in zip(names, models):
    print(model_name + ':')
    mae = -cross_val_score(model, X, y, cv=3,
                           scoring='neg_mean_absolute_error').mean()
    rmse = np.sqrt(-cross_val_score(model, X, y, cv=3,
                                    scoring='neg_mean_squared_error').mean())
    print('MAE={:.4} RMSE={:.4}'.format(mae, rmse))
7 changes: 7 additions & 0 deletions sklearn/linear_model/__init__.py
@@ -30,6 +30,7 @@

from ._ransac import RANSACRegressor
from ._theil_sen import TheilSenRegressor
from .quantile import QuantileRegressor

__all__ = ['ARDRegression',
'BayesianRidge',
@@ -59,6 +60,12 @@
'PassiveAggressiveClassifier',
'PassiveAggressiveRegressor',
'Perceptron',
'QuantileRegressor',
'Ridge',
'RidgeCV',
'RidgeClassifier',