scikit-learn · ogrisel · Nov 20, 2014 · Jan 14, 2014 · Jan 17, 2014 · Jan 27, 2014
diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst
@@ -659,6 +659,7 @@ From text
    linear_model.RidgeCV
    linear_model.SGDClassifier
    linear_model.SGDRegressor
+   linear_model.TheilSenRegressor
 
 .. autosummary::
    :toctree: generated/

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
@@ -789,13 +789,90 @@ For classification, :class:`PassiveAggressiveClassifier` can be used with
    <http://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf>`_
    K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006)
 
-Robustness to outliers: RANSAC
-==============================
 
-RANSAC (RANdom SAmple Consensus) is an iterative algorithm for the robust
-estimation of parameters from a subset of inliers from the complete data set.
+Robust regression: outliers and modeling errors
+===============================================
+
+Robust regression aims to fit a regression model in the presence of
+corrupt data: either outliers or errors in the model.
+
+.. figure:: ../auto_examples/linear_model/images/plot_theilsen_001.png
+   :target: ../auto_examples/linear_model/plot_theilsen.html
+   :scale: 50%
+   :align: center
+
+Different scenarios and useful concepts
+----------------------------------------
+
+There are different things to keep in mind when dealing with data
+corrupted by outliers:
+
+.. |y_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_003.png
+   :target: ../auto_examples/linear_model/plot_robust_fit.html
+   :scale: 60%
+
+.. |X_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_002.png
+   :target: ../auto_examples/linear_model/plot_robust_fit.html
+   :scale: 60%
+
+.. |large_y_outliers| image:: ../auto_examples/linear_model/images/plot_robust_fit_005.png
+   :target: ../auto_examples/linear_model/plot_robust_fit.html
+   :scale: 60%
+
+* **Outliers in X or in y**?
+
+  ==================================== ====================================
+  Outliers in the y direction          Outliers in the X direction
+  ==================================== ====================================
+  |y_outliers|                         |X_outliers|
+  ==================================== ====================================
+
+* **Fraction of outliers versus amplitude of error**
+
+  Both the number of outlying points and how far they deviate from the
+  inliers matter.
+
+  ==================================== ====================================
+  Small outliers                       Large outliers
+  ==================================== ====================================
+  |y_outliers|                         |large_y_outliers|
+  ==================================== ====================================
+
+An important notion in robust fitting is the breakdown point: the
+fraction of data that can be outlying before the fit starts missing the
+inlying data.
+
+Note that, in general, robust fitting in high-dimensional settings (large
+`n_features`) is very hard. The robust models here will probably not work
+in these settings.
+
+
+.. topic:: **Trade-offs: which estimator?**
+
+   Scikit-learn provides 2 robust regression estimators:
+   :ref:`RANSAC <ransac_regression>` and
+   :ref:`Theil-Sen <theil_sen_regression>`.
+
+   * :ref:`RANSAC <ransac_regression>` is faster and scales much better
+     with the number of samples.
+
+   * :ref:`RANSAC <ransac_regression>` deals better with large
+     outliers in the y direction (the most common situation).
+
+   * :ref:`Theil-Sen <theil_sen_regression>` copes better with
+     medium-size outliers in the X direction, but this property
+     disappears in high-dimensional settings.
+
+   When in doubt, use :ref:`RANSAC <ransac_regression>`.
+
+.. _ransac_regression:
+
+RANSAC: RANdom SAmple Consensus
+--------------------------------
+
+RANSAC (RANdom SAmple Consensus) fits a model from random subsets of
+inliers from the complete data set.
 
-It is an iterative method to estimate the parameters of a mathematical model.
 RANSAC is a non-deterministic algorithm producing only a reasonable result with
 a certain probability, which is dependent on the number of iterations (see
 `max_trials` parameter). It is typically used for linear and non-linear
@@ -812,6 +889,9 @@ estimated only from the determined inliers.
    :align: center
    :scale: 50%
 
+Details of the algorithm
+^^^^^^^^^^^^^^^^^^^^^^^^
+
 Each iteration performs the following steps:
 
 1. Select ``min_samples`` random samples from the original data and check
@@ -841,6 +921,7 @@ performance.
 .. topic:: Examples:
 
   * :ref:`example_linear_model_plot_ransac.py`
+  * :ref:`example_linear_model_plot_robust_fit.py`
 
 .. topic:: References:
 
@@ -853,6 +934,68 @@ performance.
    <http://www.bmva.org/bmvc/2009/Papers/Paper355/Paper355.pdf>`_
    Sunglok Choi, Taemin Kim and Wonpil Yu - BMVC (2009)
 
+.. _theil_sen_regression:
+
+Theil-Sen estimator: generalized-median-based estimator
+--------------------------------------------------------
+
+The :class:`TheilSenRegressor` estimator uses a generalization of the median in
+multiple dimensions. It is thus robust to multivariate outliers. Note however
+that the robustness of the estimator decreases quickly with the dimensionality
+of the problem. It loses its robustness properties and becomes no
+better than ordinary least squares in high dimensions.
+
+.. topic:: Examples:
+
+  * :ref:`example_linear_model_plot_theilsen.py`
+  * :ref:`example_linear_model_plot_robust_fit.py`
+
+.. topic:: References:
+
+ * http://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator
+
+Theoretical considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`TheilSenRegressor` is comparable to :ref:`Ordinary Least Squares
+(OLS) <ordinary_least_squares>` in terms of asymptotic efficiency and as an
+unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric
+method, which means it makes no assumption about the underlying
+distribution of the data. Since Theil-Sen is a median-based estimator, it
+is more robust against corrupted data (outliers). In the univariate
+setting, Theil-Sen has a breakdown point of about 29.3% for a
+simple linear regression, which means that it can tolerate arbitrarily
+corrupted data of up to 29.3%.
+
+.. figure:: ../auto_examples/linear_model/images/plot_theilsen_001.png
+   :target: ../auto_examples/linear_model/plot_theilsen.html
+   :align: center
+   :scale: 50%
+
+The implementation of :class:`TheilSenRegressor` in scikit-learn follows a
+generalization to a multivariate linear regression model [#f1]_ using the
+spatial median which is a generalization of the median to multiple
+dimensions [#f2]_.
+
+In terms of time and space complexity, Theil-Sen scales according to
+
+.. math::
+    \binom{n_{samples}}{n_{subsamples}}
+
+which makes an exhaustive fit infeasible for problems with a
+large number of samples and features. Therefore, the size of the
+subpopulation can be capped to limit the time and space complexity by
+considering only a random subset of all possible combinations.
+
+.. topic:: Examples:
+
+  * :ref:`example_linear_model_plot_theilsen.py`
+
+.. topic:: References:
+
+    .. [#f1] Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang: `Theil-Sen Estimators in a Multiple Linear Regression Model. <http://www.math.iupui.edu/~hpeng/MTSE_0908.pdf>`_
+
+    .. [#f2] T. Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. <http://users.jyu.fi/~samiayr/pdf/ayramo_eurogen05.pdf>`_
 
 .. _polynomial_regression:
 
@@ -965,3 +1108,6 @@ This way, we can solve the XOR problem with a linear classifier::
     >>> clf = Perceptron(fit_intercept=False, n_iter=10).fit(X, y)
     >>> clf.score(X, y)
     1.0
+
+
+
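Note on the complexity and breakdown-point statements in the Theil-Sen section of ``linear_model.rst``: the following is a minimal sketch, not part of the patch, that evaluates the binomial count for a few problem sizes and the closed form ``1 - 1/sqrt(2)`` that is commonly cited for the 29.3% figure. The helper name ``n_subsets`` and the sample sizes are illustrative assumptions.

```python
import math


def n_subsets(n_samples, n_subsamples):
    """Number of distinct subsets an exhaustive Theil-Sen fit would visit.

    Hypothetical helper for illustration; it only evaluates the binomial
    coefficient quoted in the documentation.
    """
    return math.comb(n_samples, n_subsamples)


# Simple linear regression with an intercept: subsets of 2 points.
print(n_subsets(200, 2))     # 19900
print(n_subsets(1000, 2))    # 499500
# Three features plus an intercept: subsets of 4 points.
print(n_subsets(1000, 4))    # 41417124750

# Commonly cited closed form for the simple linear regression case.
breakdown = 1 - 1 / math.sqrt(2)
print(round(100 * breakdown, 1))  # 29.3
```

The count grows quickly with both arguments, which is why considering only a random subset of the combinations is attractive for larger problems.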
diff --git a/examples/linear_model/plot_robust_fit.py b/examples/linear_model/plot_robust_fit.py
@@ -0,0 +1,87 @@
+"""
+Robust linear estimator fitting
+===============================
+
+Here a sine function is fit with a polynomial of order 3, for values
+close to zero.
+
+Robust fitting is demoed in different situations:
+
+- No measurement errors, only modelling errors (fitting a sine with a
+  polynomial)
+
+- Measurement errors in X
+
+- Measurement errors in y
+
+The mean squared error on new, non-corrupt data is used to judge
+the quality of the prediction.
+
+What we can see:
+
+- RANSAC is good for strong outliers in the y direction
+
+- TheilSen is good for small outliers, both in direction X and y, but has
+  a breakdown point above which it performs worse than OLS.
+
+"""
+
+from matplotlib import pyplot as plt
+import numpy as np
+
+from sklearn import linear_model, metrics
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.pipeline import make_pipeline
+
+np.random.seed(42)
+
+X = np.random.normal(size=400)
+y = np.sin(X)
+# Make sure that X is 2D
+X = X[:, np.newaxis]
+
+X_test = np.random.normal(size=200)
+y_test = np.sin(X_test)
+X_test = X_test[:, np.newaxis]
+
+y_errors = y.copy()
+y_errors[::3] = 3
+
+X_errors = X.copy()
+X_errors[::3] = 3
+
+y_errors_large = y.copy()
+y_errors_large[::3] = 10
+
+X_errors_large = X.copy()
+X_errors_large[::3] = 10
+
+estimators = [('OLS', linear_model.LinearRegression()),
+              ('Theil-Sen', linear_model.TheilSenRegressor(random_state=42)),
+              ('RANSAC', linear_model.RANSACRegressor(random_state=42)), ]
+
+x_plot = np.linspace(X.min(), X.max())
+
+for title, this_X, this_y in [
+        ('Modeling errors only', X, y),
+        ('Corrupt X, small deviants', X_errors, y),
+        ('Corrupt y, small deviants', X, y_errors),
+        ('Corrupt X, large deviants', X_errors_large, y),
+        ('Corrupt y, large deviants', X, y_errors_large)]:
+    plt.figure(figsize=(5, 4))
+    plt.plot(this_X[:, 0], this_y, 'k+')
+
+    for name, estimator in estimators:
+        model = make_pipeline(PolynomialFeatures(3), estimator)
+        model.fit(this_X, this_y)
+        mse = metrics.mean_squared_error(y_test, model.predict(X_test))
+        y_plot = model.predict(x_plot[:, np.newaxis])
+        plt.plot(x_plot, y_plot,
+                 label='%s: error = %.3f' % (name, mse))
+
+    plt.legend(loc='best', frameon=False,
+               title='Error: mean squared error\n on non-corrupt data')
+    plt.xlim(-4, 10.2)
+    plt.ylim(-2, 10.2)
+    plt.title(title)
+plt.show()
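Note on the RANSAC results in ``plot_robust_fit.py``: the sketch below is a simplified, pure-Python illustration of the random-sample-consensus idea for a straight line. It is not scikit-learn's implementation. The argument names ``min_samples``, ``residual_threshold`` and ``max_trials`` mirror the ``RANSACRegressor`` parameters, while ``fit_line``, ``ransac_line``, the threshold value and the toy data are assumptions made for illustration.

```python
import random


def fit_line(points):
    """Ordinary least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, min_samples=2, residual_threshold=0.5,
                max_trials=100, seed=0):
    """Illustrative RANSAC loop: keep the largest consensus set found."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(max_trials):
        sample = rng.sample(points, min_samples)
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate sample: cannot define a line
        slope, intercept = fit_line(sample)
        # Points whose residual is within the threshold form the consensus set.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= residual_threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no usable consensus set found")
    # The final model is re-estimated from the consensus set only.
    return fit_line(best_inliers), best_inliers


# Toy data (assumed): y = 3x + 2, with every fifth target replaced by 100.
points = [(float(x), 3.0 * x + 2.0) for x in range(30)]
for i in range(0, 30, 5):
    points[i] = (points[i][0], 100.0)

(slope, intercept), inliers = ransac_line(points)
ols_slope, ols_intercept = fit_line(points)
print("RANSAC:", slope, intercept, "inliers:", len(inliers))
print("OLS   :", ols_slope, ols_intercept)
```

Because the result depends on which samples are drawn, the number of trials controls how likely it is that at least one sample contains only inliers.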
diff --git a/examples/linear_model/plot_theilsen.py b/examples/linear_model/plot_theilsen.py
@@ -0,0 +1,108 @@
+"""
+====================
+Theil-Sen Regression
+====================
+
+Computes a Theil-Sen Regression on a synthetic dataset.
+
+See :ref:`theil_sen_regression` for more information on the regressor.
+
+Compared to the OLS (ordinary least squares) estimator, the Theil-Sen
+estimator is robust against outliers. It has a breakdown point of about 29.3%
+in the case of a simple linear regression, which means that it can tolerate
+arbitrarily corrupted data (outliers) of up to 29.3% in the two-dimensional
+case.
+
+The estimation of the model is done by calculating the slopes and intercepts
+of a subpopulation of all possible combinations of p subsample points. If an
+intercept is fitted, p must be greater than or equal to n_features + 1. The
+final slope and intercept are then defined as the spatial median of these
+slopes and intercepts.
+
+In certain cases Theil-Sen performs better than :ref:`RANSAC
+<ransac_regression>` which is also a robust method. This is illustrated in the
+second example below where outliers with respect to the x-axis perturb RANSAC.
+Tuning the ``residual_threshold`` parameter of RANSAC remedies this but in
+general a priori knowledge about the data and the nature of the outliers is
+needed.
+Due to the computational complexity of Theil-Sen it is recommended to use it
+only for small problems in terms of number of samples and features. For larger
+problems the ``max_subpopulation`` parameter restricts the set of all
+possible combinations of p subsample points to a randomly chosen subset and
+therefore also limits the runtime. Theil-Sen thus remains applicable to larger
+problems with the drawback of losing some of its mathematical properties since
+it then works on a random subset.
+"""
+
+# Author: Florian Wilhelm -- <florian.wilhelm@gmail.com>
+# License: BSD 3 clause
+
+import time
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.linear_model import LinearRegression, TheilSenRegressor
+from sklearn.linear_model import RANSACRegressor
+
+print(__doc__)
+
+estimators = [('OLS', LinearRegression()),
+              ('Theil-Sen', TheilSenRegressor(random_state=42)),
+              ('RANSAC', RANSACRegressor(random_state=42)), ]
+
+##############################################################################
+# Outliers only in the y direction
+
+np.random.seed(0)
+n_samples = 200
+# Linear model y = 3*x + N(2, 0.1**2)
+x = np.random.randn(n_samples)
+w = 3.
+c = 2.
+noise = 0.1 * np.random.randn(n_samples)
+y = w * x + c + noise
+# 10% outliers
+y[-20:] += -20 * x[-20:]
+X = x[:, np.newaxis]
+
+plt.plot(x, y, 'k+', mew=2, ms=8)
+line_x = np.array([-3, 3])
+for name, estimator in estimators:
+    t0 = time.time()
+    estimator.fit(X, y)
+    elapsed_time = time.time() - t0
+    y_pred = estimator.predict(line_x.reshape(2, 1))
+    plt.plot(line_x, y_pred,
+             label='%s (fit time: %.2fs)' % (name, elapsed_time))
+
+plt.axis('tight')
+plt.legend(loc='upper left')
+
+
+##############################################################################
+# Outliers in the X direction
+
+np.random.seed(0)
+# Linear model y = 3*x + N(2, 0.1**2)
+x = np.random.randn(n_samples)
+noise = 0.1 * np.random.randn(n_samples)
+y = 3 * x + 2 + noise
+# 10% outliers
+x[-20:] = 9.9
+y[-20:] += 22
+X = x[:, np.newaxis]
+
+plt.figure()
+plt.plot(x, y, 'k+', mew=2, ms=8)
+
+line_x = np.array([-3, 10])
+for name, estimator in estimators:
+    t0 = time.time()
+    estimator.fit(X, y)
+    elapsed_time = time.time() - t0
+    y_pred = estimator.predict(line_x.reshape(2, 1))
+    plt.plot(line_x, y_pred,
+             label='%s (fit time: %.2fs)' % (name, elapsed_time))
+
+plt.axis('tight')
+plt.legend(loc='upper left')
+plt.show()
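Note on the estimation procedure described in the ``plot_theilsen.py`` docstring: for a single feature, the classic Theil-Sen estimate is the median of the slopes over all pairs of points. The sketch below shows that univariate special case in pure Python. It is not the ``TheilSenRegressor`` implementation, which takes a spatial median over subsets of p points. The function name ``theil_sen_line``, the toy data, and the choice to take the intercept as the median of ``y - slope * x`` are assumptions made for illustration.

```python
from itertools import combinations
from statistics import median


def theil_sen_line(x, y):
    """Univariate Theil-Sen sketch: median of all pairwise slopes."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = median(slopes)
    # One common convention for the intercept (an assumption here):
    # the median of the residual offsets.
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Toy data (assumed): y = 3x + 2 with 3 of 20 targets (15%) corrupted.
x = [float(v) for v in range(20)]
y = [3.0 * v + 2.0 for v in x]
for i in (3, 11, 17):
    y[i] = -40.0

slope, intercept = theil_sen_line(x, y)
print(slope, intercept)  # 3.0 2.0
```

Of the 190 point pairs, 136 involve only uncorrupted points and have slope exactly 3, so the median is unaffected by the corrupted targets in this example.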