-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] TheilSen robust linear regression #2949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
99 commits
Select commit
Hold shift + click to select a range
d4519d2
Added multiple linear Theil-Sen regression
FlorianWilhelm b3e8a64
Added an example and documentatin for Theil-Sen
FlorianWilhelm 7dea60d
Added subpopulation parameter to Theil-Sen
FlorianWilhelm 8d48d45
Added parallelization support to TheilSen Estimator
FlorianWilhelm 86a7461
Improved parallelization for Theilsen estimator
FlorianWilhelm 601b82b
Merge branch 'master' into theilsen
FlorianWilhelm c19d967
Cleanups and corrections for Theil-Sen regression.
FlorianWilhelm 8653c20
Removed subpopulation=None option in TheilSen
FlorianWilhelm 960c45b
xrange fix for Python3 in TheilSen
FlorianWilhelm 0eda5bf
FIX Theil-Sen unittest for older Scipy Version
FlorianWilhelm d133712
FIX that some functions in Theil-Sen were public
FlorianWilhelm d1d221b
FIX usage of linalg from Scipy in Theil-Sen
FlorianWilhelm 61a5195
FIX: Let Theil-Sen handle n_samples < n_features case
FlorianWilhelm 13d9212
FIX: Python 2.6 format syntax in Theil-Sen
FlorianWilhelm 3e8ca8c
Vectorization of theilsen._modweiszfeld_step
FlorianWilhelm 1d75896
FIX: Parallel unittests for Theil-Sen estimator
FlorianWilhelm b374e7e
FIX: TheilSen supports old Numpy versions
0b10f49
DOC: Comparison of Theil-Sen and RANSAC
FlorianWilhelm ca9a275
DOC: Fixed typo in Theil-Sen example.
FlorianWilhelm 675babb
FIX: Some coding style fixes in TheilSen unittest.
FlorianWilhelm 279dc90
FIX: Reduced the runtime of the TheilSen unittest.
FlorianWilhelm 39efe30
DOC: Small corrections in the docs of Theil-Sen
FlorianWilhelm 689bf29
DOC: improve Thiel-Sen vs RANSAC example
GaelVaroquaux b2366b2
ENH: speed up theilsen
GaelVaroquaux 43d76ac
DOC: Explanation when TheilSen outperforms RANSAC.
FlorianWilhelm 8935df6
Merge branch 'master' into theilsen
4b6768f
Merge branch 'master' into theilsen
FlorianWilhelm e4503b1
Fix for old Numpy 1.6.3
1119e3b
Added to comments to better explain last commit
b6ed218
Use string argument for legend's loc parameter
5189cfb
ENH: add a median absolute deviation metric
GaelVaroquaux 434c20d
DOC: add an example of robust fitting
GaelVaroquaux e7fa78d
TST: fix doctest
GaelVaroquaux 110d3f2
DOC: better documentation for robust models
GaelVaroquaux d3cddfe
API: naming: CamelCase class -> camel_case function
GaelVaroquaux c03d4bd
Merge remote-tracking branch 'gvaroquaux/pr_2949' into theilsen
FlorianWilhelm 2fcba22
Merge remote-tracking branch 'upstream/master' into theilsen
FlorianWilhelm 181c720
TST: Cleanups in test_theil_sen
FlorianWilhelm e44f9c7
COSMIT: Renamed _lse to _lstsq in theil_sen.py
FlorianWilhelm 09d44f1
ENH: Removed shared-memory parallelism in theil_sen
FlorianWilhelm 970e4c2
COSMIT: Inlined two methods in theil_sen.py
FlorianWilhelm 00071bd
ENH: Use warnings instead of logging in theil_sen
FlorianWilhelm 1314697
ENH: Removed _split_indices method in TheilSen
FlorianWilhelm d105145
ENH: Rewrote TheilSen._get_n_jobs as a function
FlorianWilhelm 43b5e43
COSMIT: More explicit names for vars in theil_sen
FlorianWilhelm 8dbec4f
Merge branch 'master' into theilsen
FlorianWilhelm 3e70d84
FIX: usage of check_array in theil_sen
FlorianWilhelm 030f7e1
FIX: Use check_consistent_length in theil_sen
FlorianWilhelm 9db5ba1
ENH: Refactoring in theil_sen
FlorianWilhelm b28e2b6
ENH: Removed unnecessary generator in theil_sen
FlorianWilhelm 49d3043
FIX: doctest of get_n_jobs
FlorianWilhelm 7d2179b
ENH: Theil-Sen vs. RANSAC example
FlorianWilhelm b0ab714
Merge branch 'master' into theilsen
FlorianWilhelm a800040
COSMIT: Small changes regarding Theil-Sen
FlorianWilhelm 3de9f22
DOC: Better documentation for Theil-Sen
FlorianWilhelm ef17f80
ENH: Improvements in the Theil-Sen regressor
FlorianWilhelm 961bf67
ENH: Shortcut for 1d case in spatial median
FlorianWilhelm 374ac6d
ENH: Avoid trailing \ in test_theilsen imports
FlorianWilhelm cf3b476
FIX: TheilSen -> TheilSenRegressor in docs
FlorianWilhelm ef43e7d
DOC: Narrative doc for median_absolute_error
53d1711
ENH: Reworked _modified_weiszfeld_step
afda6b0
DOC: Improved _spatial_median docs
89d5aa0
COSMIT: Replaced xrange by range
51cef86
COSMIT: Renamed y -> x_old in _modified_weiszfeld_step
d3de579
COSMIT: Renamed spmed[_old] to spatial_median[_old]
1f67b56
COSMIT: Break and for .. else in _spatial_median
9ce2444
ENH: Reworked _lstsq in theil_sen.py
389d13d
ENH: Replace AssertionError by ValueError
25746e2
COSMIT: Improved error message in theil_sen.py
8955fd0
COSMIT: Renamed n_all to all_combinations in theil_sen.py
961a2ea
COSMIT: Consistent naming for n_subpop
27ef88f
COSMIT: Fixed pep8 problem in theil_sen.py
4633c52
DOC: Moved notes section to long description
a5d2fd9
Merge branch 'master' into theilsen
aadd504
DOC: Added return doc to _lstsq
FlorianWilhelm 562f4e4
COSMIT: Better variable names for _modified_weiszfeld_step
FlorianWilhelm d9bdddd
COSMIT: Moved epsilon to module level
FlorianWilhelm 9c1b0c0
COSMIT: Some empty lines for better readability
FlorianWilhelm 076bad6
ENH: Removed median_absolute_error due to PR #3761
FlorianWilhelm 5b3a387
ENH: Made n_subpopulation a fit parameter
FlorianWilhelm 9a3ad6a
COSMIT: Some renamings and PEP8 compliance
9ebed28
Merge branch 'master' into theilsen
657c86f
ENH: Removed 1d shortcut in _spatial_median again
8e5347b
COSMIT: Clearer slicing syntax in _modified_weiszfeld_step
6564d23
ENH: Fixed confusing X = X.T renaming
7a1a35d
COSMIT: Some renamings in _lstsq
923619f
FIX: Fix of last merge with master
c377ccd
COSMIT: Renamings for easier understanding
ec663e6
COSMIT: Another slicing syntax cleanup
29c21f0
Revert "ENH: Removed 1d shortcut in _spatial_median again"
777cd87
ENH: sample without replacement
64fc5c9
Merge branch 'master' into theilsen
4fcd6b8
COSMIT: pep8 and renaming
b8cd60b
COSMIT: replaced assert by assert_less/greater etc.
d5ec39b
TEST: No console output during unit tests
f9ecbf7
ENH: Always set random_state in unit tests
f92074a
ENH: Speedup of unit tests
c54c17e
COSMIT: Better consistency
2137d82
ENH: Added random_state in plot_theilsen.py
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
""" | ||
Robust linear estimator fitting | ||
=============================== | ||
|
||
Here a sine function is fit with a polynomial of order 3, for values | ||
close to zero. | ||
|
||
Robust fitting is demoed in different situations: | ||
|
||
- No measurement errors, only modelling errors (fitting a sine with a | ||
polynomial) | ||
|
||
- Measurement errors in X | ||
|
||
- Measurement errors in y | ||
|
||
The median absolute deviation to non corrupt new data is used to judge | ||
the quality of the prediction. | ||
|
||
What we can see that: | ||
|
||
- RANSAC is good for strong outliers in the y direction | ||
|
||
- TheilSen is good for small outliers, both in direction X and y, but has | ||
a break point above which it performs worst than OLS. | ||
|
||
""" | ||
|
||
from matplotlib import pyplot as plt | ||
import numpy as np | ||
|
||
from sklearn import linear_model, metrics | ||
from sklearn.preprocessing import PolynomialFeatures | ||
from sklearn.pipeline import make_pipeline | ||
|
||
np.random.seed(42) | ||
|
||
X = np.random.normal(size=400) | ||
y = np.sin(X) | ||
# Make sure that it X is 2D | ||
X = X[:, np.newaxis] | ||
|
||
X_test = np.random.normal(size=200) | ||
y_test = np.sin(X_test) | ||
X_test = X_test[:, np.newaxis] | ||
|
||
y_errors = y.copy() | ||
y_errors[::3] = 3 | ||
|
||
X_errors = X.copy() | ||
X_errors[::3] = 3 | ||
|
||
y_errors_large = y.copy() | ||
y_errors_large[::3] = 10 | ||
|
||
X_errors_large = X.copy() | ||
X_errors_large[::3] = 10 | ||
|
||
estimators = [('OLS', linear_model.LinearRegression()), | ||
('Theil-Sen', linear_model.TheilSenRegressor(random_state=42)), | ||
('RANSAC', linear_model.RANSACRegressor(random_state=42)), ] | ||
|
||
x_plot = np.linspace(X.min(), X.max()) | ||
|
||
for title, this_X, this_y in [ | ||
('Modeling errors only', X, y), | ||
('Corrupt X, small deviants', X_errors, y), | ||
('Corrupt y, small deviants', X, y_errors), | ||
('Corrupt X, large deviants', X_errors_large, y), | ||
('Corrupt y, large deviants', X, y_errors_large)]: | ||
plt.figure(figsize=(5, 4)) | ||
plt.plot(this_X[:, 0], this_y, 'k+') | ||
|
||
for name, estimator in estimators: | ||
model = make_pipeline(PolynomialFeatures(3), estimator) | ||
model.fit(this_X, this_y) | ||
mse = metrics.mean_squared_error(model.predict(X_test), y_test) | ||
y_plot = model.predict(x_plot[:, np.newaxis]) | ||
plt.plot(x_plot, y_plot, | ||
label='%s: error = %.3f' % (name, mse)) | ||
|
||
plt.legend(loc='best', frameon=False, | ||
title='Error: mean absolute deviation\n to non corrupt data') | ||
plt.xlim(-4, 10.2) | ||
plt.ylim(-2, 10.2) | ||
plt.title(title) | ||
plt.show() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
""" | ||
==================== | ||
Theil-Sen Regression | ||
==================== | ||
|
||
Computes a Theil-Sen Regression on a synthetic dataset. | ||
|
||
See :ref:`theil_sen_regression` for more information on the regressor. | ||
|
||
Compared to the OLS (ordinary least squares) estimator, the Theil-Sen | ||
estimator is robust against outliers. It has a breakdown point of about 29.3% | ||
in case of a simple linear regression which means that it can tolerate | ||
arbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional | ||
case. | ||
|
||
The estimation of the model is done by calculating the slopes and intercepts | ||
of a subpopulation of all possible combinations of p subsample points. If an | ||
intercept is fitted, p must be greater than or equal to n_features + 1. The | ||
final slope and intercept is then defined as the spatial median of these | ||
slopes and intercepts. | ||
|
||
In certain cases Theil-Sen performs better than :ref:`RANSAC | ||
<ransac_regression>` which is also a robust method. This is illustrated in the | ||
second example below where outliers with respect to the x-axis perturb RANSAC. | ||
Tuning the ``residual_threshold`` parameter of RANSAC remedies this but in | ||
general a priori knowledge about the data and the nature of the outliers is | ||
needed. | ||
Due to the computational complexity of Theil-Sen it is recommended to use it | ||
only for small problems in terms of number of samples and features. For larger | ||
problems the ``max_subpopulation`` parameter restricts the magnitude of all | ||
possible combinations of p subsample points to a randomly chosen subset and | ||
therefore also limits the runtime. Therefore, Theil-Sen is applicable to larger | ||
problems with the drawback of losing some of its mathematical properties since | ||
it then works on a random subset. | ||
""" | ||
|
||
# Author: Florian Wilhelm -- <florian.wilhelm@gmail.com> | ||
# License: BSD 3 clause | ||
|
||
import time | ||
import numpy as np | ||
import matplotlib.pyplot as plt | ||
from sklearn.linear_model import LinearRegression, TheilSenRegressor | ||
from sklearn.linear_model import RANSACRegressor | ||
|
||
print(__doc__) | ||
|
||
estimators = [('OLS', LinearRegression()), | ||
('Theil-Sen', TheilSenRegressor(random_state=42)), | ||
('RANSAC', RANSACRegressor(random_state=42)), ] | ||
|
||
############################################################################## | ||
# Outliers only in the y direction | ||
|
||
np.random.seed(0) | ||
n_samples = 200 | ||
# Linear model y = 3*x + N(2, 0.1**2) | ||
x = np.random.randn(n_samples) | ||
w = 3. | ||
c = 2. | ||
noise = 0.1 * np.random.randn(n_samples) | ||
y = w * x + c + noise | ||
# 10% outliers | ||
y[-20:] += -20 * x[-20:] | ||
X = x[:, np.newaxis] | ||
|
||
plt.plot(x, y, 'k+', mew=2, ms=8) | ||
line_x = np.array([-3, 3]) | ||
for name, estimator in estimators: | ||
t0 = time.time() | ||
estimator.fit(X, y) | ||
elapsed_time = time.time() - t0 | ||
y_pred = estimator.predict(line_x.reshape(2, 1)) | ||
plt.plot(line_x, y_pred, | ||
label='%s (fit time: %.2fs)' % (name, elapsed_time)) | ||
|
||
plt.axis('tight') | ||
plt.legend(loc='upper left') | ||
|
||
|
||
############################################################################## | ||
# Outliers in the X direction | ||
|
||
np.random.seed(0) | ||
# Linear model y = 3*x + N(2, 0.1**2) | ||
x = np.random.randn(n_samples) | ||
noise = 0.1 * np.random.randn(n_samples) | ||
y = 3 * x + 2 + noise | ||
# 10% outliers | ||
x[-20:] = 9.9 | ||
y[-20:] += 22 | ||
X = x[:, np.newaxis] | ||
|
||
plt.figure() | ||
plt.plot(x, y, 'k+', mew=2, ms=8) | ||
|
||
line_x = np.array([-3, 10]) | ||
for name, estimator in estimators: | ||
t0 = time.time() | ||
estimator.fit(X, y) | ||
elapsed_time = time.time() - t0 | ||
y_pred = estimator.predict(line_x.reshape(2, 1)) | ||
plt.plot(line_x, y_pred, | ||
label='%s (fit time: %.2fs)' % (name, elapsed_time)) | ||
|
||
plt.axis('tight') | ||
plt.legend(loc='upper left') | ||
plt.show() |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels strange to call a linear model a non parametric model: the number of parameters is fixed (it's the number of features + 1) and it assumes that the output variable is linearly dependent on the input variables (excluding the outliers). I would not call that a non-parametric model.
To me a non-parametric model can grow in complexity with the number of samples in the training set. Here the complexity (the number of parameters) is bounded by the number of features which is assumed to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel From Wikipedia: "In statistics, a parametric model or parametric family or finite-dimensional model is a family of distributions that can be described using a finite number of parameters."
Least squares for instance makes the assumption that the error is normally distributed. The parameters of this distribution is then found by maximum likelihood. At least this is one way to derive the least squares method. On the other hand, Theil-Sen makes no assumption about the underlying distribution, the method does not try to find the parameters of some distribution and is therefore considered non-parametric although some parameters are of course determined but not of a distribution.
I think that is just about how one defines parametric and non-parametric. Can I keep it that way since this is consistent with the definition in statistics?