[MRG+1] TheilSen robust linear regression #2949


Merged
merged 99 commits into scikit-learn:master from FlorianWilhelm:theilsen on Nov 20, 2014

Conversation

FlorianWilhelm
Contributor

A multiple linear Theil-Sen regression for the Scikit-Learn toolbox. The implementation is based on the algorithm from the paper "Theil-Sen Estimators in a Multiple Linear Regression Model" by Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang. It is parallelized with the help of joblib.
On a personal note, I think that the popular Theil-Sen regression would be a nice addition to Scikit-Learn.
I am looking forward to your feedback.

Florian
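
A minimal usage sketch of the estimator this PR adds (the class name TheilSenRegressor matches the merged API seen later in this thread; the data and parameter values below are illustrative):

import numpy as np
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = X.dot(np.array([1., 2.])) + rng.normal(scale=0.1, size=50)
y[:5] += 10.  # a few gross outliers that would skew ordinary least squares

# n_jobs exposes the joblib parallelism mentioned above
reg = TheilSenRegressor(random_state=0, n_jobs=-1).fit(X, y)
print(reg.coef_, reg.intercept_)  # coefficients stay close to [1., 2.] despite the outliers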

res = np.zeros(y.shape)    # sum of (x - y) / ||x - y|| over samples
T_nom = np.zeros(y.shape)  # sum of x / ||x - y||
T_denom = 0.               # sum of 1 / ||x - y||
for x in X.T:
@jnothman
Member

I think it should be possible to vectorize this loop (i.e. not use a Python for loop) with something like:

diff = X.T - y                    # (n_samples, n_features)
normdiff = norm(diff, axis=1)     # distance of each sample to y
mask = normdiff >= 1e-6           # ignore samples that coincide with y
if mask.sum() < X.shape[1]:       # y equals one of the samples
    eta = 1.
diff = diff[mask, :]
normdiff = normdiff[mask]
res = np.sum(diff / normdiff[:, np.newaxis], axis=0)
T_denom = np.sum(1 / normdiff)
T_nom = np.sum(X[:, mask] / normdiff, axis=1)

(warning: code untested)

@FlorianWilhelm
Contributor Author

Thanks for pointing this out, I fixed it in commit 3e8ca8c.
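
For reference, a self-contained sketch of such a vectorized modified Weiszfeld step (not the exact code from commit 3e8ca8c; it assumes X has shape (n_samples, n_features) and y is the current spatial-median estimate):

import numpy as np

def modweiszfeld_step(X, y, eps=1e-6):
    diff = X - y                                   # (n_samples, n_features)
    normdiff = np.sqrt(np.sum(diff ** 2, axis=1))  # distance of each sample to y
    mask = normdiff >= eps                         # drop samples coinciding with y
    eta = 1. if mask.sum() < X.shape[0] else 0.    # 1 if y equals one of the samples
    diff = diff[mask]
    normdiff = normdiff[mask][:, np.newaxis]
    r = np.linalg.norm(np.sum(diff / normdiff, axis=0))
    if r < eps:                                    # y is already a fixed point
        return y
    T = np.sum(X[mask] / normdiff, axis=0) / np.sum(1. / normdiff)
    return max(0., 1. - eta / r) * T + min(1., eta / r) * y

Iterating this step from, e.g., the coordinate-wise mean converges to the spatial median that Theil-Sen uses to aggregate the subsample solutions.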

@agramfort
Member

Can this method work with n_samples < n_features?
How does it compare to RANSAC, which we already have?

Thanks

  • Usage of fmin_bfgs instead of minimize.
  • _spatial_median, _breakdown_point and _modweiszfeld_step are now private, as suggested by agramfort.
  • Moved _lse before the TheilSen class.
  • Replaced usage of numpy.linalg by scipy.linalg, as suggested by agramfort.
  • Theil-Sen now falls back to least squares if the number of samples is smaller than the number of features.
  • Fixed the "zero length field name in format" ValueError under Python 2.6 in the TheilSen class.
  • Replaced the for-loop in linear_model.theilsen._modweiszfeld_step by array operations, as suggested by jnothman.
  • Instead of testing the various codepaths via the n_jobs of joblib, the functions _get_n_jobs and _split_indices of the theilsen class are now tested, as pointed out by agramfort.
@FlorianWilhelm
Contributor Author

For the n_samples < n_features case I made some modifications in commit 61a5195 to fall back to least squares, since this Theil-Sen implementation can be seen as a generalization of the ordinary least squares method anyway.
So far I have not found a direct comparison between RANSAC and Theil-Sen, but I am not that familiar with RANSAC, so I should have a deeper look into this subject. My impression so far is that RANSAC is a more heuristic approach coming from computer science, while Theil-Sen comes from robust statistics.

@chrisjordansquire

Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels? (My thinking is that scikit-learn is focused on predicting outputs, while statsmodels is focused on inference about parameters.)

@FlorianWilhelm
Contributor Author

@chrisjordansquire In order to predict anything you always have to train the parameters of your model, and quite often your training samples are less than perfect (i.e. they contain outliers, or the errors are not normally distributed). This is where non-parametric methods like Theil-Sen shine, since they do not assume the errors to be normally distributed.
The new RANSAC estimator also seems to fit into the class of robust estimators. From my understanding, it takes an estimator like LinearRegression and presents a proper subset of the samples to it in order to remove the negative effects of outliers. In this way you could also say that all RANSAC does is help LinearRegression infer the right parameters.
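
To make that comparison concrete, a minimal sketch of the two robust estimators side by side (illustrative data; both classes live in sklearn.linear_model):

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 1))
y = 3. * X.ravel() + rng.normal(size=100)
y[::10] += 20.  # inject gross outliers

# RANSAC repeatedly fits the wrapped estimator on random consensus subsets
ransac = RANSACRegressor(LinearRegression()).fit(X, y)
# Theil-Sen aggregates least-squares fits over many subsamples
theilsen = TheilSenRegressor(random_state=42).fit(X, y)
print(ransac.estimator_.coef_, theilsen.coef_)  # both should be near 3.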

@coveralls


Coverage remained the same when pulling b374e7e on FlorianWilhelm:theilsen into 64fd085 on scikit-learn:master.

@GaelVaroquaux
Member

Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels?

I am also hesitating on whether this estimator should go in scikit-learn, or whether we should push users to statsmodels. The reason that we have a RANSAC is that it is widely used in computer vision, where people really only care about its ability to fit and predict, and not about the corresponding p-values.

I am +0 on including it in scikit-learn: robust models are sometimes really useful for prediction. However, I would indeed like to hear a little discussion about the use cases people have in mind.

@FlorianWilhelm
Contributor Author

@GaelVaroquaux: I understand the hesitation to include yet another estimator, especially since RANSAC and Theil-Sen seem to open the gates for a new class of robust estimators, and we want scikit-learn to stay focused on its goals: an easy-to-use machine learning library with a consistent interface.
But for exactly that reason we are heavily using scikit-learn at Blue Yonder (http://www.blue-yonder.com/en/) to provide predictive analysis products to our customers. At its core, everything we do is about making good predictions with the help of machine learning. For our own algorithms we have adapted the scikit-learn interface in order to extend it, and that works really well for us.
In some projects we had to deal with suboptimal data from customers (outliers that are hard to detect), and there we applied Theil-Sen successfully. Since Theil-Sen is quite a popular and well-known algorithm, my idea was to include it directly in scikit-learn where others can use it too, and I got the okay from management to do so. I am convinced that Theil-Sen (although coming from statistics) is a robust machine learning algorithm that nicely complements Scikit-Learn whenever the data you get is suboptimal in various ways.

@sebastianneubauer

I'm a bit surprised, because I never noticed that there is a clean separation between statsmodels and scikit-learn. Clearly this issue is the wrong place, but I would very much appreciate a discussion about this. At least for me, I wouldn't be surprised at all to find a Theil-Sen implementation in scikit-learn (and another one in scipy and one in statsmodels ;-) ).
On the contrary, I heavily use scikit-learn and would be happy to be able to try Theil-Sen (e.g. for preprocessing/regularisation) without introducing another dependency (I pretty much never need to introduce statsmodels as a dependency) and with the same interface as the other algorithms.
Last point from me: if someone spends time and wants to contribute to scikit-learn, shouldn't the prior be +1 as long as there are no strong arguments against the contribution?
Therefore +1 from me, as long as there is no official "that's not scikit-learn" guideline/rule (and having one would be great!).

@agramfort
Member

To make your point I would compare performance and speed with the RANSAC we already have. Often computer scientists' hacks make things pretty efficient in practice... statisticians care much less about performance... in general... :)

@coveralls


Coverage increased (+0.06%) when pulling f92074a on FlorianWilhelm:theilsen into 031a3fc on scikit-learn:master.

@FlorianWilhelm
Contributor Author

@arjoly Regarding the new examples, I used the Unix time command and looked at the user time, since the script only finishes when I close the plotting windows.

  • plot_theilsen: 1.865s
  • plot_robust_fit: 6.152s

print(__doc__)

estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor()),
@arjoly
Member

the random state is missing here

@FlorianWilhelm
Contributor Author

@arjoly Sorry that I forgot that one, my bad :-/

@arjoly
Member

Thanks!
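
The fix presumably amounts to seeding the estimator in the example list quoted above, along these lines (the value 42 is illustrative):

from sklearn.linear_model import LinearRegression, TheilSenRegressor

estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor(random_state=42))]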

@arjoly
Member

arjoly commented Oct 17, 2014

I won't be able to find much time for this before next week. @ogrisel and @GaelVaroquaux, feel free to finish the review and merge before then if you want.

@FlorianWilhelm
Contributor Author

@arjoly @GaelVaroquaux What are the next steps? Do I need to ping someone like @larsmans in order to get this merged? The related PR #3764 also has [MRG+1].

@larsmans
Member

I know nothing about this estimator, so no.

@MechCoder
Member

Is this ready to go? @ogrisel and @GaelVaroquaux, I see +1s from you, and another implicit +1 from @arjoly!

@ogrisel
Member

ogrisel commented Nov 20, 2014

Both @GaelVaroquaux and I already gave a +1, and the last batch of comments by @arjoly seems to have been addressed, so let's merge.

ogrisel added a commit that referenced this pull request Nov 20, 2014
 [MRG+1] TheilSen robust linear regression
@ogrisel ogrisel merged commit f0fe4af into scikit-learn:master Nov 20, 2014
@ogrisel
Member

ogrisel commented Nov 20, 2014

Thanks again @FlorianWilhelm for the contribution (especially the effort in the doc and examples)!

@GaelVaroquaux
Member

Thanks heaps. This is a quality contribution!

@arjoly
Member

arjoly commented Nov 20, 2014

Congratulations @FlorianWilhelm 👍

@FlorianWilhelm
Contributor Author

Thank you all for helping me make my first Scikit-Learn contribution! It was a lot of work, but I had tons of fun :-)

@ogrisel
Member

ogrisel commented Nov 20, 2014

@FlorianWilhelm I forgot: can you please open a new PR to add an entry to the doc/whats_new.rst file? Please link your name either to your github account or a personal webpage of your choice at the end of the file.

@FlorianWilhelm
Contributor Author

@ogrisel I updated whats_new.rst in PR #3870.

@FlorianWilhelm FlorianWilhelm deleted the theilsen branch November 21, 2014 12:33
@amueller
Member

amueller commented Jan 8, 2015

Stupid question: is there a simple way to make this run fast? It is quite slow in the common tests, even on trivial datasets. I guess setting n_subsamples is the trick, but that is only possible if you know the number of samples, right?

@FlorianWilhelm
Contributor Author

@amueller You could also use max_subpopulation to bound the number of subsets that are considered. If you want to reduce the runtime with n_subsamples, you need to know the number of samples.
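
A sketch of both knobs (the values are illustrative, not recommendations):

from sklearn.linear_model import TheilSenRegressor

# Cap the number of subsets drawn when "n choose k" gets huge:
fast = TheilSenRegressor(max_subpopulation=1000, random_state=0)

# Or fix the subsample size directly, which requires knowing n_samples:
n_samples, n_features = 200, 5
fixed = TheilSenRegressor(n_subsamples=n_samples - n_features, random_state=0)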

@jnothman
Member

jnothman commented Jan 8, 2015

Would it be appropriate to allow n_subsamples as a float to be a proportion of the number of samples?

@FlorianWilhelm
Contributor Author

No, the complexity is $\binom{n_{samples}}{n_{subsamples}}$. If you consider Pascal's triangle, the closer you get to the center of a row, the larger the number is. So if you have n_features and fit a problem with n_samples including an intercept, the efficiency only starts to improve once n_subsamples is greater than or equal to n_samples - n_features.
What we could do is specify it as a ratio of the number of features. If you choose ratio=1.0 (the default), you have the complexity $\binom{n_{samples}}{n_{subsamples}}$. Otherwise we change n_subsamples in order to get the complexity $\binom{n_{samples}}{n_{samples} - ratio \cdot n_{features}}$.
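
A quick numeric illustration of the Pascal's triangle argument (assuming SciPy; n_samples=20 and n_features=2 are arbitrary):

from scipy.special import comb

n_samples, n_features = 20, 2
for n_subsamples in (n_features + 1, n_samples // 2, n_samples - n_features):
    print(n_subsamples, comb(n_samples, n_subsamples, exact=True))
# 3 1140
# 10 184756   <- the center of the row is the worst case
# 18 190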
