[MRG+1] TheilSen robust linear regression #2949


Merged
merged 99 commits into scikit-learn:master from FlorianWilhelm:theilsen on Nov 20, 2014

Conversation

FlorianWilhelm
Contributor

A multiple linear Theil-Sen regression for the Scikit-Learn toolbox. The implementation is based on the algorithm from the paper "Theil-Sen Estimators in a Multiple Linear Regression Model" by Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang. It is parallelized with the help of joblib.
On a personal note, I think that the popular Theil-Sen regression would be a nice addition to Scikit-Learn.
I am looking forward to your feedback.

Florian
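
A minimal usage sketch of the estimator this PR adds (the class name TheilSenRegressor matches the merged API seen later in this thread; the data and parameter values below are illustrative):

import numpy as np
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = X.dot(np.array([1., 2.])) + rng.normal(scale=0.1, size=50)
y[:5] += 10.  # a few gross outliers that would skew ordinary least squares

# n_jobs exposes the joblib parallelism mentioned above
reg = TheilSenRegressor(random_state=0, n_jobs=-1).fit(X, y)
print(reg.coef_, reg.intercept_)  # coefficients stay close to [1., 2.] despite the outliers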

res = np.zeros(y.shape)    # sum of (x - y) / ||x - y|| over samples
T_nom = np.zeros(y.shape)  # sum of x / ||x - y||
T_denom = 0.               # sum of 1 / ||x - y||
for x in X.T:
@jnothman
Member

I think it should be possible to vectorize this loop (i.e. not use a Python for loop) with something like:

diff = X.T - y                    # (n_samples, n_features)
normdiff = norm(diff, axis=1)     # distance of each sample to y
mask = normdiff >= 1e-6           # ignore samples that coincide with y
if mask.sum() < X.shape[1]:       # y equals one of the samples
    eta = 1.
diff = diff[mask, :]
normdiff = normdiff[mask]
res = np.sum(diff / normdiff[:, np.newaxis], axis=0)
T_denom = np.sum(1 / normdiff)
T_nom = np.sum(X[:, mask] / normdiff, axis=1)

(warning: code untested)

@FlorianWilhelm
Contributor Author

Thanks for pointing this out, I fixed it in commit 3e8ca8c.
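
For reference, a self-contained sketch of such a vectorized modified Weiszfeld step (not the exact code from commit 3e8ca8c; it assumes X has shape (n_samples, n_features) and y is the current spatial-median estimate):

import numpy as np

def modweiszfeld_step(X, y, eps=1e-6):
    diff = X - y                                   # (n_samples, n_features)
    normdiff = np.sqrt(np.sum(diff ** 2, axis=1))  # distance of each sample to y
    mask = normdiff >= eps                         # drop samples coinciding with y
    eta = 1. if mask.sum() < X.shape[0] else 0.    # 1 if y equals one of the samples
    diff = diff[mask]
    normdiff = normdiff[mask][:, np.newaxis]
    r = np.linalg.norm(np.sum(diff / normdiff, axis=0))
    if r < eps:                                    # y is already a fixed point
        return y
    T = np.sum(X[mask] / normdiff, axis=0) / np.sum(1. / normdiff)
    return max(0., 1. - eta / r) * T + min(1., eta / r) * y

Iterating this step from, e.g., the coordinate-wise mean converges to the spatial median that Theil-Sen uses to aggregate the subsample solutions.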

@agramfort
Member

Can this method work with n_samples < n_features?
How does it compare to RANSAC, which we already have?

Thanks

  • Usage of fmin_bfgs instead of minimize.
  • _spatial_median, _breakdown_point and _modweiszfeld_step are now private, as suggested by agramfort.
  • Moved _lse before the TheilSen class.
  • Replaced usage of numpy.linalg by scipy.linalg, as suggested by agramfort.
  • Theil-Sen now falls back to least squares if the number of samples is smaller than the number of features.
  • Fixed the "zero length field name in format" ValueError under Python 2.6 in the TheilSen class.
  • Replaced the for-loop in linear_model.theilsen._modweiszfeld_step by array operations, as suggested by jnothman.
  • Instead of testing the various codepaths via the n_jobs of joblib, the functions _get_n_jobs and _split_indices of the theilsen class are now tested, as pointed out by agramfort.
@FlorianWilhelm
Contributor Author

For the n_samples < n_features case I made some modifications in commit 61a5195 to fall back to least squares, since this Theil-Sen implementation can be seen as a generalization of the ordinary least squares method anyway.
So far I have not found a direct comparison between RANSAC and Theil-Sen, but I am not that familiar with RANSAC, so I should have a deeper look into this subject. My impression so far is that RANSAC is a more heuristic approach coming from computer science, while Theil-Sen comes from robust statistics.

@chrisjordansquire

Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels? (My thinking is that scikit-learn is focused on predicting outputs, while statsmodels is focused on inference about parameters.)

@FlorianWilhelm
Contributor Author

@chrisjordansquire In order to predict anything you always have to train the parameters of your model, and quite often your training samples are less than perfect (i.e. they contain outliers, or the errors are not normally distributed). This is where non-parametric methods like Theil-Sen shine, since they do not assume the errors to be normally distributed.
The new RANSAC estimator also seems to fit into the class of robust estimators. From my understanding, it takes an estimator like LinearRegression and presents a proper subset of the samples to it in order to remove the negative effects of outliers. In this way you could also say that all RANSAC does is help LinearRegression infer the right parameters.
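
To make that comparison concrete, a minimal sketch of the two robust estimators side by side (illustrative data; both classes live in sklearn.linear_model):

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 1))
y = 3. * X.ravel() + rng.normal(size=100)
y[::10] += 20.  # inject gross outliers

# RANSAC repeatedly fits the wrapped estimator on random consensus subsets
ransac = RANSACRegressor(LinearRegression()).fit(X, y)
# Theil-Sen aggregates least-squares fits over many subsamples
theilsen = TheilSenRegressor(random_state=42).fit(X, y)
print(ransac.estimator_.coef_, theilsen.coef_)  # both should be near 3.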

@coveralls


Coverage remained the same when pulling b374e7e on FlorianWilhelm:theilsen into 64fd085 on scikit-learn:master.

@GaelVaroquaux
Member

Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels?

I am also hesitating on whether this estimator should go in scikit-learn, or whether we should push users to statsmodels. The reason that we have a RANSAC is that it is widely used in computer vision, where people really only care about its ability to fit and predict, and not about the corresponding p-values.

I am +0 on including it in scikit-learn: robust models are sometimes really useful for prediction. However, I would indeed like to hear a little discussion about the use cases people have in mind.

@FlorianWilhelm
Contributor Author

@GaelVaroquaux: I understand the hesitation to include yet another estimator, especially since RANSAC and Theil-Sen seem to open the gates for a new class of robust estimators, and we want scikit-learn to stay focused on its goals: an easy-to-use machine learning library with a consistent interface.
But for exactly that reason we are heavily using scikit-learn at Blue Yonder (http://www.blue-yonder.com/en/) to provide predictive analysis products to our customers. At its core, everything we do is about making good predictions with the help of machine learning. For our own algorithms we have adapted the scikit-learn interface in order to extend it, and that works really well for us.
In some projects we had to deal with suboptimal data from customers (outliers that are hard to detect), and there we applied Theil-Sen successfully. Since Theil-Sen is quite a popular and well-known algorithm, my idea was to include it directly in scikit-learn where others can use it too, and I got the okay from management to do so. I am convinced that Theil-Sen (although coming from statistics) is a robust machine learning algorithm that nicely complements Scikit-Learn whenever the data you get is suboptimal in various ways.

@sebastianneubauer

I'm a bit surprised, because I never noticed that there is a clean separation between statsmodels and scikit-learn. Clearly this issue is the wrong place, but I would very much appreciate a discussion about this. At least for me, I wouldn't be surprised at all to find a Theil-Sen implementation in scikit-learn (and another one in scipy and one in statsmodels ;-) ).
On the contrary, I heavily use scikit-learn and would be happy to be able to try Theil-Sen (e.g. for preprocessing/regularisation) without introducing another dependency (I pretty much never need to introduce statsmodels as a dependency) and with the same interface as the other algorithms.
Last point from me: if someone spends time and wants to contribute to scikit-learn, shouldn't the prior be +1 as long as there are no strong arguments against the contribution?
Therefore +1 from me, as long as there is no official "that's not scikit-learn" guideline/rule (and having one would be great!).

@agramfort
Member

To make your point I would compare performance and speed with the RANSAC we already have. Often computer scientists' hacks make things pretty efficient in practice... statisticians care much less about performance... in general... :)

@coveralls


Coverage increased (+0.06%) when pulling f92074a on FlorianWilhelm:theilsen into 031a3fc on scikit-learn:master.

@FlorianWilhelm
Contributor Author

@arjoly Regarding the new examples, I used the Unix time command and looked at the user time, since the script only finishes when I close the plotting windows.

  • plot_theilsen: 1.865s
  • plot_robust_fit: 6.152s

print(__doc__)

estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor()),
@arjoly
Member

the random state is missing here

@FlorianWilhelm
Contributor Author

@arjoly Sorry that I forgot that one, my bad :-/

@arjoly
Member

Thanks!
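
The fix presumably amounts to seeding the estimator in the example list quoted above, along these lines (the value 42 is illustrative):

from sklearn.linear_model import LinearRegression, TheilSenRegressor

estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor(random_state=42))]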

@arjoly
Member

arjoly commented Oct 17, 2014

I won't be able to find much time for this before next week. @ogrisel and @GaelVaroquaux, feel free to finish the review and merge before then if you want.

@FlorianWilhelm
Contributor Author

@arjoly @GaelVaroquaux What are the next steps? Do I need to ping someone like @larsmans in order to get this merged? The related PR #3764 also has [MRG+1].

@larsmans
Member

I know nothing about this estimator, so no.

@MechCoder
Member

Is this ready to go? @ogrisel and @GaelVaroquaux, I see +1s from you, and another implicit +1 from @arjoly!

@ogrisel
Member

ogrisel commented Nov 20, 2014

Both @GaelVaroquaux and I already gave a +1, and the last batch of comments by @arjoly seems to have been addressed, so let's merge.

ogrisel added a commit that referenced this pull request Nov 20, 2014
 [MRG+1] TheilSen robust linear regression
@ogrisel ogrisel merged commit f0fe4af into scikit-learn:master Nov 20, 2014
@ogrisel
Member

ogrisel commented Nov 20, 2014

Thanks again @FlorianWilhelm for the contribution (especially the effort in the doc and examples)!

@GaelVaroquaux
Member

Thanks heaps. This is a quality contribution!

@arjoly
Member

arjoly commented Nov 20, 2014

Congratulations @FlorianWilhelm 👍

@FlorianWilhelm
Contributor Author

Thank you all for helping me make my first Scikit-Learn contribution! It was a lot of work, but I had tons of fun :-)

@ogrisel
Member

ogrisel commented Nov 20, 2014

@FlorianWilhelm I forgot: can you please open a new PR to add an entry to the doc/whats_new.rst file? Please link your name either to your github account or a personal webpage of your choice at the end of the file.

@FlorianWilhelm
Contributor Author

@ogrisel I updated whats_new.rst in PR #3870.

@FlorianWilhelm FlorianWilhelm deleted the theilsen branch November 21, 2014 12:33
@amueller
Member

amueller commented Jan 8, 2015

Stupid question: is there a simple way to make this run fast? It is quite slow in the common tests, even on trivial datasets. I guess setting n_subsamples is the trick, but that is only possible if you know the number of samples, right?

@FlorianWilhelm
Contributor Author

@amueller You could also use max_subpopulation to bound the number of subsets that are considered. If you want to reduce the runtime with n_subsamples, you need to know the number of samples.
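
A sketch of both knobs (the values are illustrative, not recommendations):

from sklearn.linear_model import TheilSenRegressor

# Cap the number of subsets drawn when "n choose k" gets huge:
fast = TheilSenRegressor(max_subpopulation=1000, random_state=0)

# Or fix the subsample size directly, which requires knowing n_samples:
n_samples, n_features = 200, 5
fixed = TheilSenRegressor(n_subsamples=n_samples - n_features, random_state=0)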

@jnothman
Member

jnothman commented Jan 8, 2015

Would it be appropriate to allow n_subsamples as a float to be a proportion of the number of samples?

@FlorianWilhelm
Contributor Author

No, the complexity is $\binom{n_{samples}}{n_{subsamples}}$. If you consider Pascal's triangle, the closer you get to the center of a row, the larger the number is. So if you have n_features and fit a problem with n_samples including an intercept, the efficiency only starts to improve once n_subsamples is greater than or equal to n_samples - n_features.
What we could do is specify it as a ratio of the number of features. If you choose ratio=1.0 (the default), you have the complexity $\binom{n_{samples}}{n_{subsamples}}$. Otherwise we change n_subsamples in order to get the complexity $\binom{n_{samples}}{n_{samples} - ratio \cdot n_{features}}$.
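
A quick numeric illustration of the Pascal's triangle argument (assuming SciPy; n_samples=20 and n_features=2 are arbitrary):

from scipy.special import comb

n_samples, n_features = 20, 2
for n_subsamples in (n_features + 1, n_samples // 2, n_samples - n_features):
    print(n_subsamples, comb(n_samples, n_subsamples, exact=True))
# 3 1140
# 10 184756   <- the center of the row is the worst case
# 18 190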
