-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] TheilSen robust linear regression #2949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
res = np.zeros(y.shape) | ||
T_nom = np.zeros(y.shape) | ||
T_denom = 0. | ||
for x in X.T: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to vectorize this loop (i.e. not use a Python for
loop), with something like:
diff = X.T - y
normdiff = norm(diff, axis=1)
mask = normdiff >= 1e-6
if mask.sum() < X.shape[1]:
eta = 1.
diff = diff[mask, :]
normdiff = normdiff[mask]
res = np.sum(diff / normdiff, axis=0)
T_denom = np.sum(1 / normdiff, axis=0)
T_nom = np.sum(X / normdiff, axis=1)
(warning: code untested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I fixed it in commit 3e8ca8c.
can this method work with n_samples < n_features? thanks |
Usage of fmin_bfgs instead of minimize.
- _spatial_median, _breakdown_point and _modweiszfeld_step are now private as suggested by agramfort. - moved _lse before TheilSen class.
Replaced usage of numpy.linalg by scipy.linalg as suggested by agramfort.
Theil-Sen will fall back to Least Squares like if number of samples is smaller than the number of features.
Fix of "zero length field name in format" ValueError due to Python 2.6 in TheilSen class.
Replaced the for-loop in linear_model.theilsen._modweiszfeld_step by array operations as suggested by jnothman
Instead of testing the various codepaths with n_jobs of joblib, now the functions _get_n_jobs and _split_indices of the theilsen class are tested as pointed out by agramfort.
For the n_samples < n_features case it did some modifications in commit 61a5195 in order to fall back to least squares in this case as this Theil-Sen implementation can be seen as a generalization of the ordinary least squares method anyways. |
Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels? (My thinking is scikit-learn is focused on predicting outputs while statsmodels is focused on inferences about parameters.) |
@chrisjordansquire In order to predict anything you always have to train the parameters of your model and quite often your training samples are less than perfect (i.e. having outliers or if the error is not normally distributed). This is where non-parametric methods like Theil-Sen shine since they do not assume the errors to be normally distributed. |
I am also hesitating on whether this estimator should go in scikit-learn, I am +0 on including it in scikit-learn: robust models are sometimes |
@GaelVaroquaux: I understand the hesitation to include yet another estimator especially since RANSAC and Theil-Sen seem to open the gates for a new class of robust estimators and we want scikit-learn to be focused on its goals: An easy-to-use machine learning library with a consistent interface. |
I'm a bit surprised because I never noticed that there is a clean separation between statsmodel and scikit-learn. Clearly this issue here is the wrong place, but I would pretty much appreciate a discussion about this. At least for me, I wouldn't be surprised at all when I would find a Theil-sen implementation in scikit-learn (and another one in scipy and one in statsmodel ;-) ) |
to make your point I would compare performance and speed with the RANSAC we already have. Often computer scientists hacks make things pretty efficient in practice... statisticians care much less about performance ... in general ... :) |
@arjoly Regarding the new examples, I used the Unix
|
print(__doc__) | ||
|
||
estimators = [('OLS', LinearRegression()), | ||
('Theil-Sen', TheilSenRegressor()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the random state is missing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arjoly sorry, that I forgot that one, my bad :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
I won't be able to find much time for this before next week. @ogrisel and @GaelVaroquaux feel free to finish the review and merged before if you want. |
@arjoly @GaelVaroquaux What are the next steps? Do I need to ping someone like @larsmans in order to get this merged? The related PR #3764 has also [MRG+1]. |
I know nothing about this estimator, so no. |
Is this ready to go? @ogrisel and @GaelVaroquaux I see +1's from you, and another implicit +1 by @arjoly ! |
Both I and @GaelVaroquaux already gave +1 and the last batch of comments by @arjoly have been seemingly addressed, let's merge. |
[MRG+1] TheilSen robust linear regression
Thanks again @FlorianWilhelm for the contribution (especially the effort in the doc and examples)! |
Thanks heaps. This is a quality contribution! |
Congratulation @FlorianWilhelm 👍 |
Thank you all for helping me making my first Scikit-Learn contribution! It was a lot of work but I had tons of fun :-) |
@FlorianWilhelm I forgot: can you please open a new PR to add an entry to the |
Stupid question: is there a simple way to make this run fast? It is quite slow in the common tests, even on trivial datasets. I guess setting |
@amueller You could also use |
Would it be appropriate to allow |
No, the complexity is |
A multiple linear Theil-Sen regression for the Scikit-Learn toolbox. The implementation is based on the algorithm from the paper "Theil-Sen Estimators in a Multiple Linear Regression Model" of Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang. It is parallelized with the help of joblib.
On a personal note, I think that the popular Theil-Sen regression would be a nice addition to Scikit-Learn.
I am looking forward to your feedback.
Florian