[MRG] Add quantile regression #9978
Conversation
Thanks. You'll need a lot of patience for core devs to review this in detail, I suspect. We have a number of pressing API issues as well as highly requested features that have been in the reviewing queue for some time. But I hope we'll get to it soon enough!
@avidale Thanks a lot for your contribution! @jnothman has already outlined the broader picture, but FWIW here are a few pointers from a wanderer for when the core devs become available. First, I would suggest that you look into resolving the CI test failures. After that, you might want to develop the PR to a point where you feel comfortable changing its status from [WIP] to [MRG]. (Of course, this is of no use if you actually need some comments/ideas before you can work any further.) In my limited experience, [WIP]s are usually not prioritized for review (but don't quote me on this ;)), so you might want to consider the change. Finally, when you get to that point, you might want to tag some of the core devs who participated in the original discussion, since some of them might have missed the initial post.
This is a great addition to scikit-learn. I would personally really like to see it merged.
- You have test failures to deal with.
- Some parameters have not been tested, such as l1_ratio.
- Please add the class to doc/modules/classes.rst.
- We no longer use assert_true or assert_greater; just use bare assert (see the sketch after this list).
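For instance, a self-contained illustration with placeholder data (not code from this PR):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model and data, purely to illustrate the assert style.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
model = LinearRegression().fit(X, y)
score = model.score(X, y)

# Old style, removed from scikit-learn's test utilities:
#   assert_true(model.coef_.ndim == 1)
#   assert_greater(score, 0.5)
# Preferred bare asserts:
assert model.coef_.ndim == 1
assert score > 0.5
```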
Under Travis CI there is another failure: the […]
Current solution: just changed the solver from […]
I have had a go with a real training set as follows: X is a pandas DataFrame of shape [61296 rows x 2846 columns], i.e. 2846 variables and 61296 measurements of each. I run into the following problem in `_quantile.py`. It occurs when `np.eye(n_mask)` tries to create a 61296 x 61296 identity matrix. Is a more memory-efficient implementation possible?

```
File "", line 1, in
File "\sklearn\linear_model\_quantile.py", line 217, in fit
File "\site-packages\numpy\lib\twodim_base.py", line 209, in eye
MemoryError: Unable to allocate 28.0 GiB for an array with shape (61296, 61296) and data type float64
```
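The 28.0 GiB figure is consistent: a dense float64 identity matrix of that size needs 61296 × 61296 × 8 bytes ≈ 30.1 GB ≈ 28.0 GiB. A minimal sketch of the failure mode (the arithmetic is exact; don't run the last line on a small machine):

```python
import numpy as np

n_samples = 61296
# n_samples**2 float64 entries at 8 bytes each:
needed_gib = n_samples ** 2 * 8 / 2 ** 30
print(f"{needed_gib:.1f} GiB")  # ~28.0

I = np.eye(n_samples)  # raises MemoryError unless ~28 GiB of RAM is free
```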
@RPyElec do you know of a solver that would handle problems of this size? If so, what optimization method does it use?
The book reference could be nice, I think.
I'm not an expert on solvers, I'm afraid. I have been running the quantile regression problem above with the implementation here: https://pypi.org/project/asgl/. I have an academic license for MOSEK/Gurobi as the LP solver, but have also had a go with the free solvers it comes with, which also work. (Scikit-learn is better maintained, so it is preferable.)
Is there some incremental/online solver? |
You managed to make it work with asgl? It seems to use cvxopt/cvxpy, which makes me skeptical that it scales. Can you share more details?
@RPyElec Cool that you already gave this PR a try and provided feedback. Once this PR is merged, there might be room for improvements, in particular for a problem/feature matrix like yours. If you urgently need a solution right now, you could try the R package https://cran.r-project.org/package=quantreg, though I don't know if it will work on your problem.
@lorentzenchr - understood! Looking forward to seeing this get pushed too. I have a more powerful machine I can run the current implementation on (and ASGL too), so no worries there. I was just flagging the large identity matrices being created as a potential issue. @agramfort - I'm trying to put something together that I can use as an example and will get back to you.
QR examples.txt

If you run with MOSEK, then in asgl.py you will need to add a MOSEK section to `_cvxpy_solver_options` (it might be necessary to uncomment the num_threads line, depending on your setup). I would also recommend adding a print statement in `def lasso` at three locations (again in asgl.py).
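The referenced snippet did not survive extraction. Purely as an illustration of what such an entry could look like: `solver` and `mosek_params` follow cvxpy's documented `problem.solve()` keywords and `MSK_IPAR_NUM_THREADS` is a documented MOSEK parameter, but the exact shape asgl expects in `_cvxpy_solver_options` is an assumption:

```python
# Hypothetical sketch only: the real hook lives inside asgl's
# _cvxpy_solver_options and depends on the installed asgl version.
mosek_options = {
    "solver": "MOSEK",
    # "mosek_params": {"MSK_IPAR_NUM_THREADS": 1},  # uncomment to pin threads
}
```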
@RPyElec @glemaitre I think that for quantile regression it is very easy to implement a naive gradient-descent-based solver that is memory-efficient and relatively fast on large datasets. The quantile (pinball) loss is piecewise linear: for a residual r and quantile level τ it is ρ_τ(r) = τ·max(r, 0) + (1 − τ)·max(−r, 0). Here is a Colab notebook with my proof-of-concept implementation that trains on a 60,000 x 3,000 dataset in 20 seconds. If you think that this direction is correct, maybe we can include a solver like this in a future version of QuantileRegressor?
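A minimal sketch of such a (sub)gradient solver — the notebook's code is not reproduced here, and the step-size schedule and intercept warm start are illustrative choices:

```python
import numpy as np

def quantile_subgradient_descent(X, y, tau=0.5, lr=1.0, n_iter=1000):
    """Plain subgradient descent on the pinball loss (proof-of-concept sketch)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = np.quantile(y, tau)  # warm-start the intercept at the empirical quantile
    for t in range(n_iter):
        r = y - X @ w - b
        # A valid subgradient of the loss w.r.t. predictions:
        # -tau where r > 0, (1 - tau) otherwise.
        g = np.where(r > 0, -tau, 1.0 - tau)
        step = lr / np.sqrt(t + 1)  # diminishing steps, standard for subgradient methods
        w -= step * (X.T @ g) / n_samples
        b -= step * g.mean()
    return w, b

# Toy usage: median regression on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(size=1000)
w, b = quantile_subgradient_descent(X, y, tau=0.5)
```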
> @RPyElec @glemaitre I think that for quantile regression it is very easy to implement a naive gradient descent based solver

The pinball loss is non-differentiable. It is not at all clear to me that a gradient-descent solver will converge (I would rather say that there are good theoretical arguments, and empirical evidence on non-smooth problems in learning, that standard gradient methods will not converge to a good accuracy).
@RPyElec it would make our life easier if you shared code snippets, e.g. in https://gist.github.com/, so we have full control if needed, and if you could share a branch from your fork of the asgl project so we don't have to apply the patch manually. @avidale as @GaelVaroquaux said above, it's a non-smooth problem. What you do is a "sub-gradient" descent, which can be quite slow (known theoretical rates). You could add your solver to https://github.com/benchopt/benchmark_quantile_regression to actually compare the solvers objectively. WDYT?
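For background on those rates (a standard textbook bound, not from this thread): on a convex, Lipschitz, non-smooth objective f, the subgradient method with diminishing steps only guarantees

$$\min_{k \le t} f(x_k) - f^{\star} = O\!\left(\frac{1}{\sqrt{t}}\right),$$

so reaching accuracy ε takes on the order of 1/ε² iterations, which is why plain (sub)gradient solvers tend to stall at moderate accuracy on the pinball loss.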
This seems a good idea. Now, my question is: do we want to merge the current version with solvers that work reasonably well for in-memory problems, or do we want to also benchmark solvers that are optimal for online learning on the offline problems? In short, do we merge now and improve the estimator, or by doing so are we putting ourselves into trouble?
+1 for merging now (or else I go crazy - to share my feelings, too :smirk:).
OK so LGTM. I will open a subsequent issue to address the point raised in the discussion. |
🍻 🎉 |
Thanks to all contributors |
Hi! The […]
Open an issue when you make a feature request; a closed or merged issue/PR will not attract any attention. On the topic, it might be better to do a pipeline with a Nystroem transformer followed by a QuantileRegressor: it will scale better.
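A minimal sketch of that pipeline, assuming a scikit-learn version that ships QuantileRegressor; the dataset and hyperparameters are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import QuantileRegressor
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Approximate kernel feature map, then a linear quantile model on top;
# memory stays at O(n_samples * n_components) instead of O(n_samples**2).
model = make_pipeline(
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    QuantileRegressor(quantile=0.9, alpha=1e-4),
)
model.fit(X, y)
print((model.predict(X) >= y).mean())  # close to 0.9 by construction
```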
Well, it already got more than zero engagement 😅! I was not aware of this, so should I make an issue for the PR as well? Also, thanks for the tip with the transformer, I'll check it out! On the other hand, maybe L2 regularization would be more desirable? The QuantileRegressor only does L1, while support vector regression does L2.
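For context on that last point: the estimator as merged minimizes the pinball loss from earlier in the thread plus an L1 penalty on the coefficients (α being the regularization strength), which is why only L1 is on offer:

$$\min_{w,\,b}\;\frac{1}{n}\sum_{i=1}^{n}\rho_\tau\!\left(y_i - x_i^\top w - b\right) + \alpha\,\lVert w \rVert_1$$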
This PR fixes issue #3148
This new feature implements quantile regression: a linear model fitted by directly minimizing the pinball loss, which for the median (quantile 0.5) amounts to minimizing the mean absolute error.
The work is still in progress, but I would like to receive some feedback.