Draft implementation of non-parametric quantile methods (RF, Extra Trees and Nearest Neighbors) #19754
Thanks! This looks quite nice after a glance at qrf.
Concerns:
- We ordinarily avoid additional parameters to methods like predict. The exception has been `return_std` etc., and maybe this could fall under the same allowance.
- Should we support predicting at multiple q simultaneously? For an array-like of q, do we return a matrix/tensor of results, or a list of vectors/matrices? (Illustrated just after this list.)
- I don't know what current thoughts are on numba, but I think it would be a big step to introduce a hard dependency on it. Certainly, this requires a core developer vote according to our governance document. It might even deserve a vote if we make it a soft dependency. The outcome could be rewriting those functions in Cython.
- I'm somewhat concerned about maintaining two random forest implementations in parallel. I assume you've attempted to share as much code as possible, but we should also share tests where possible.
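For concreteness, a runnable sketch of the two return conventions being weighed, emulated with today's one-model-per-quantile gradient boosting workaround (the stacked-vs-list choice is the open question, not an existing API):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
q = [0.05, 0.5, 0.95]

# Today each quantile needs its own model; the question is what a single
# predict(X, q=q) call should return.
models = [GradientBoostingRegressor(loss="quantile", alpha=q_i).fit(X, y) for q_i in q]

stacked = np.stack([m.predict(X) for m in models])  # option 1: array, shape (3, 200)
as_list = [m.predict(X) for m in models]            # option 2: list of 3 vectors
```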
Maybe you can attend our core devs meeting on Monday to raise some of these. (Advertised on our mailing list. I would be more helpful if I were at my desk!)
Thanks for having a look at the material! The concerns that you raise are indeed issues that would require some more thought. For now I just wanted to get an opinion on the API and to see if you would be interested in integrating these methods. If so, it would still require some work to smooth over some of these issues (hence the draft PR). On the concerns: I will try to attend the meeting on Monday to discuss them in more detail.
@jasperroebroek Thank you for this draft PR. Some more options for quantile regression would certainly be beneficial.
@lorentzenchr thank you for taking the time to review. So, let me try to answer your questions/remarks. Quantile Regression Forests are indeed fitted with exactly the same approach as a normal Random Forest would be. From there it does some postprocessing, but the core is the same (with the MSE splitting criterion etc.). In his paper, Meinshausen (see the link above) describes why, and under which conditions, this yields credible conditional quantiles. In short: pretty much all the time. After testing this a little, I believe that there are, as you describe, some constraints on the use of this methodology: locally, the mean should at least not be strongly negatively correlated with the quantile that you are trying to predict. Issue #18849, comparing QRF and gradient boosting, shows some figures comparing the different approaches on synthetic sinusoidal data (#18849 (comment)). It shows that the methodology works quite well for both the upper and lower quantiles, even without any hyperparameter selection. (PS: I just realised you were the one commenting on the post, so ....) In contrast, a Random Forest could be trained with a quantile loss function, which would probably yield more accurate conditional quantiles; QRFs are likely faster and more flexible. I think it would be great to have them both in sklearn, so the user can choose between these tradeoffs.
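To make the postprocessing step concrete, here is a minimal (and deliberately slow) NumPy sketch of Meinshausen-style quantile prediction on top of a standard MSE-trained forest; the `forest_quantile` helper is a hypothetical illustration, not the Cython code in this PR:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def forest_quantile(forest, X_train, y_train, X_new, q):
    train_leaves = forest.apply(X_train)  # leaf id per (sample, tree)
    new_leaves = forest.apply(X_new)
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    preds = np.empty(len(X_new))
    for i, leaves in enumerate(new_leaves):
        # Meinshausen's weights: 1 / leaf size for co-leaf training samples,
        # averaged over the trees of the forest.
        weights = np.zeros(len(y_train))
        for t, leaf in enumerate(leaves):
            in_leaf = train_leaves[:, t] == leaf
            weights[in_leaf] += 1.0 / in_leaf.sum()
        cdf = np.cumsum(weights[order]) / weights.sum()
        preds[i] = y_sorted[np.searchsorted(cdf, q)]
    return preds

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
y_q05 = forest_quantile(rf, X, y, X[:5], q=0.05)  # 5% conditional quantiles
```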
After our discussion on Monday I have tried to take the first steps in the cythonisation of the code. Although the code that I just pushed runs and produces satisfactory results, it is terribly slow in comparison to the previous version (which used numba for compilation). I believe it is now even slower than the original python/numpy implementation. What I have:
What is still missing:
I have put a lot of comments and thoughts in the code, so hopefully it is still intelligible. In short: I would really need some help from someone with more Cython experience to make this work. Would someone have the time and energy to go through this?
Another update on the quantile regression forests. I have converted the code completely to Cython and changed the API as discussed (by not passing the quantiles to the predict function, but setting them beforehand). It currently runs smoothly and does what it is supposed to do, including predicting an array-like of quantiles simultaneously. If you agree on the API and implementation, I would be ready to switch to a regular PR (instead of a draft one).
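Under that API, usage would look roughly like this (a sketch: the `sklearn.ensemble` import path and the `q` parameter follow this PR, while the exact multi-quantile output layout is assumed, not a settled scikit-learn convention):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestQuantileRegressor  # added by this PR

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# Quantiles are fixed on the estimator up front rather than passed to predict.
model = RandomForestQuantileRegressor(q=[0.05, 0.5, 0.95], n_estimators=100)
model.fit(X, y)
y_pred = model.predict(X)  # one set of predictions per requested quantile
```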
Another update on QRF. I added the mean pinball loss as the score function of the Random Forest Quantile Regressor. I have also uploaded an example of the QRF in action. Note that this is almost a direct copy of the new example of Gradient Boosting Quantile Regression. The workflow, results and conclusions are identical, and the final resulting coverage fraction is even better reproduced by the QRF. I have changed the status of this PR from draft to regular. Happy to discuss further what needs to be added before it can be accepted.
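For reference, the metric in question is available as `sklearn.metrics.mean_pinball_loss`; a small self-contained example, where `alpha` is the target quantile:

```python
import numpy as np
from sklearn.metrics import mean_pinball_loss

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([0.9, 2.4, 2.0])

# Under-predictions are weighted by alpha and over-predictions by 1 - alpha,
# so a well-calibrated 5% quantile should sit below most observations.
mean_pinball_loss(y_true, y_pred, alpha=0.05)  # lower is better
```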
I tried to install the package (in a conda env with git clone & python setup.py install), but get an error. Should that work in principle?
I believe that is caused by an incorrect compilation of the Cython source files. Not sure why that happens. The workaround I have been using is to compile the right files after the installation of sklearn. To do this, create a second setup file next to the sklearn setup file and call it `setup_qrf.py`. Put the compilation code for the QRF extension inside (this is an example, I haven't tested this):
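The original snippet was lost in extraction; what follows is a reconstruction in the same untested-sketch spirit, with the `.pyx` path assumed from this PR's layout:

```python
# setup_qrf.py -- untested sketch: compile just the QRF Cython extension.
# The .pyx path is assumed from this PR's layout (sklearn/ensemble/_qrf.pyx).
import numpy
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "sklearn.ensemble._qrf",
        ["sklearn/ensemble/_qrf.pyx"],
        include_dirs=[numpy.get_include()],
    )
]

setup(name="sklearn-qrf-ext", ext_modules=cythonize(extensions))
```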
In the terminal, compile with `python setup_qrf.py build_ext --inplace`. Now you should be able to import the RandomForestQuantileRegressor class. Let me know if this works for you!
@jasperroebroek on Windows I am getting the error `Buffer dtype mismatch, expected 'long' but got 'long long'`. It seems like there are other packages with similar issues, where they had to resort to the libc default type instead of numpy's.
Still there, but now: `File "sklearn\ensemble\_qrf.pyx", line 547, in sklearn.ensemble._qrf._RandomSampleForestQuantileRegressor.fit`
@jasperroebroek I found the previous issue though: it was `idx` in the fit of your sample method. It needs to be changed to `idx = np.arange(self.n_samples_, dtype=np.int64)[mask]`, and it no longer gave me an issue. I just tested with `np.arange(self.n_samples_, dtype=np.intp)[mask]` and that version also works, and seems slightly faster.
Interesting: the default method is now faster than the sample method, which you say should be faster. Was this fixed? Not sure there is a need for two versions if the default is faster now.
Never mind, it seems multiple quantiles slow down the default method compared to the sample method on predict... I wonder if there is a way to improve this, because the results seem more accurate with the default.
@ryancinsight So if I understand it correctly, to solve the issue only line 539 needs to change into `idx = np.arange(self.n_samples_, dtype=np.intp)[mask]`? Anyway, speaking about the performance difference between the two implementations: the default is more precise, but becomes prohibitively slow on datasets of between 10000 and 100000 samples. In my benchmark runs it is actually faster than the sampling approach for small datasets (100s/1000s of samples). With more than one quantile, the sampling method was almost always faster than the default implementation. Do you confirm these observations? Thanks for checking out the code!
@jasperroebroek that is correct; with the dtype on line 539 set, everything now runs smoothly on Windows 10. I can also confirm your observations on performance. Best regards, great work!
Hello, I see that this thread has not been updated since July. Is this implementation of QRF planned to be integrated into sklearn in the future? It would be very useful. Thanks. Nicolas
For those interested: I have published these methods as a separate package (sklearn-quantile) as a temporary place for development.
Thanks for doing this. To increase adoption, it would be worth publishing a release on PyPI & conda-forge, as well as publishing the rendered HTML doc on gh-pages, for instance. Also, you should better highlight in the README when those methods are more interesting than quantile gradient boosting: in particular, the fact that they are more efficient for people who want to estimate many quantiles at once for a given input (e.g. all the percentiles, to get some coarse estimate of the CDF of the predictive distribution). I think it would be interesting to get those methods upstream in scikit-learn at some point, but I must confess we are running a bit short on reviewer bandwidth to work on such a big new refactoring.
I have now pushed the project to PyPI, so it should be installable with pip. I am still trying to figure out how to get it on conda-forge too. Documentation (although really barebones) is available at https://sklearn-quantile.readthedocs.io/en/latest/?badge=latest. I hope that this is enough for some initial testing! The code has now significantly diverged from this pull request, but it still very much follows the structure of scikit-learn, so it should be relatively straightforward to get it into the main scikit-learn repository if needed.
First of all, awesome work! I have been looking for a replacement for scikit-garden, which has been inactive for a few years and is falling behind on scikit-learn (compatible with <0.23 only). I have questions regarding the predictions made using the 2 models (Random Forest from sklearn-quantile / scikit-garden, versus Gradient Boosting from scikit-learn / LightGBM). Indeed, these predictions seem to be very different from one another. Here's some code to demonstrate this:

```python
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn_quantile import RandomForestQuantileRegressor

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

quantile = 0.05

# Gradient Boosting Conditional Mean Model
# Models have different hyper-parameters to have similar test and train scores.
param_gb_mean = {"n_estimators": 100, "max_depth": 4, "min_samples_leaf": 60, "max_features": None, "random_state": 1}
gb_mean = GradientBoostingRegressor(**param_gb_mean)
gb_mean.fit(X_train, y_train)
y_mean_gb = gb_mean.predict(X_test)

# Gradient Boosting Quantile Model
param_gb_quantile = {"loss": "quantile", "alpha": quantile, "n_estimators": 100, "max_depth": 4, "min_samples_leaf": 60, "max_features": None, "random_state": 1}
gb_quantile = GradientBoostingRegressor(**param_gb_quantile)
gb_quantile.fit(X_train, y_train)
y_quantile_gb = gb_quantile.predict(X_test)

# Random Forest Conditional Mean Model
param_rf_mean = {"n_estimators": 100, "max_depth": 10, "min_samples_leaf": 10, "max_features": 5, "random_state": 1}
rf_mean = RandomForestRegressor(**param_rf_mean)
rf_mean.fit(X_train, y_train)
y_mean_rf = rf_mean.predict(X_test)

# Random Forest Quantile Model
param_rf_quantile = {"q": quantile, "n_estimators": 100, "max_depth": 10, "min_samples_leaf": 10, "max_features": 5, "random_state": 1}
rf_quantile = RandomForestQuantileRegressor(**param_rf_quantile)
rf_quantile.fit(X_train, y_train)
y_quantile_rf = rf_quantile.predict(X_test)
```

Now you will note that while the average predicted conditional mean values are very similar for the two models, the average predicted quantiles differ considerably. According to the pinball quantile loss, which is the most commonly used quantile regression metric, the performance of the 2 models is equivalent (slightly better for the GB model in this case, but better for the RF model in the case of my dataset of interest, after spending some time optimizing each model). More significantly, the shape of the predicted quantiles as a function of the predicted conditional mean is wildly different, both for this dataset and mine (banana shape for the RF model).

For the Gradient Boosting model:

```python
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y_mean_gb, y=y_quantile_gb, alpha=0.1, s=10, linewidth=0)
plt.plot([0, 6], [0, 6], c="black", dashes=[5, 5])
plt.xlim(0, 6)
plt.ylim(0, 6)
plt.xlabel("Conditional Mean")
plt.ylabel("5% Quantile")
sns.despine()
```

For the Random Forest model:

```python
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y_mean_rf, y=y_quantile_rf, alpha=0.1, s=10, linewidth=0)
plt.plot([0, 6], [0, 6], c="black", dashes=[5, 5])
plt.xlim(0, 6)
plt.ylim(0, 6)
plt.xlabel("Conditional Mean")
plt.ylabel("5% Quantile")
sns.despine()
```

I was thinking that this is not due to the difference in the type of model itself (Random Forest vs Gradient Boosting: the conditional mean models are very similar) but rather to how the quantile models are built. My understanding is that:

- the Gradient Boosting quantile model is trained directly on the pinball loss for the requested quantile, so each quantile requires fitting a separate model;
- the Random Forest quantile model is trained with the usual MSE criterion, and the quantiles are computed afterwards from the distribution of training targets in the relevant leaves.

Can we explain why these quantile predictions are so different? Why does the Gradient Boosting model routinely predict average quantiles 10% larger than the Random Forest model? @lorentzenchr, this may also interest you since it is somewhat related to a debate we had a year ago.
Hi @Bougeant, thanks for reaching out and testing the package. I am very glad to hear that you have managed to install and run it without any issues! Concerning the difference between Gradient Boosting quantile regression and Random Forest quantile regression: I am not really well versed in how Gradient Boosting works under the hood, except that it uses the pinball loss for training. If your list is accurate, it wouldn't surprise me that the lower quantiles in GB predict higher values than RF. Averaging predicted quantiles will always be more conservative (closer to the mean) in comparison to calculating the quantile over all data informing the prediction. (That would suggest lower GB predictions for high quantiles.) To test this hypothesis, one might create both a GB and an RF model with equal parameters and only one tree (see the sketch below). In this case, the only difference between the models should be in the training phase, one optimising MSE and the other the lowest pinball loss. The difference between training on MSE or pinball loss very much depends on the structure of the data. Locally, i.e. where the algorithm tries to split the data, both algorithms are expected to yield similar results as long as the quantile you are looking at is effectively positively correlated with the mean. In many cases this assumption is approximately correct, but not always. In those cases where it is not, training on the pinball loss would be much more accurate (which might, for example, be the case at the end of the banana shape). To test the accuracy of both models, you might also have a look at K-Nearest Neighbours quantile regression (also available in sklearn-quantile), which does not use bootstrapping to come to a conclusion.
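A possible shape of that one-tree experiment, reusing X_train and y_train from the snippet above (whether RandomForestQuantileRegressor accepts `bootstrap=False` is an assumption carried over from the RandomForestRegressor API):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn_quantile import RandomForestQuantileRegressor

q = 0.05
shared = {"max_depth": 10, "min_samples_leaf": 10, "random_state": 1}

# One boosting stage with learning_rate=1.0 is roughly a single tree trained
# on the pinball loss (plus the initial constant estimator).
gb_one = GradientBoostingRegressor(
    loss="quantile", alpha=q, n_estimators=1, learning_rate=1.0, **shared
)
# A single MSE-trained tree, with the quantile computed from its leaves.
rf_one = RandomForestQuantileRegressor(q=q, n_estimators=1, bootstrap=False, **shared)

gb_one.fit(X_train, y_train)
rf_one.fit(X_train, y_train)
# Any remaining gap between the two predictions now reflects the training
# criterion (pinball loss vs MSE) rather than the ensembling.
```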
Great suggestion to try models with one tree. Let me try that.
Another update on this project. It just got accepted on conda-forge, so it now works with conda!
@jasperroebroek out of curiosity, do you use honest trees in your implementation (in-tree validation sets for quantile estimation)? It seems that this can help get better calibration for quantile estimators. See for instance: #19710
This implementation currently uses the default implementation of the random forest and extra trees algorithms. This is according to the original research paper by Meinshausen. Undoubtedly, using quantile information at training time would give improved accuracy, but this would likely be (much) slower, and predictions couldn't be done with the same model for different quantiles.
Adding some thoughts, because this PR is of interest to me. This PR's maintenance burden can most likely benefit from the PRs proposed to refactor the trees:
Then, instead of a quantile forest Cython file, we could just create a quantile forest class on top of the refactored tree base classes.
The code could indeed partially benefit from an update to the Splitter. The code that I have sent here contains two different strategies to calculate conditional quantiles with random forests. The first and most precise one, proposed by the original paper, calculates the quantiles after the random forest is trained with the standard Criterion and Splitter. It could be beneficial here to store all values per leaf, rather than only the mean, but the quantile calculation would still need to happen at prediction time, and it would make sense to me to keep it in its own class. The second method takes advantage of sampling the values at leaf scale, which could be done in the Splitter and would be much more efficient. The Criterion would still remain the same, I assume. In this method, the quantiles are calculated on the outputs of the different trees (see the sketch below). I hope this information helps!
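A rough pure-Python illustration of that second strategy (names are hypothetical; the sampling here is unweighted, unlike the R package, and the real implementation would live in Cython):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# At "fit" time: for each tree, draw one training target per leaf.
sampled = []
for tree in forest.estimators_:
    leaf_ids = tree.apply(X)
    sampled.append({leaf: rng.choice(y[leaf_ids == leaf]) for leaf in np.unique(leaf_ids)})

# At predict time: each tree answers with its sampled leaf value, and the
# forest-level quantile is a plain quantile over the per-tree outputs.
def predict_quantile(X_new, q):
    outputs = np.array([
        [values[leaf] for leaf in tree.apply(X_new)]
        for tree, values in zip(forest.estimators_, sampled)
    ])
    return np.quantile(outputs, q, axis=0)

y_q05 = predict_quantile(X[:5], q=0.05)
```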
@jasperroebroek would you mind having a look at the following PR? I am proposing to separate the API for setting leaf/split nodes in the current Tree Cython code. I have also included a very short pseudo-code demo of the potential quantile-tree implementation. Let me know if this looks like the right direction to you.
I just had a look at the code that you wrote for the quantile tree. I am a bit confused by the predict function that you define there. While this, in and of itself, might be valid, the prediction of the quantile regression forest is not based on the idea of an ensemble of predictions by its individual trees, as is the case in standard regression forests. Rather, it pulls all values of the relevant leaves of all trees (and their weights) and calculates a weighted quantile on them. I don't see any other way than doing this operation in the quantile regression forest class itself.
Yes, this is the case of just reproducing a single quantile tree. However, with a single quantile tree that also optionally exposes the leaf values, the ensemble method should be nice and short. Rather, I'm asking, in terms of the API, whether or not you think this will fit the scope of what is needed by a quantile tree (forest). The main goal is to enable a quantile tree (forest) that simply needs to rewrite a handful of functions, while the rest of the tree-building etc. code lives in the base classes.
Yep, you're correct. The eventual goal would be to do this, but at the forest level.
This definitely sounds like an improvement over how I initially implemented it, and could provide some nice speed improvements. I also implemented the sampling method used in the R package, which takes a weighted random value at leaf level and calculates the quantile at forest level. Although this is an approximation, it would be so much faster that in most cases it would be the preferred algorithm. When implemented as a Tree class, that would be trivial, I suppose. If you need further input or help with the implementation, I am available.
I think having this as a separate package is ideal (which we already have), and I think with the work @adam2392 has been doing to make some parts of the codebase more extensible, we could close this one. Happy to reopen if y'all think otherwise.
For completeness: linking the package I've come to develop to shore up some of these gaps (maybe in the future some aspects of it will make it into scikit-learn): https://github.com/neurodata/scikit-tree. Also note: https://github.com/zillow/quantile-forest
Dear scikit-learn maintainers and contributors,
this PR aims to provide a draft implementation of Random Forest Quantile Regression (https://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf) and KNN Quantile Regression (https://www.tandfonline.com/doi/abs/10.1080/02664763.2015.1070807?journalCode=cjas20&journalCode=cjas20). Both methods are widely used and would be an interesting addition to scikit-learn, as they can estimate conditional quantiles without having to fit each quantile separately (thus providing significant speedups). I started the development of these algorithms from the implementation in scikit-garden, which no longer seems to be maintained.
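As a rough illustration of the KNN variant, a minimal unweighted sketch (`knn_quantile` is a hypothetical helper, not this PR's estimator):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import NearestNeighbors

def knn_quantile(X_train, y_train, X_new, q, k=20):
    # The conditional quantile at x is the empirical quantile of the
    # targets of its k nearest training neighbours.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_new)  # neighbour indices, shape (n_new, k)
    return np.quantile(y_train[idx], q, axis=1)

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
y_q05 = knn_quantile(X, y, X[:5], q=0.05)
```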
Reference Issues/PRs
Fixes #11086
Takes some steps towards #18997
What does this implement/fix? Explain your changes.
Any other comments?
This is very much a draft PR. Although the code is documented, I have not yet created the actual documentation for these additional algorithms (as the API is not yet fixed). Additionally, if these methods are approved, the partial_dependence functions need to be changed (to pass the quantile parameter to the predict function).
Note that some parts are currently optimised with numba, rather than Cython. I am still attempting to do this conversion myself, but any help on this would be great.