
Draft implementation of non-parametric quantile methods (RF, Extra Trees and Nearest Neighbors) #19754


Closed · wants to merge 17 commits

Conversation

@jasperroebroek (Author)

Dear scikit-learn maintainers and contributors,

This PR aims to provide a draft implementation of Random Forest Quantile Regression (https://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf) and KNN Quantile Regression (https://www.tandfonline.com/doi/abs/10.1080/02664763.2015.1070807?journalCode=cjas20&journalCode=cjas20). Both methods are widely used and would be an interesting addition to scikit-learn, as they can estimate conditional quantiles without having to fit each quantile separately (thus providing significant speedups). I started the development of these algorithms from the implementation in scikit-garden, which no longer seems to be maintained.

Reference Issues/PRs

Fixes #11086
Takes some steps of #18997

What does this implement/fix? Explain your changes.

  • Random Forest Quantile Regression
  • Extra Trees Quantile Regression (which uses the same implementation as QRF)
  • KNN quantile regression
  • A weighted_quantile function, which aims to mimic np.quantile with the inclusion of a weight parameter (a sketch of the idea follows after this list).
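
As an illustration of the weighted_quantile idea, here is a minimal NumPy sketch (not the PR's actual compiled implementation): the quantile is read off a weighted empirical CDF, so uniform weights approximately reproduce np.quantile.

```python
import numpy as np

def weighted_quantile(values, q, weights):
    # Sort values and accumulate their weights into an empirical CDF.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    # Linear interpolation between data points, as np.quantile does.
    return np.interp(q, cdf, values)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(weighted_quantile(x, 0.5, np.ones_like(x)))  # 2.5 (matches np.quantile here;
print(np.quantile(x, 0.5))                         # agreement is approximate in general)
```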

Any other comments?

This is very much a draft PR. Although the code is documented, I have not yet created the actual documentation for these additional algorithms (as the API is not yet fixed). Additionally, if these methods are approved, the partial_dependence functions need to be changed (to pass the quantile parameter to the predict function).

Note that some parts are currently optimised with numba, rather than Cython. I am still attempting to do this conversion myself, but any help on this would be great.

@jnothman (Member) left a comment


Thanks! This looks quite nice after a glance at qrf.
Concerns:

  • We ordinarily avoid additional parameters to methods like predict. The exception has been return_std etc, and maybe this could fall under the same allowance.
  • Should we support predicting at multiple q simultaneously? For an array-like of q, do we return a matrix/tensor of results, or a list of vectors/matrices? (See the np.quantile convention illustrated after this list.)
  • I don't know what current thoughts are on numba, but I think it would be a big step to introduce a hard dependency on it. Certainly, this requires a core developer vote according to our governance document. It might even deserve a vote if we make it a soft dependency. The outcome could be rewriting those functions in Cython.
  • I'm somewhat concerned about maintaining two random forest implementations in parallel. I assume you've attempted to share as much code as possible, but we should also share tests where possible.
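
For reference, np.quantile's own convention for an array-valued q is to prepend a quantile axis to the output; an illustration (not code from this PR):

```python
import numpy as np

y = np.arange(12.0).reshape(6, 2)  # e.g. 6 samples, 2 outputs
print(np.quantile(y, 0.5, axis=0).shape)                # (2,): scalar q
print(np.quantile(y, [0.05, 0.5, 0.95], axis=0).shape)  # (3, 2): q axis prepended
```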

Maybe you can attend our core devs meeting on Monday to raise some of these. (Advertised on our mailing list. I would be more helpful if I were at my desk!)

@jasperroebroek (Author)

Thanks for having a look at the material!

The concerns that you raise are indeed issues that would require some more thought. For now, I just wanted to get an opinion on the API and to see if you would be interested in integrating these methods. If so, it would still require some work to smooth over some of these issues (hence the draft PR).

On the concerns:

  • I understand that this would be a major deviation from what is already out there. However, I do not see another possible way (besides including a setter for q before calling predict). I am not fully aware of the consequences of deviating from the template for the predict function. So far I have attempted to insert a dictionary argument into the partial_dependence functionality to tackle this issue, but there might be others that I have missed.
  • I would be in favour of doing that, which would make it consistent with what numpy does in np.quantile. It would require some more thought on how to actually achieve this, but I believe it should be feasible.
  • Creating a new dependency on numba, while Cython is already used, does not make a whole lot of sense from my perspective.
  • The implementation under the hood is exactly the same as the regular random forest. Some additional work is needed after fitting, but the base is the same. All tests for random forest that do not require predict should yield the same result for QRF (at least in its default implementation).

I will try to attend the meeting on Monday to discuss this in more detail.

@lorentzenchr (Member)

@jasperroebroek Thank you for this draft PR. Some more options for quantile regression would certainly be beneficial.
I only skimmed over the Random Forest Quantile Regression and now have a few questions and remarks:

  • Is it correct that single decision trees are fitted based on the squared error (MSE) splitting criterion? Does that make sense if our goal is to predict conditional quantiles?
  • IIUC, all quantiles are predicted with the same trees and the same splitting thresholds. One argument in linear quantile regression is that features may have very different effects (coefficients) when fitted for different quantiles. Wouldn't this single threshold for all quantiles weaken the predictiveness for at least some quantiles?

@jasperroebroek (Author)

@lorentzenchr thank you for taking the time to review.

So, let me try to answer your questions/remarks. Quantile Regression Forests are indeed fitted with exactly the same approach as a normal Random Forest would be. From there it does some postprocessing, but the core is the same (with the MSE splitting criterion etc.). In his paper, Meinshausen (see the link above) describes why, and under which conditions, this yields credible conditional quantiles. In short: pretty much all the time. After testing this a little, I believe that there are, as you describe, some constraints on the use of this methodology: locally, the mean should at least not be strongly negatively correlated with the quantile that you are trying to predict.

Issue #18849, comparing QRF and gradient boosting, shows some figures comparing the different approaches on synthetic sinusoidal data (#18849 (comment)). It shows that the methodology works quite well for both the upper and lower quantiles, even without any hyperparameter selection. (PS: I just realised you were the one commenting on the post, so ....)

In contrast, a Random Forest could be trained with a quantile loss function, probably yielding more accurate conditional quantiles; QRFs are likely faster and more flexible. I think it would be great to have both in sklearn (letting the user choose between these tradeoffs).
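
To make that postprocessing concrete, here is a minimal sketch of Meinshausen-style prediction on top of a plain scikit-learn forest. It is an illustration only: qrf_predict is a hypothetical helper, it ignores in-bag weighting, and it is far slower than the compiled code in this PR.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_predict(forest, X_train, y_train, X_test, q):
    train_leaves = forest.apply(X_train)  # (n_train, n_trees) leaf ids
    test_leaves = forest.apply(X_test)    # (n_test, n_trees)
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    preds = np.empty(len(X_test))
    for i, leaves in enumerate(test_leaves):
        # Each training sample is weighted by 1/leaf_size in every tree
        # whose leaf it shares with the test point, averaged over trees.
        weights = np.zeros(len(y_train))
        for t, leaf in enumerate(leaves):
            in_leaf = train_leaves[:, t] == leaf
            weights[in_leaf] += 1.0 / in_leaf.sum()
        weights /= len(leaves)
        # Read the q-quantile off the weighted empirical CDF.
        cdf = np.cumsum(weights[order])
        preds[i] = y_sorted[min(np.searchsorted(cdf, q), len(cdf) - 1)]
    return preds

forest = RandomForestRegressor(n_estimators=50, random_state=0)
# forest.fit(X_train, y_train); qrf_predict(forest, X_train, y_train, X_test, 0.05)
```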

@jasperroebroek (Author)

After our discussion on Monday, I have tried to take the first steps in the cythonisation of the code. Although the code that I just pushed runs and produces satisfactory results, it is terribly slow in comparison to the previous version (which used numba for compilation). I believe it is now even slower than the original Python/NumPy implementation.

What I have:

  • A weighted_quantile function that takes pre-sorted data and (almost) does not require the GIL.
  • A weighted_quantile function that can do the sorting itself but does require the GIL (I am not sure whether this one is necessary).
  • A first draft of a compiled _quantile_forest_predict.

What is still missing:

  • The _quantile_forest_predict function does not compile with cdef with specified memoryviews (for some reason that I can't figure out).
  • The main code of this function can't be parallelised because of the previous point. Additionally, I can't figure out how to store the summed weights in each realisation of the parallel loops as a separate object (see the sketch after this list).
  • The second implementation (similar to quantregForest) would be easiest to implement by creating a subclass of the MSE criterion. The only difference needs to be the node_value function, which, instead of the weighted mean, returns a weighted random sample of the data in the node. I attempted this, but I have no idea how that would work with the random values inside the criterion.
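
On the per-loop storage question, one common pattern is to give each parallel worker its own accumulator and reduce the partial results afterwards (in Cython's prange this is typically one buffer row per thread). A minimal Python/joblib analogue of the same pattern, with a hypothetical partial_sum helper:

```python
import numpy as np
from joblib import Parallel, delayed

def partial_sum(weight_chunk):
    # Each worker sums only its own chunk: no shared mutable state.
    return weight_chunk.sum(axis=0)

weights = np.random.RandomState(0).rand(1000, 8)
chunks = np.array_split(weights, 4)
partials = Parallel(n_jobs=4)(delayed(partial_sum)(c) for c in chunks)
total = np.sum(partials, axis=0)  # reduce the per-worker results
assert np.allclose(total, weights.sum(axis=0))
```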

I have put a lot of comments and thoughts in the code, so hopefully it is still intelligible.

In short: I would really need some help from someone with more Cython experience to make this work. Would someone have the time and energy to go through this?

@jasperroebroek (Author)

Another update on the quantile regression forests. I have converted the code completely to Cython and changed the API as discussed (by not including the quantiles in the predict function, but by setting them beforehand). It currently runs smoothly and does what it is supposed to do, including predicting an array-like of quantiles simultaneously. If you agree on the API and implementation, I would be ready to switch to a regular PR (instead of a draft one).

@jasperroebroek marked this pull request as ready for review, April 21, 2021 08:47
@jasperroebroek (Author)

Another update on QRF. I added the mean pinball loss as the score function of the Random Forest Quantile Regressor. I have also uploaded an example of the QRF in action. Note that this is almost a direct copy of the new example of Gradient Boosting Quantile Regression. The workflow, results and conclusions are identical, and the final coverage fraction is even better reproduced by the QRF. I have changed the status of this PR from draft to regular. Happy to discuss further what needs to be added before it can be accepted.

@jasperroebroek requested a review from jnothman, April 28, 2021 17:38
@tim-oh commented Jun 21, 2021

I tried to install the package (in a conda env with git clone & python setup.py install), but get

```
ImportError: cannot import name 'RandomForestQuantileRegressor' from 'sklearn.ensemble' (/INSTALL_DIR/lib/python3.8/site-packages/sklearn/ensemble/__init__.py)
```

Should that work in principle?
Can I fix it?

@jasperroebroek (Author)

I believe that is caused by an incorrect compilation of the Cython source files. Not sure why that happens.

The workaround I have been using is to compile the right files after the installation of sklearn. To do this, create a second setup file next to the sklearn setup file and call it 'setup_qrf.py'. Put the compilation code for the QRF extensions inside (this is an example, I haven't tested this):

```python
from setuptools import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

# The include dirs and OpenMP flags belong on the Extension objects,
# not on setup() itself.
ext_kwargs = dict(
    include_dirs=[numpy.get_include()],
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)

extensions = [
    Extension("sklearn.utils._weighted_quantile",
              ["sklearn/utils/_weighted_quantile.pyx"], **ext_kwargs),
    Extension("sklearn.ensemble._qrf",
              ["sklearn/ensemble/_qrf.pyx"], **ext_kwargs),
]

setup(ext_modules=cythonize(extensions))
```

In the terminal, compile with python setup_qrf.py build_ext --inplace

Now you should be able to import the RandomForestQuantileRegressor class.

Let me know if this works for you!

@ryancinsight

@jasperroebroek on windows i am getting the error Buffer dtype mismatch, expected 'long' but got 'long long', seems like there are other packages with similar issues where they had to resort to libc type default instead of numpy's

@ryancinsight

> @ryancinsight I don't think it is related to the weighted_quantile module, as that is not using any integer-type data. If I am correct, it is caused by assigning np.int64 explicitly outside the Cython code, while referring to it through the same mechanism as the system 'long' type. Again, I am not able at the moment to test this on a Windows machine. Sorry for the inconvenience. I would be very interested to know if this bug-fix actually resolves the issue. Thanks for testing ;)

Still there, but now:

```
File "sklearn\ensemble\_qrf.pyx", line 547, in sklearn.ensemble._qrf._RandomSampleForestQuantileRegressor.fit
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long'
```

@ryancinsight commented Jul 2, 2021

@jasperroebroek I found the previous issue though: it was idx in your sample method's fit. It needs to be changed to idx = np.arange(self.n_samples_, dtype=np.int64)[mask], and it no longer gives me an issue. I just tested with np.arange(self.n_samples_, dtype=np.intp)[mask] and the new version also works, and seems slightly faster.
Best
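
A short illustration of why np.intp avoids the mismatch: scikit-learn's tree code indexes with the pointer-sized SIZE_t, and on 64-bit Windows a C 'long' is only 32 bits, so an explicit np.int64 buffer maps to 'long long' there, while np.intp always matches the pointer width (mask and sizes below are illustrative):

```python
import numpy as np

# np.intp is pointer-sized, matching the SIZE_t used by scikit-learn's
# tree code on every platform, unlike np.int64 on 64-bit Windows.
mask = np.zeros(100, dtype=bool)
mask[::2] = True
idx = np.arange(100, dtype=np.intp)[mask]  # the fix discussed above
print(idx.dtype.itemsize)  # 8 on any 64-bit platform
```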

@ryancinsight

Interesting: the default method is now faster than the sample method, which you said should be faster. Was this fixed? I'm not sure there is a need for two versions if the default is faster now.

@ryancinsight

Never mind; it seems that predicting multiple quantiles slows down the default method compared to the sample method... I wonder if there is a way to improve this, because the results seem more accurate with the default.

@jasperroebroek (Author)

@ryancinsight So if I understand it correctly, to solve the issue only line 539 needs to change into idx = np.arange(self.n_samples_, dtype=np.intp)[mask]?

Anyway, speaking about the performance difference between the two implementations: the default is more precise, but becomes prohibitively slow on datasets of between 10,000 and 100,000 samples. In my benchmark runs it is actually faster than the sampling approach for small datasets (hundreds to thousands of samples). With more than one quantile, the sampling method was almost always faster than the default implementation. Do you confirm these observations?

Anyway, thanks for checking out the code!
Cheers

@ryancinsight

@jasperroebroek That is correct: with the dtype on line 539 set, everything now runs smoothly on Windows 10.

I can also confirm your observations on performance.

Best regards, great work!

@Nbruneau9

Hello, I see that this thread has not been updated since July. Is this implementation of QRF planned to be integrated into sklearn in the future? It would be very useful. Thanks. Nicolas

@jasperroebroek (Author)

For those interested: I have published these methods as a separate package as a temporary place for development:
https://github.com/jasperroebroek/sklearn-quantile

@ogrisel (Member) commented Jan 26, 2022

> For those interested: I have published these methods as a separate package as a temporary place for development:
> https://github.com/jasperroebroek/sklearn-quantile

Thanks for doing this. To increase adoption, it would be worth publishing a release on PyPI & conda-forge, as well as publishing the rendered HTML doc on gh-pages, for instance.

Also, you should better highlight in the README when those methods are more interesting than quantile gradient boosting: in particular the fact that they are more efficient for people who want to estimate many quantiles at once for a given input (e.g. all the percentiles, to get a coarse estimate of the CDF of the predictive distribution).

I think it would be interesting to get those methods upstream in scikit-learn at some point, but I must confess we are running a bit short on reviewer bandwidth to work on such a big new refactoring.

@jasperroebroek (Author) commented Feb 4, 2022

I have now pushed the project to PyPI, so it should be installable with pip. I am still trying to figure out how to get it onto conda-forge too. Documentation (although really barebones) is available at https://sklearn-quantile.readthedocs.io/en/latest/?badge=latest. I hope that this is enough for some initial testing!

The code has now significantly diverged from this pull request. It still very much follows the structure of scikit-learn, so it should be relatively straightforward to get it into the main scikit-learn repository if needed.

@Bougeant commented Feb 8, 2022

Hi @jasperroebroek

First of all, awesome work! I have been looking for a replacement for scikit-garden, which has been inactive for a few years and is falling behind scikit-learn (compatible with <0.23 only).

I have questions regarding the predictions made using the two models (Random Forest from sklearn-quantile / scikit-garden, versus Gradient Boosting from scikit-learn / LightGBM).

Indeed, these predictions seem to be very different from one another.

Here's some code to demonstrate this:

```python
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn_quantile import RandomForestQuantileRegressor

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)
quantile = 0.05

# Gradient Boosting conditional mean model.
# Models have different hyper-parameters to have similar test and train scores.
param_gb_mean = {"n_estimators": 100, "max_depth": 4, "min_samples_leaf": 60, "max_features": None, "random_state": 1}
gb_mean = GradientBoostingRegressor(**param_gb_mean)
gb_mean.fit(X_train, y_train)
y_mean_gb = gb_mean.predict(X_test)

# Gradient Boosting quantile model
param_gb_quantile = {"loss": "quantile", "alpha": quantile, "n_estimators": 100, "max_depth": 4, "min_samples_leaf": 60, "max_features": None, "random_state": 1}
gb_quantile = GradientBoostingRegressor(**param_gb_quantile)
gb_quantile.fit(X_train, y_train)
y_quantile_gb = gb_quantile.predict(X_test)

# Random Forest conditional mean model
param_rf_mean = {"n_estimators": 100, "max_depth": 10, "min_samples_leaf": 10, "max_features": 5, "random_state": 1}
rf_mean = RandomForestRegressor(**param_rf_mean)
rf_mean.fit(X_train, y_train)
y_mean_rf = rf_mean.predict(X_test)

# Random Forest quantile model
param_rf_quantile = {"q": quantile, "n_estimators": 100, "max_depth": 10, "min_samples_leaf": 10, "max_features": 5, "random_state": 1}
rf_quantile = RandomForestQuantileRegressor(**param_rf_quantile)
rf_quantile.fit(X_train, y_train)
y_quantile_rf = rf_quantile.predict(X_test)
```

Now you will note that while the average predicted conditional mean values (y_mean_gb and y_mean_rf) are within 0.2% of each other, the average predicted 5% quantiles are quite far apart (10% higher with the gradient boosting model; 6% after recalibrating both models so that exactly 5% of samples are over-predicted). I am seeing exactly the same thing on another dataset I'm interested in, but with a stronger effect (+15% for the gradient boosting model after recalibration).
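
A minimal sketch of one way to do the recalibration described above (an assumed interpretation: shift predictions by a constant offset until the empirical coverage is exactly q; recalibrate is a hypothetical helper):

```python
import numpy as np

def recalibrate(y_pred, y_true, q=0.05):
    # Choose the constant shift so that exactly a fraction q of the
    # true values fall below the shifted quantile predictions.
    offset = np.quantile(y_true - y_pred, q)
    return y_pred + offset

# e.g. y_quantile_gb_cal = recalibrate(y_quantile_gb, y_test)
```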

According to the pinball loss, which is the most commonly used quantile regression metric, the performance of the two models is equivalent (slightly better for the GB model in this case, but better for the RF model in the case of my dataset of interest, after spending some time optimizing each model).

More significantly, the shapes of the predicted quantiles as a function of the predicted conditional mean are wildly different, both for this dataset and mine (banana shape for the RF model).

For the Gradient Boosting model:

```python
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y_mean_gb, y=y_quantile_gb, alpha=0.1, s=10, linewidth=0)
plt.plot([0, 6], [0, 6], c="black", dashes=[5, 5])
plt.xlim(0, 6)
plt.ylim(0, 6)
plt.xlabel("Conditional Mean")
plt.ylabel("5% Quantile")
sns.despine()
```

[Scatter plot: predicted 5% quantile vs. predicted conditional mean, Gradient Boosting model]

For the Random Forest model:

```python
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y_mean_rf, y=y_quantile_rf, alpha=0.1, s=10, linewidth=0)
plt.plot([0, 6], [0, 6], c="black", dashes=[5, 5])
plt.xlim(0, 6)
plt.ylim(0, 6)
plt.xlabel("Conditional Mean")
plt.ylabel("5% Quantile")
sns.despine()
```

[Scatter plot: predicted 5% quantile vs. predicted conditional mean, Random Forest model]

I was thinking that this is not due to the difference in the type of model itself (Random Forest vs Gradient Boosting: the conditional mean models are very similar) but rather to how the quantile models are built. My understanding is that:

  • The Random Forest model
    • is built by splitting the branches/leaves in order to optimise MSE or MAE, not the quantiles
    • stores all the samples from the training data in each leaf during training. During inference, it generates a distribution of the training samples sharing the same leaves and selects the X% quantile as the prediction
  • The Gradient Boosting model
    • is built by splitting the branches/leaves in order to optimise the selected quantile loss (one would think that this guarantees a better model, but that's not what I'm seeing)
    • (correct me if I'm wrong) during training we store the selected quantile associated with each leaf, and during inference we take the average of the quantiles for the leaves we fall into
    • one obvious drawback of the GB model is that for multiple quantiles we need independent models, so nothing guarantees that the quantile predictions are monotonic (higher quantile = higher value), unlike the RF model

Can we explain why these quantile predictions are so different? Why does the Gradient Boosting model routinely predict average quantiles 10% larger than the Random Forest model?

@lorentzenchr, this may also interest you since it is somewhat related to a debate we had a year ago.

@jasperroebroek (Author)

Hi @Bougeant,

Thanks for reaching out and testing the package. I am very glad to hear that you have managed to install and run it without any issues!

Concerning the difference between Gradient Boosting quantile regression and Random Forest quantile regression: I am not really well versed in how Gradient Boosting works under the hood, except that it uses the pinball loss for training. If your list is accurate, it wouldn't surprise me that the lower quantiles in GB predict higher values than in RF. Averaging predicted quantiles would always be more conservative (closer to the mean) in comparison to calculating the quantile over all the data informing the prediction. (That would suggest lower predictions for GB for high quantiles.) To test this hypothesis, one might create both GB and RF with equal parameters and only one tree. In this case the only difference between the models should be in the training phase, one going for MSE and the other for the lowest pinball loss.
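
A sketch of that single-tree experiment, reusing the variables from the earlier snippet (quantile, X_train, y_train); the forest parameters assume sklearn-quantile forwards the usual RandomForestRegressor arguments such as bootstrap:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn_quantile import RandomForestQuantileRegressor

shared = {"max_depth": 4, "min_samples_leaf": 60, "random_state": 1}
gb_one = GradientBoostingRegressor(
    loss="quantile", alpha=quantile, n_estimators=1, learning_rate=1.0, **shared
)
rf_one = RandomForestQuantileRegressor(
    q=quantile, n_estimators=1, bootstrap=False, **shared
)
gb_one.fit(X_train, y_train)
rf_one.fit(X_train, y_train)
# With a single tree each, any remaining gap between the two predictions
# on X_test is attributable to the split criterion (pinball loss vs MSE)
# rather than to ensembling.
```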

The difference between training on MSE or pinball loss very much depends on the structure of the data. Locally, i.e. where the algorithm tries to split the data, both algorithms are expected to yield similar results as long as the quantile you are looking at is effectively positively correlated with the mean. In many cases this assumption is approximately correct, but not always. In those cases where it is not, training on the pinball loss would be much more accurate (which might, for example, be the case at the end of the banana shape).

To test the accuracy of both models, you might also have a look at K-Nearest Neighbours quantile regression (also available in sklearn-quantile), which does not use bootstrapping to come to a conclusion.

@Bougeant

Great suggestion to try models with one tree. Let me try that.

@jasperroebroek (Author)

Another update on this project. It just got accepted in conda-forge so it now works with conda!

@ogrisel changed the title from "Draft implementation of quantile methods" to "Draft implementation of non-parametric quantile methods (RF, Extra Trees and Nearest Neighbors)", May 12, 2022
@ogrisel (Member) commented Sep 7, 2022

@jasperroebroek out of curiosity, do you use honest trees in your implementation (in-tree validation sets for quantile estimation)? It seems that they can help get better calibration for quantile estimators. See for instance: #19710

@jasperroebroek (Author)

This implementation currently uses the default implementation of the random forest and extra trees algorithms. This is in line with the original research paper by Meinshausen. Undoubtedly, using quantile information at the time of training will give improved accuracy, but this will likely be (much) slower, and predictions can't be done with the same model for different quantiles.

@adam2392 (Member)

Adding some thoughts, cuz this PR is of interest to me. This PR's maintenance burden would most likely benefit from the PRs proposed to refactor trees.

Then, instead of a quantile forest Cython file, we could just create a QuantileTree class.

@jasperroebroek (Author)

The code could indeed partially benefit from an update to the Splitter.

The code that I have sent here contains two different strategies to calculate conditional quantiles with random forests. The first and most precise one, proposed by the original paper, calculates the quantiles after the random forest is trained with the standard Criterion and Splitter. It could be beneficial here to store all values per leaf, rather than only the mean, but the quantile calculation would still need to happen at prediction time, and it would make sense to me to keep it in its own class.

The second method takes advantage of sampling the values at leaf scale, which could be done in the Splitter and would be much more efficient. The Criterion would still remain the same, I assume. In this method, the quantiles are calculated on the outputs of the different trees.

I hope this information helps!
An updated and operational version of the code is available in the sklearn-quantile package.

@adam2392 (Member)

@jasperroebroek would you mind having a look at the following PR? I am proposing to separate the API for setting leaf/split nodes in the current Tree Cython code.

Then I included a very short pseudo-code demo of the potential quantile-tree implementation. Lmk if this looks like the right direction to you?

#25448

@jasperroebroek (Author) commented Jan 23, 2023

I just had a look at the code that you wrote on the quantile tree. I am a bit confused by the predict function that you define there. While this, in and of itself, might be valid, the prediction of the quantile regression forest is not based on the idea of an ensemble of predictions by its individual trees, as is the case in standard regression forests. Rather, it pulls all values (and weights) of the relevant leaves of all trees and calculates a weighted quantile on them. I don't see any other way than doing this operation in the quantile regression forest class itself.

@adam2392 (Member)

> I just had a look at the code that you wrote on the quantile tree. I am a bit confused by the predict function that you define there. While this, in and of itself, might be valid, the prediction of the quantile regression forest is not based on the idea of an ensemble of predictions by its individual trees, as is the case in standard regression forests.

Yes, this is the case of just reproducing a single quantile tree. However, with a single quantile tree that also optionally exposes the leaf values, the ensemble method should be nice and short. Rather, I'm asking, in terms of the API, whether you think this will fit the scope of what is needed by a quantile tree (forest). The main goal is to enable a quantile tree (forest) that simply needs to rewrite a handful of functions, while the rest of the tree-building etc. code lives in the base classes.

> Rather, it pulls all values (and weights) of the relevant leaves of all trees and calculates a weighted quantile on them. I don't see any other way than doing this operation in the quantile regression forest class itself.

Yep you're correct. The eventual goal would be to do this, but at the forest level.

@jasperroebroek (Author)

This definitely sounds like an improvement over how I initially implemented it, and could provide some nice speed improvements. I also implemented the sampling method used in the R package, which takes a weighted random value at leaf level and calculates the quantile at forest level. Although this is an approximation, it would make it so much faster that in most cases this would be the preferred algorithm. When implemented as a Tree class, that would be trivial, I suppose. If you need further input or help with the implementation, I am available.
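
For readers following along, a rough sketch of that sampling strategy on top of a standard forest. It is illustrative only: sample_forest_quantile is a hypothetical helper, the sampling is unweighted, and no in-bag bookkeeping is done.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sample_forest_quantile(forest, X_train, y_train, X_test, q, seed=0):
    rng = np.random.RandomState(seed)
    train_leaves = forest.apply(X_train)  # (n_train, n_trees) leaf ids
    test_leaves = forest.apply(X_test)
    n_trees = train_leaves.shape[1]
    per_tree = np.empty((len(X_test), n_trees))
    for t in range(n_trees):
        # Store one randomly drawn training target per leaf of tree t...
        leaf_sample = {
            leaf: rng.choice(y_train[train_leaves[:, t] == leaf])
            for leaf in np.unique(train_leaves[:, t])
        }
        per_tree[:, t] = [leaf_sample[leaf] for leaf in test_leaves[:, t]]
    # ...then take a plain quantile over the per-tree values per sample.
    return np.quantile(per_tree, q, axis=1)
```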

@adrinjalali (Member)

I think having this as a separate package is ideal (which we already have), and with the work @adam2392 has been doing to make some parts of the codebase more extendable, we could close this one. Happy to reopen if y'all think otherwise.

@adam2392 (Member)

For completeness: Linking the package I've come to develop to shore up some of these gaps (maybe in future some aspects of them will make it into scikit-learn): https://github.com/neurodata/scikit-tree

Also note: https://github.com/zillow/quantile-forest
