Quantile Regression Forest [Feature request] #11086


Open

RLstat opened this issue May 11, 2018 · 6 comments

RLstat commented May 11, 2018

Hi, Scikit-learn owners and contributors,

I am wondering whether scikit-learn would want to implement the quantile regression forest: http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf.

It is a highly cited paper, and an R package already exists: https://cran.r-project.org/web/packages/quantregForest/quantregForest.pdf

This method has been widely used in various quantile regression problems.

From an implementation perspective, it is also a natural extension of random forests, given that scikit-learn already has a good random forest implementation. Most of the computation is performed by the random forest base method.

In addition, R's extra-trees package also has quantile regression functionality, which is implemented very similarly to the quantile regression forest. So if scikit-learn implemented quantile regression forests, it would be relatively easy to add the same functionality to the extra-trees algorithm as well.

Please let me know if this is possible. Thanks.

glemaitre (Member) commented:

FWIW, there is the scikit-garden implementation:
https://scikit-garden.github.io/examples/QuantileRegressionForests/

There is also an example of prediction intervals with GBRT:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html

@MechCoder @glouppe Was there any discussion about changing the tree API to support outputting the standard deviation?
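The GBRT example linked above fits one separate model per quantile using the quantile loss. A minimal sketch of that approach on synthetic data (the data and parameter choices here are made up for illustration, not taken from the linked example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# One separate model per quantile; alpha sets the target quantile.
models = {
    alpha: GradientBoostingRegressor(
        loss="quantile", alpha=alpha, n_estimators=100, random_state=0
    ).fit(X, y)
    for alpha in (0.05, 0.5, 0.95)
}

X_new = np.linspace(0, 10, 50).reshape(-1, 1)
lower = models[0.05].predict(X_new)
median = models[0.5].predict(X_new)
upper = models[0.95].predict(X_new)
```

The pair (lower, upper) forms an approximate 90% prediction interval, at the cost of training one model per quantile of interest.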

RLstat (Author) commented May 12, 2018

@glemaitre, thanks for the information.

Is scikit-garden in any way associated with scikit-learn? I've tried it and it is a nice implementation, but it seems a bit inactive lately in terms of development, issue triage, and releasing updated versions to PyPI.

I am not sure whether it would be a better (or worse) idea to gather these resources in one place for better development and maintenance, given how closely related the methods are.

glemaitre (Member) commented:

@MechCoder @glouppe are contributors to both projects. I would say that this is a contribution to scikit-learn that should be compatible with scikit-learn tooling. It would be worth integrating it into scikit-learn if we have a good API that fits well.

Actually, @betatim might know something about the project as well.

betatim (Member) commented May 13, 2018

scikit-garden doesn't lack people who want to maintain it, but it lacks people who have the time to maintain it :) If someone wants to help out I think they would be welcomed with open arms and given a lot of authority.

Returning the std dev together with predictions would require adjusting the scikit-learn interface. So I think being freed from that constraint by being a separate package (scikit-garden) is a good thing. Put another way: the effort needed to change scikit-learn is far larger than the effort needed to help update scikit-garden.

Quantile regression in GBRT works well, but I always end up writing a small wrapper to bundle three estimators together so that I can return the std dev: https://github.com/scikit-optimize/scikit-optimize/blob/1c4c0f12ad8c1fe0a33542108fc0e55164138f9d/skopt/learning/gbrt.py#L14. In scikit-optimize we also have other forest-based quantile regression: https://github.com/scikit-optimize/scikit-optimize/blob/1c4c0f12ad8c1fe0a33542108fc0e55164138f9d/skopt/learning/forest.py
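The three-estimator wrapper idea can be sketched roughly like this. The `QuantileGBRT` class name and the 16/50/84 percentile choice are my own illustration loosely following the scikit-optimize wrapper linked above, not its actual code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class QuantileGBRT:
    """Bundle three quantile GBRTs so predict() can also return a std-dev
    estimate (hypothetical wrapper, for illustration only)."""

    def __init__(self, quantiles=(0.16, 0.5, 0.84), random_state=0):
        self.quantiles = quantiles
        self.random_state = random_state

    def fit(self, X, y):
        self.models_ = [
            GradientBoostingRegressor(
                loss="quantile", alpha=q, random_state=self.random_state
            ).fit(X, y)
            for q in self.quantiles
        ]
        return self

    def predict(self, X, return_std=False):
        preds = np.stack([m.predict(X) for m in self.models_])
        if return_std:
            # Half the width of the central ~68% interval is a rough
            # stand-in for one standard deviation.
            return preds[1], (preds[2] - preds[0]) / 2.0
        return preds[1]

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=150)
mean, std = QuantileGBRT().fit(X, y).predict(X, return_std=True)
```

Note that separately fitted quantile models can occasionally cross, so the derived "std" can be noisy or even negative in places; a production wrapper would want to guard against that.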

jasperroebroek commented Mar 16, 2021

Dear scikit-learn maintainers and contributors,

Although this is a bit of an old issue, I would like to reach out to you about exactly this class of models. This might become a bit of a long post, but please bear with me.

I recently tried the scikit-garden implementation of these models, but it does not seem suitable for data with more than around 10,000 samples. Since then I have been rewriting the implementation (to improve performance) and trying to reach out to the scikit-garden maintainers. Sadly, the package no longer appears to be maintained (the last commit to master is four years old).

As I continued working on the model, I wanted to see if it would be possible to contribute it directly to scikit-learn, as it is fully based on (and consistent with the interface of) scikit-learn. I believe it is worth including in scikit-learn, as it does something quite different from the only other quantile method currently implemented (GradientBoostingRegressor): it predicts conditional quantiles from a single model, i.e. you only need to train it once and can then predict all quantiles. (The trade-off is that it is a less precise estimate of the conditional quantile.)

I currently have two implementations: (1) the one from the original paper, and (2) one based on the quantregForest approach. The first groups together the weighted samples from all trees for each prediction and calculates the weighted quantile. The second assigns a weighted random draw from the values in each leaf and calculates the quantile over these random samples. The second is obviously an approximation, but it is substantially faster. Below is a figure that compares both approaches with the standard scikit-learn random forest (with MAE criterion, which yields the median) and with GradientBoostingRegressor for the 90th percentile (on the Boston dataset).
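For readers unfamiliar with the first approach: a minimal, self-contained sketch of Meinshausen-style weighted-quantile prediction on top of a plain RandomForestRegressor could look like the following. This is my own illustration of the idea, not the code from the fork:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def weighted_quantile(values, weights, q):
    # Sort values and find the first point where the cumulative
    # normalized weight reaches the requested quantile.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights) / np.sum(weights)
    idx = min(np.searchsorted(cum, q), len(values) - 1)
    return values[idx]

def qrf_predict(forest, X_train, y_train, x, q):
    # Weight each training sample by how often it shares a leaf with x
    # (normalized by leaf size), then take the weighted quantile of y.
    train_leaves = forest.apply(X_train)              # (n_samples, n_trees)
    query_leaves = forest.apply(x.reshape(1, -1))[0]  # (n_trees,)
    weights = np.zeros(len(y_train))
    for t in range(train_leaves.shape[1]):
        in_leaf = train_leaves[:, t] == query_leaves[t]
        weights[in_leaf] += 1.0 / in_leaf.sum()
    return weighted_quantile(y_train, weights, q)

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
median_pred = qrf_predict(rf, X, y, X[0], 0.5)
```

The per-query Python loop over trees is exactly the kind of hot spot that motivates a compiled (numba/Cython) implementation.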

[Figure: four scatter plots comparing the default (paper) and random-sample QRF implementations against the MAE random forest (median) and GradientBoostingRegressor (q=0.9).]

import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
# RandomForestQuantileRegressor comes from my fork, not scikit-learn itself
from sklearn.ensemble import RandomForestQuantileRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

boston = load_boston()
X, y = boston.data, boston.target

# Split the data, training on half of it
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

qrf_default = RandomForestQuantileRegressor(random_state=0)
qrf_rs = RandomForestQuantileRegressor(random_state=0, method='sample')
rf = RandomForestRegressor(random_state=0, criterion='mae')
gbr = GradientBoostingRegressor(random_state=0, loss='quantile', alpha=0.9)

qrf_default.fit(X_train, y_train)
qrf_rs.fit(X_train, y_train)
rf.fit(X_train, y_train)
gbr.fit(X_train, y_train)

f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))
ax = ax.flatten()

ax[0].scatter(qrf_default.predict(X_test, q=0.5), rf.predict(X_test))
ax[0].set_xlabel("Quantile regression forest - median")
ax[0].set_ylabel("Random forest with MAE criterion")
ax[0].set_title("Default implementation (from original paper)")
ax[0].plot([10, 50], [10, 50])

ax[1].scatter(qrf_rs.predict(X_test, q=0.5), rf.predict(X_test))
ax[1].set_xlabel("Quantile regression forest - median")
ax[1].set_ylabel("Random forest with MAE criterion")
ax[1].set_title("Random sample implementation (as in quantregForest)")
ax[1].plot([10, 50], [10, 50])

ax[2].scatter(qrf_default.predict(X_test, q=0.9), gbr.predict(X_test))
ax[2].set_xlabel("Quantile regression forest - q=0.9")
ax[2].set_ylabel("GradientBoostingRegressor - q=0.9")
ax[2].set_title("Default implementation (from original paper)")
ax[2].plot([10, 50], [10, 50])

ax[3].scatter(qrf_rs.predict(X_test, q=0.9), gbr.predict(X_test))
ax[3].set_xlabel("Quantile regression forest - q=0.9")
ax[3].set_ylabel("GradientBoostingRegressor - q=0.9")
ax[3].set_title("Random sample implementation (as in quantregForest)")
ax[3].plot([10, 50], [10, 50])

plt.show()

This method yields exactly the same results as the original scikit-garden implementation and passes its unit tests (with slight modifications for different properties).

However, from here I don't know how to proceed. I am new to collaborating on community-maintained packages and don't really know what to do or whom to approach, hence my extensive comment here. I have committed my code to my fork of scikit-learn (https://github.com/jasperroebroek/scikit-learn), but it seems a bit premature to actually open a PR. Also, I am currently using numba for fast for-loops; as scikit-learn uses Cython, I suppose this additional dependency would not be well received. I have attempted to rewrite it in Cython, but have not yet succeeded.

So, if this is of interest for the scikit-learn community, I would love to receive some guidance on the path to follow from here.

Note 1
For the random-sampling approach to quantile regression forests, it would probably be much faster to implement this directly in the tree-building process. I have looked at the current implementation of the tree structure and I believe this would be quite trivial: it would only require the creation of an additional criterion, identical to the current MSE criterion except that the node_value method returns a weighted random sample of the data rather than the weighted mean. I am currently computing these values after the fitting process and storing them in the tree_.value array, so the prediction returns these values automatically. From my understanding, the two methods would lead to the same result, which means the predict function can stay the same. Again, my limited understanding of the Cython syntax prevents me from implementing it myself.
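The post-fit version of that trick (overwriting tree_.value after training) can be sketched as follows. This is my own rough illustration, not the fork's code; it assumes tree_.value shares memory with the tree's internal value array, and it uses an unweighted draw per leaf rather than the weighted draw described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Overwrite each leaf's stored value (normally the leaf mean) with one
# random draw from the training targets routed to that leaf.
# Assumption: writes to tree_.value propagate to the tree internals.
for est in rf.estimators_:
    leaves = est.apply(X)  # leaf node id per training sample
    for leaf in np.unique(leaves):
        est.tree_.value[leaf, 0, 0] = rng.choice(y[leaves == leaf])

# Averaging the per-tree draws (rf.predict) would give a mean again; to
# get a quantile, collect per-tree predictions and take it across trees.
per_tree = np.stack([est.predict(X[:5]) for est in rf.estimators_])
q90 = np.quantile(per_tree, 0.9, axis=0)
```

Doing the draw inside a custom criterion's node_value at fit time, as proposed above, would avoid this second pass over the data entirely.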

Note 2
As the quantile parameter goes to the model's predict function, the partial dependence functionality does not reflect the conditional quantiles but rather the mean. To support this, an additional parameter would need to be passed through these methods. I am not sure whether that is compatible with the recursive method of computing partial dependence for the RandomForestRegressor class.

#18997

reidjohnson commented:
FWIW regarding this issue, there is an actively maintained, scikit-learn compatible/compliant Quantile Regression Forest implementation available here: https://github.com/zillow/quantile-forest
