DOC Improve claim prediction example #16648

Conversation
Ping @lorentzenchr.
examples/linear_model/plot_poisson_regression_non_normal_loss.py
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
In the original PR, we had a short discussion whether to have all includes at the very beginning or each import where it is first needed. I'm fine with both approaches.
I really prefer to split large import blocks for long examples. Otherwise the reader has to scroll up a lot to know where a function or class comes from.
gbrt = Pipeline([
    ("preprocessor", tree_preprocessor),
    ("regressor", HistGradientBoostingRegressor(max_leaf_nodes=128)),
With min_weight_fraction_leaf=0.01, we could avoid non-positive predictions of the random forest. Could we try the same for the HGBR, in this case with min_samples_leaf=...?
Because max_leaf_nodes is limited and the number of trees is very large, I don't think this is going to change anything. Also, with the Poisson loss and its log link that are now enabled, we don't have any negative predictions for HGBRT anymore.
# - The non-linear Gradient Boosting Regression Trees model does not seem to
#   suffer from significant mis-calibration issues (despite the use of a least
#   squares loss).
Just as an explanation: for large sample sizes, the different losses do not matter much (as long as they are strictly consistent for the expected value). Without interactions and other cheerful feature engineering for linear models, the tree-based models will win.
I agree. I even doubt that it's possible to cheerfully feature engineer enough to compete with the tree-based models when n_samples > 1e5.
Maybe not on this and most other datasets, but I have seen otherwise :smirk:
Would love to be proven wrong ;)
Awesome. Looks pretty good. Could you also make the plots colorblind friendly?
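One way to make the plots colorblind friendly (a sketch of one option, not necessarily what the PR ended up doing) is to replace matplotlib's default color cycle with the Okabe-Ito palette:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this sketch
import matplotlib.pyplot as plt
from cycler import cycler

# Okabe-Ito palette: eight colors chosen to remain distinguishable
# under the common forms of color vision deficiency.
okabe_ito = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]
plt.rcParams["axes.prop_cycle"] = cycler(color=okabe_ito)

# Any subsequent plot now cycles through the palette automatically.
fig, ax = plt.subplots()
for k in range(4):
    ax.plot([0, 1], [0, k], label=f"model {k}")
ax.legend()
```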
Our goal is to predict the expected frequency of claims following car accidents
for a new policyholder given the historical data over a population of
policyholders.
I feel like this is an example of where I'd rather not see our library be used lol (it also has a lot of ethical concerns, bias etc., which we don't tackle).
Not that we should remove it really, just saying I personally probably would :D
Let us please agree on not having that discussion here. I understand and admire your point of view and would be very glad to talk with you about this controversial (and never ending) subject. At last, note that neither gender nor race are in the data. :no_fitting_emoji:
The issue is that the data is biased and we can't even measure that bias because we don't have the features we need (like race and gender). Not using those features doesn't mean we're not going to have a biased model discriminating against a certain group, and I'm very, very worried about putting an example out there which people would then use as a reference to [unintentionally] discriminate against people. @romanlutz, do you happen to have a good example for this one?
I'm all for better examples for controlling for bias, but I also don't think this PR is the right place for this discussion. It merely refactors an existing example, we should have this discussion in a separate issue.
As a side note, I imagine there could indeed be some sample selection bias in the data (i.e. the company chooses customers), however the target variable (frequency or cost of accidents) shouldn't be too biased, I think? At least significantly less biased than in other examples such as https://scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html for predicting wage. Also, the pricing policy of an insurance company doesn't directly impact how often one has accidents, so at least there is no immediate feedback loop with the training data. I'm not an expert on this, and there are likely other effects, but I'm just saying that we should discuss it in a separate issue.
I don't disagree with the suggestions to discuss this elsewhere. For more information on what @adrinjalali is referring to I recommend "Big Data's Disparate Impact" by Barocas and Selbst. file:///C:/Users/rolutz/AppData/Local/Temp/SSRN-id2477899.pdf section I.D. The entire piece is actually relevant for such a scenario, but that's the section that discusses excluding sensitive features such as race & gender. Other than that there's still potential bias in how the data got collected (I.A, I.B, I.C).
I think there's value in acknowledging such potential shortcomings so that people don't assume that it's the best (or only) way to approach the task. We wouldn't want users to end up on my list of questionable or unethical use cases. With the acknowledgment it should be clear that it's just a demonstration of scikit-learn. At least that's my point of view :-) Thanks for asking!
@adrinjalali and @romanlutz I opened #16715 if you want to address this topic further.
    ridge__sample_weight=df_train["Exposure"])
ridge = Pipeline([
    ("preprocessor", linear_model_preprocessor),
    ("regressor", Ridge(alpha=1e-6)),
That's some large change in alpha, interesting!
#
# - The least squares loss of the Ridge regression model seem to cause this
#   model to be badly calibrated. In particular it tends to under estimate the
#   risk and can even predict invalid negative frequencies...
...?
Also, it's not the squared error, but rather the linearity assumptions that cause negative predictions.
The tree-based model does not make the linear assumption between E[y] and X and can still predict negative values. But I will mention the implicit identity link function of Ridge.
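The identity-vs-log-link point can be illustrated on toy count data (not the example's pipeline; the data below is synthetic):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, Ridge

# Toy count data with E[y|x] = exp(0.8 x), illustrative only.
rng = np.random.RandomState(0)
X = rng.uniform(-2.0, 2.0, size=(5_000, 1))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0]))

ridge = Ridge(alpha=1e-6).fit(X, y)
poisson = PoissonRegressor(alpha=1e-6, max_iter=300).fit(X, y)

# With the implicit identity link, the linear model extrapolates below
# zero outside the training range; the log link cannot.
X_new = np.array([[-5.0]])
print("Ridge:  ", ridge.predict(X_new)[0])    # negative here
print("Poisson:", poisson.predict(X_new)[0])  # strictly positive
```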
df = fetch_openml(data_id=41214, as_frame=True).frame
this is raising
/home/circleci/project/sklearn/datasets/_openml.py:655: UserWarning: Version 1 of dataset freMTPL2freq is inactive, meaning that issues have been found in the dataset. Try using a newer version from this URL: https://www.openml.org/data/v1/download/20649148/freMTPL2freq.arff
Yes but unfortunately there is nothing we can do about it apart from contacting the original uploader. Let me do that.
Thanks! Please ping me once the first batch of comments above is addressed/responded to.
force-pushed from 91fb3a7 to 2b77584
That's my plan. The narrative will need an update though. I wonder if it's worth keeping
examples/linear_model/plot_poisson_regression_non_normal_loss.py
Outdated
Show resolved
Hide resolved
I tried fitting HGBRT with both LS and Poisson, and their calibration curves, Lorenz curves and Gini indices are almost indistinguishable. However, as expected, the HGBRT model with log link and Poisson loss does not produce negative predictions and gets a significantly better Poisson deviance. I will just keep the latter to avoid having to reorganize the plot layouts of the example.
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
force-pushed from 4a919b0 to 1fe4fb1
@rth @adrinjalali @NicolasHug @lorentzenchr I think this is ready for final review. It would be great to have it in 0.23, so that we have at least one example showing how to use the new Poisson loss of HGBRT to mention in the release notes.
lol you know I'm very conflicted about this one and how I feel about an example which can be directly used to put people with disabilities, pre-existing conditions, and certain genetic backgrounds at a direct disadvantage by just changing the dataset from the auto industry to healthcare. I also know y'all have put so much effort into the Poisson loss in this release and how important it is to you. So I'll abstain on this one and probably try to replace it with a less harmful example later.
Just changing the dataset to another one would not magically fix the ethical issues of applying machine learning models. I agree it would be good to discuss those ethical issues either in this example or in another one and cross-link this example to it. But in any case, to discuss the fairness issues of applying machine learning to insurance pricing, marketing and evaluation of the risk of a portfolio, it's good to technically understand how those models are typically built, and I think this example has a very strong pedagogical value in that respect.
We opened #16715 to discuss it. As mentioned there, I don't think this discussion is specific to this example, nor should it happen here. Besides, this PR is about improving the example, so if we don't merge it, as @adrinjalali seems to be suggesting, we will still have this example in the released docs, just with worse wording. I would rather merge this, then add comments to examples in a follow-up PR per #16715 where we find it necessary. Will try to re-review later today.
But as soon as we have good documentation / examples about how to think about and treat fairness issues in scikit-learn, we should definitely update this example to cross-link to it.
Thanks @ogrisel
A few comments but LGTM when addressed
I'm just a bit confused by the use of "calibration" which seems to mean a different thing
# ``PoissonRegressor`` thanks to the flexibility of the trees combined with
# the large number of training samples.
#
# Evaluation of the calibration of predictions
Do the curves that are plotted below have a specific name? Surely they are different from the calibration curves as we have them in the UG?
The calibration curves currently in the user guide are only defined for classification problems. They are similar to those in this example (plot the mean observed values per prediction bin) when using the quantile binning strategy. It would make sense to extend the calibration_curve tool to support regression, but this should be done in a dedicated PR.
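The quantile-binned curve described above can be sketched in a few lines (the helper name is hypothetical, and this is not scikit-learn's calibration_curve, which only supports classification):

```python
import numpy as np

def mean_observed_vs_predicted(y_true, y_pred, n_bins=10):
    """Quantile-bin samples by prediction; return per-bin means.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    order = np.argsort(y_pred)
    bins = np.array_split(order, n_bins)  # equal-count (quantile) bins
    mean_pred = np.array([y_pred[b].mean() for b in bins])
    mean_obs = np.array([y_true[b].mean() for b in bins])
    return mean_pred, mean_obs

# A well-calibrated toy model: observed counts drawn with the predicted mean.
rng = np.random.RandomState(0)
y_pred = rng.uniform(0.0, 1.0, size=20_000)
y_true = rng.poisson(lam=y_pred).astype(float)
mean_pred, mean_obs = mean_observed_vs_predicted(y_true, y_pred)
# Plotting mean_obs against mean_pred should hug the diagonal.
```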
# The dummy regression model predicts a constant frequency. This model is not
# attribute the same tied rank to all samples but is none-the-less globally
should "is not" be "does not"?
Though I'm not sure I understand the plot for the dummy: if it always predicts the mean, why do we have multiple dots on the x-axis? I would have assumed the plot should just be a single dot?
We bin by quantile, so the content of each bin is independent of y hat (since y hat is constant) and therefore uniformly random. I agree it's an edge case, but it really highlights that this notion of probabilistic calibration and ranking power are quite independent of one another.
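This edge case can be demonstrated on synthetic data (illustrative, not the example's dataset): with a constant prediction, every quantile bin is an arbitrary subset of the data, so the per-bin observed means are all close to the global mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.RandomState(0)
X = np.zeros((10_000, 1))
y = rng.poisson(lam=0.5, size=10_000).astype(float)

dummy = DummyRegressor(strategy="mean").fit(X, y)
pred = dummy.predict(X)
assert np.ptp(pred) == 0.0  # a single constant prediction

# Quantile bins on a constant prediction are arbitrary subsets of the
# data, so every bin's observed mean is close to the global mean: the
# curve is flat, even though the model is globally well calibrated.
bins = np.array_split(np.argsort(pred, kind="stable"), 10)
bin_means = np.array([y[b].mean() for b in bins])
print(bin_means.round(2))
```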
# The sum of all predictions also confirms the calibration issue of the
# ``Ridge`` model: it under-estimates by more than 3% the total number of
# claims in the test set while the other three models can approximately recover
# the total number of claims of the test portfolio.
It's curious that the HGBDT overestimates the number of claims, much more than the PoissonRegressor, while both curves look pretty similar.
I am not sure that this difference is significant. One would need to bootstrap to get confidence intervals but this is too expensive and complicated for this example.
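A cheap version of the bootstrap alluded to above could look like this (a sketch on stand-in predictions, not the example's code; resampling only the predictions captures test-set sampling noise, whereas a full bootstrap would also refit the models, which is what makes it too expensive here):

```python
import numpy as np

# Stand-in predictions on a test set (synthetic; in the example these
# would come from model.predict(X_test)).
rng = np.random.RandomState(0)
y_pred = rng.gamma(shape=1.0, scale=0.1, size=5_000)

# Resample test rows with replacement and recompute the total predicted
# number of claims to get a percentile confidence interval.
n_boot = 1_000
totals = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.randint(0, len(y_pred), size=len(y_pred))
    totals[i] = y_pred[idx].sum()
low, high = np.percentile(totals, [2.5, 97.5])
print(f"total = {y_pred.sum():.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```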
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
This was a fun read. Thank you @ogrisel !
tagging for inclusion #17010
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com> Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
This rewrite is motivated by the following: