DOC Improve claim prediction example #16648


Merged
merged 22 commits into from
Apr 28, 2020

Conversation

ogrisel
Member

@ogrisel ogrisel commented Mar 5, 2020

This rewrite is motivated by the following:

  • use headers to split the content into sections;
  • simplify the data-loading, display the data-frame content (an excerpt) and histograms for the target variable;
  • use HistGradientBoostingRegressor (with the default least squares loss) now that it supports sample weights: it is more accurate and much faster than RandomForestRegressor and makes it possible to use the full dataset;
  • break up the top-level import block to progressively introduce scikit-learn components where they are first needed;
  • contrast calibration and discriminative power of the models in the narrative;
  • add a take-aways section at the end.

@ogrisel ogrisel requested a review from rth March 5, 2020 18:28
@ogrisel
Member Author

ogrisel commented Mar 5, 2020

Ping @lorentzenchr.

@ogrisel ogrisel changed the title Improve claim prediction example [MRG] Improve claim prediction example Mar 6, 2020
@ogrisel ogrisel added this to the 0.23 milestone Mar 6, 2020
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
Member

In the original PR, we had a short discussion about whether to have all imports at the very beginning or each import where it is first needed. I'm fine with both approaches.

Member Author

I really prefer to split large import blocks for long examples. Otherwise the reader has to scroll up a lot to find out where a function / class comes from.


gbrt = Pipeline([
("preprocessor", tree_preprocessor),
("regressor", HistGradientBoostingRegressor(max_leaf_nodes=128)),
Member

With min_weight_fraction_leaf=0.01, we could avoid non-positive predictions of the random forest. Could we try the same for the HGBR, in this case with min_samples_leaf=...?

Member Author

Because max_leaf_nodes is limited and the number of trees is very large, I don't think this is going to change anything. Also, now that the Poisson loss and its log link are enabled, we don't get any negative predictions from HGBRT anymore.

Comment on lines 531 to 554
# - The non-linear Gradient Boosting Regression Trees model does not seem to
# suffer from significant mis-calibration issues (despite the use of a least
# squares loss).
Member

Just as an explanation: For large sample sizes, the different losses do not matter much (as long as they are strictly consistent for the expected value). Without interactions and other cheerful feature engineering for linear models, the tree based models will win.

Member Author

I agree. I even doubt that it's possible to cheerfully feature engineer enough to compete with the tree based models when n_samples > 1e5.

Member

Maybe not on this and most other datasets, but I have seen otherwise :smirk:

Member Author

Would love to be proven wrong ;)

Member

@adrinjalali adrinjalali left a comment

Awesome. Looks pretty good. Could you also make the plots colorblind friendly?
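
One low-effort option (an illustration, not necessarily what the PR ended up doing) is matplotlib's built-in colorblind-safe style:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt

# "tableau-colorblind10" restyles the default color cycle with a
# palette designed to stay distinguishable for colorblind readers.
plt.style.use("tableau-colorblind10")
fig, ax = plt.subplots()
for i, label in enumerate(["Ridge", "Poisson", "HGBRT", "Dummy"]):
    ax.plot([0, 1], [0, i], label=label)
ax.legend()
```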

Comment on lines +28 to 31
Our goal is to predict the expected frequency of claims following car accidents
for a new policyholder given the historical data over a population of
policyholders.

Member

I feel like this is an example of where I'd rather not see our library used lol (it also has a lot of ethical concerns, bias etc, which we don't tackle).

Member

Not that we should remove it really, just saying I personally probably would :D

Member

Let us please agree on not having that discussion here. I understand and admire your point of view and would be very glad to talk with you about this controversial (and never ending) subject. Lastly, note that neither gender nor race are in the data. :no_fitting_emoji:

Member

The issue is that the data is biased and we can't even measure that bias because we don't even have the features we need (like race and gender). Not using those features doesn't mean we're not going to have biased model discriminating against a certain group, and I'm very very worried about putting an example out there which people would then use as a reference to [unintentionally] discriminate against people. @romanlutz, do you happen to have a good example for this one?

Member

@rth rth Mar 16, 2020

I'm all for better examples for controlling for bias, but I also don't think this PR is the right place for this discussion. It merely refactors an existing example; we should have this discussion in a separate issue.

As a side note, I imagine there could indeed be some sample selection bias in the data (i.e. the company chooses its customers), however the target variable (frequency or cost of accidents) shouldn't be too biased, I think? At least significantly less biased than in other examples such as https://scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html for predicting wage. Also, the pricing policy of an insurance company doesn't directly impact how often one has accidents, so at least there is no immediate feedback loop with the training data. I'm not an expert on this, and there are likely other effects, but I'm just saying that we should discuss it in a separate issue.

Contributor

I don't disagree with the suggestions to discuss this elsewhere. For more information on what @adrinjalali is referring to I recommend "Big Data's Disparate Impact" by Barocas and Selbst. file:///C:/Users/rolutz/AppData/Local/Temp/SSRN-id2477899.pdf section I.D. The entire piece is actually relevant for such a scenario, but that's the section that discusses excluding sensitive features such as race & gender. Other than that there's still potential bias in how the data got collected (I.A, I.B, I.C).

I think there's value in acknowledging such potential shortcomings so that people don't assume that it's the best (or only) way to approach the task. We wouldn't want users to end up on my list of questionable or unethical use cases. With the acknowledgment it should be clear that it's just a demonstration of scikit-learn. At least that's my point of view :-) Thanks for asking!

Member

@adrinjalali and @romanlutz I opened #16715 if you want to address this topic further.

ridge__sample_weight=df_train["Exposure"])
ridge = Pipeline([
("preprocessor", linear_model_preprocessor),
("regressor", Ridge(alpha=1e-6)),
Member

that's some large change in alpha, interesting!

#
# - The least squares loss of the Ridge regression model seem to cause this
# model to be badly calibrated. In particular it tends to under estimate the
# risk and can even predict invalid negative frequencies...
Member

...?

Member

Also, it's not the squared error, but rather the linearity assumptions that cause negative predictions.

Member Author

The tree-based model does not make the linearity assumption between E[y] and X and can still predict negative values. But I will mention the implicit identity link function of Ridge.
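
The link-function contrast can be seen in a few lines (synthetic data; a sketch rather than the example's actual setup):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
# Non-negative count target with many zeros, like claim frequencies.
y = rng.poisson(np.exp(0.5 * X[:, 0])).astype(float)

# Ridge has an implicit identity link: nothing constrains its output to
# be non-negative, so it can predict invalid negative frequencies.
ridge_pred = Ridge(alpha=1e-6).fit(X, y).predict(X)
# PoissonRegressor has a log link: predictions are always positive.
poisson_pred = PoissonRegressor().fit(X, y).predict(X)
```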


df = fetch_openml(data_id=41214, as_frame=True).frame
Member

this is raising

/home/circleci/project/sklearn/datasets/_openml.py:655: UserWarning: Version 1 of dataset freMTPL2freq is inactive, meaning that issues have been found in the dataset. Try using a newer version from this URL: https://www.openml.org/data/v1/download/20649148/freMTPL2freq.arff

Member Author

Yes but unfortunately there is nothing we can do about it apart from contacting the original uploader. Let me do that.

@rth
Member

rth commented Mar 10, 2020

Thanks! Please ping me once the first batch of comments above is addressed/responded to.

@lorentzenchr
Member

@ogrisel Once #16692 gets merged, we could use HistGradientBoostingRegressor(loss='poisson') in this example. (The poor, good old, so-far interaction-free GLM!!!:smirk:)

@ogrisel
Member Author

ogrisel commented Apr 27, 2020

@ogrisel Once #16692 gets merged, we could use HistGradientBoostingRegressor(loss='poisson') in this example. (The poor, good old, so-far interaction-free GLM!!!:smirk:)

That's my plan. The narrative will need an update though. I wonder if it's worth keeping HistGradientBoostingRegressor with the least squares loss as well or if that would make the example too crowded.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

I tried fitting HGBRT with both the LS and Poisson losses, and their calibration curves, Lorenz curves and Gini indices are almost indistinguishable.

However, as expected, the HGBRT model with log link and Poisson loss does not produce negative predictions and gets a significantly better Poisson deviance. I will just keep the latter to avoid having to reorganize the plot layouts of the example.

@ogrisel ogrisel force-pushed the doc-poisson-claim-example branch from 4a919b0 to 1fe4fb1 Compare April 28, 2020 10:24
@ogrisel
Member Author

ogrisel commented Apr 28, 2020

@rth @adrinjalali @NicolasHug @lorentzenchr I think this is ready for final review.

It would be great to have it in 0.23 so that the release notes can point to at least one example showing how to use the new Poisson loss of HGBRT.

@adrinjalali
Member

lol you know I'm very conflicted about this one and how I feel about an example which can be directly used to put people with disabilities, pre-existing conditions, and certain genetic backgrounds in a direct disadvantage by just changing the dataset from auto industry to healthcare.

I also know y'all have put so much effort in the Poisson loss in this release and how important it is to you. So I'll abstain on this one and probably try to replace it with a less harmful example later.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

Just changing the dataset would not magically fix the ethical issues of applying machine learning models. I agree it would be good to discuss those ethical issues either in this example or in another one and cross-link this example to it. But in any case, to discuss the fairness issues of applying machine learning to insurance pricing, marketing and risk evaluation of a portfolio, it helps to technically understand how those models are typically built, and I think this example has very strong pedagogical value in that respect.

@rth
Member

rth commented Apr 28, 2020

I agree it would be good to discuss those ethical issues either in this example or in another one and cross-link this example to it.

We opened #16715 to discuss it. As mentioned there, I don't think this discussion is specific to this example, so it shouldn't happen here. Besides, this PR is about improving the example, so if we don't merge it, as @adrinjalali seems to be suggesting, we will still have this example in the released docs, just with worse wording.

I would rather merge this, then add comments to examples in a follow up PR per #16715 where we find it necessary. Will try to re-review later today.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

But as soon as we have good documentation / examples about how to think about and treat fairness issues in scikit-learn, we should definitely update this example to cross-link to it.

Member

@NicolasHug NicolasHug left a comment

Thanks @ogrisel

A few comments but LGTM when addressed

I'm just a bit confused by the use of "calibration" which seems to mean a different thing

# ``PoissonRegressor`` thanks to the flexibility of the trees combined with
# the large number of training samples.
#
# Evaluation of the calibration of predictions
Member

Do the curves that are plotted below have a specific name?

Surely they are different from calibration curves as we have them in the UG?

Member Author

@ogrisel ogrisel Apr 28, 2020

The calibration curves currently in the user guide are only defined for classification problems. They are similar to those in this example (plotting the mean observed value per prediction bin) when using the quantile binning strategy.

It would make sense to extend the calibration_curve tool to support regression but this should be done in a dedicated PR.
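
For illustration, a regression analogue of calibration_curve with the quantile binning strategy could look like this; `regression_calibration_curve` is a hypothetical helper, not a scikit-learn API:

```python
import numpy as np

def regression_calibration_curve(y_true, y_pred, n_bins=10):
    # Bin samples by quantiles of the predictions, then compare the mean
    # observed value with the mean predicted value in each bin.
    order = np.argsort(y_pred)
    mean_pred, mean_true = [], []
    for bin_idx in np.array_split(order, n_bins):
        mean_pred.append(y_pred[bin_idx].mean())
        mean_true.append(y_true[bin_idx].mean())
    return np.array(mean_pred), np.array(mean_true)

rng = np.random.RandomState(0)
y_pred = rng.uniform(0, 1, size=1000)
y_true = y_pred + rng.normal(scale=0.1, size=1000)  # well-calibrated toy case
mean_pred, mean_true = regression_calibration_curve(y_true, y_pred)
```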

Comment on lines 427 to 428
# The dummy regression model predicts a constant frequency. This model is not
# attribute the same tied rank to all samples but is none-the-less globally
Member

should "is not" be "does not"?

Though I'm not sure I understand the plot for the dummy: if it always predicts the mean, why do we have multiple dots on the x-axis? I would have assumed the plot should just be a single dot?

Member Author

We bin by quantile, so the content of each bin is independent of y hat (since y hat is constant) and therefore uniformly random.

I agree it's an edge case. But it really highlights that probabilistic calibration and ranking power are quite independent notions.

# The sum of all predictions also confirms the calibration issue of the
# ``Ridge`` model: it under-estimates by more than 3% the total number of
# claims in the test set while the other three models can approximately recover
# the total number of claims of the test portfolio.
Member

It's curious that the HGBRT overestimates the number of claims much more than the PoissonRegressor does, while both curves look pretty similar.

Member Author

I am not sure that this difference is significant. One would need to bootstrap to get confidence intervals but this is too expensive and complicated for this example.
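
For reference, the bootstrap in question is conceptually simple even if too costly for the example itself; a sketch with made-up per-policy predictions:

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up per-policy predicted claim counts on a test set.
pred = rng.poisson(0.1, size=5000).astype(float)

# Resample the test set with replacement and recompute the total number
# of predicted claims to gauge the sampling noise on that total.
totals = np.array([
    pred[rng.randint(0, len(pred), size=len(pred))].sum()
    for _ in range(200)
])
low, high = np.percentile(totals, [2.5, 97.5])
```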

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
Member

@thomasjpfan thomasjpfan left a comment

This was a fun read. Thank you @ogrisel !

@thomasjpfan thomasjpfan changed the title [MRG] Improve claim prediction example DOC Improve claim prediction example Apr 28, 2020
@thomasjpfan thomasjpfan merged commit 2dd12af into scikit-learn:master Apr 28, 2020
@thomasjpfan
Member

tagging for inclusion #17010

adrinjalali pushed a commit that referenced this pull request Apr 30, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>