DOC Improve claim prediction example #16648


Merged
merged 22 commits into from
Apr 28, 2020

Conversation

ogrisel
Member

@ogrisel ogrisel commented Mar 5, 2020

This rewrite is motivated by the following:

  • use headers to split the content into sections;
  • simplify the data-loading, display the data-frame content (an excerpt) and histograms for the target variable;
  • use HistGradientBoostingRegressor (with the default least squares loss) now that it supports sample weights: it is more accurate and much faster than RandomForestRegressor and makes it possible to use the full dataset;
  • break up the top-level import block to progressively introduce scikit-learn components where they are first needed;
  • contrast calibration and discriminative power of the models in the narrative;
  • add a take-aways section at the end.

@ogrisel ogrisel requested a review from rth March 5, 2020 18:28
@ogrisel
Member Author

ogrisel commented Mar 5, 2020

Ping @lorentzenchr.

@ogrisel ogrisel changed the title Improve claim prediction example [MRG] Improve claim prediction example Mar 6, 2020
@ogrisel ogrisel added this to the 0.23 milestone Mar 6, 2020
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
Member

In the original PR, we had a short discussion about whether to have all imports at the very beginning or each import where it is first needed. I'm fine with both approaches.

Member Author

I really prefer to split large import blocks for long examples. Otherwise the reader has to scroll up a lot to find out where a function / class comes from.


gbrt = Pipeline([
("preprocessor", tree_preprocessor),
("regressor", HistGradientBoostingRegressor(max_leaf_nodes=128)),
Member

With min_weight_fraction_leaf=0.01, we could avoid non-positive predictions of the random forest. Could we try the same for the HGBR, in this case with min_samples_leaf=...?

Member Author

Because max_leaf_nodes is limited and the number of trees is very large, I don't think this is going to change anything. Also, now that the Poisson loss and its log link are enabled, we don't get any negative predictions from HGBRT anymore.

Comment on lines 531 to 554
# - The non-linear Gradient Boosting Regression Trees model does not seem to
# suffer from significant mis-calibration issues (despite the use of a least
# squares loss).
Member

Just as an explanation: For large sample sizes, the different losses do not matter much (as long as they are strictly consistent for the expected value). Without interactions and other cheerful feature engineering for linear models, the tree based models will win.

Member Author

I agree. I even doubt that it's possible to cheerfully feature engineer enough to compete with the tree based models when n_samples > 1e5.

Member

Maybe not on this and most other datasets, but I have seen otherwise :smirk:

Member Author

Would love to be proven wrong ;)

Member

@adrinjalali adrinjalali left a comment

Awesome. Looks pretty good. Could you also make the plots colorblind friendly?
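
One low-effort option (an illustration, not necessarily what the PR ended up doing) is matplotlib's built-in colorblind-safe style:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt

# "tableau-colorblind10" restyles the default color cycle with a
# palette designed to stay distinguishable for colorblind readers.
plt.style.use("tableau-colorblind10")
fig, ax = plt.subplots()
for i, label in enumerate(["Ridge", "Poisson", "HGBRT", "Dummy"]):
    ax.plot([0, 1], [0, i], label=label)
ax.legend()
```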

Comment on lines +28 to 31
Our goal is to predict the expected frequency of claims following car accidents
for a new policyholder given the historical data over a population of
policyholders.

Member

I feel like this is an example of where I'd rather not see our library used lol (it also has a lot of ethical concerns, bias etc, which we don't tackle).

Member

Not that we should remove it really, just saying I personally probably would :D

Member

Let us please agree on not having that discussion here. I understand and admire your point of view and would be very glad to talk with you about this controversial (and never ending) subject. Lastly, note that neither gender nor race are in the data. :no_fitting_emoji:

Member

The issue is that the data is biased and we can't even measure that bias because we don't even have the features we need (like race and gender). Not using those features doesn't mean we're not going to have biased model discriminating against a certain group, and I'm very very worried about putting an example out there which people would then use as a reference to [unintentionally] discriminate against people. @romanlutz, do you happen to have a good example for this one?

Member

@rth rth Mar 16, 2020

I'm all for better examples for controlling for bias, but I also don't think this PR is the right place for this discussion. It merely refactors an existing example; we should have this discussion in a separate issue.

As a side note, I imagine there could indeed be some sample selection bias in the data (i.e. the company chooses its customers), however the target variable (frequency or cost of accidents) shouldn't be too biased, I think? At least significantly less biased than in other examples such as https://scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html for predicting wage. Also, the pricing policy of an insurance company doesn't directly impact how often one has accidents, so at least there is no immediate feedback loop with the training data. I'm not an expert on this, and there are likely other effects, but I'm just saying that we should discuss it in a separate issue.

Contributor

I don't disagree with the suggestions to discuss this elsewhere. For more information on what @adrinjalali is referring to I recommend "Big Data's Disparate Impact" by Barocas and Selbst. file:///C:/Users/rolutz/AppData/Local/Temp/SSRN-id2477899.pdf section I.D. The entire piece is actually relevant for such a scenario, but that's the section that discusses excluding sensitive features such as race & gender. Other than that there's still potential bias in how the data got collected (I.A, I.B, I.C).

I think there's value in acknowledging such potential shortcomings so that people don't assume that it's the best (or only) way to approach the task. We wouldn't want users to end up on my list of questionable or unethical use cases. With the acknowledgment it should be clear that it's just a demonstration of scikit-learn. At least that's my point of view :-) Thanks for asking!

Member

@adrinjalali and @romanlutz I opened #16715 if you want to address this topic further.

ridge__sample_weight=df_train["Exposure"])
ridge = Pipeline([
("preprocessor", linear_model_preprocessor),
("regressor", Ridge(alpha=1e-6)),
Member

that's some large change in alpha, interesting!

#
# - The least squares loss of the Ridge regression model seem to cause this
# model to be badly calibrated. In particular it tends to under estimate the
# risk and can even predict invalid negative frequencies...
Member

...?

Member

Also, it's not the squared error, but rather the linearity assumptions that cause negative predictions.

Member Author

The tree-based model does not make the linearity assumption between E[y] and X and can still predict negative values. But I will mention the implicit identity link function of Ridge.
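
The link-function contrast can be seen in a few lines (synthetic data; a sketch rather than the example's actual setup):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
# Non-negative count target with many zeros, like claim frequencies.
y = rng.poisson(np.exp(0.5 * X[:, 0])).astype(float)

# Ridge has an implicit identity link: nothing constrains its output to
# be non-negative, so it can predict invalid negative frequencies.
ridge_pred = Ridge(alpha=1e-6).fit(X, y).predict(X)
# PoissonRegressor has a log link: predictions are always positive.
poisson_pred = PoissonRegressor().fit(X, y).predict(X)
```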


df = fetch_openml(data_id=41214, as_frame=True).frame
Member

this is raising

/home/circleci/project/sklearn/datasets/_openml.py:655: UserWarning: Version 1 of dataset freMTPL2freq is inactive, meaning that issues have been found in the dataset. Try using a newer version from this URL: https://www.openml.org/data/v1/download/20649148/freMTPL2freq.arff

Member Author

Yes but unfortunately there is nothing we can do about it apart from contacting the original uploader. Let me do that.

@rth
Member

rth commented Mar 10, 2020

Thanks! Please ping me once the first batch of comments above is addressed/responded to.

@lorentzenchr
Member

@ogrisel Once #16692 gets merged, we could use HistGradientBoostingRegressor(loss='poisson') in this example. (The poor, good old, so-far interaction-free GLM!!!:smirk:)

@ogrisel
Member Author

ogrisel commented Apr 27, 2020

@ogrisel Once #16692 gets merged, we could use HistGradientBoostingRegressor(loss='poisson') in this example. (The poor, good old, so-far interaction-free GLM!!!:smirk:)

That's my plan. The narrative will need an update though. I wonder if it's worth keeping HistGradientBoostingRegressor with the least squares loss as well or if that would make the example too crowded.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

I tried fitting HGBRT with both the LS and Poisson losses, and their calibration curves, Lorenz curves and Gini indices are almost indistinguishable.

However, as expected, the HGBRT model with log link and Poisson loss does not produce negative predictions and gets a significantly better Poisson deviance. I will just keep the latter to avoid having to reorganize the plot layouts of the example.

@ogrisel ogrisel force-pushed the doc-poisson-claim-example branch from 4a919b0 to 1fe4fb1 Compare April 28, 2020 10:24
@ogrisel
Member Author

ogrisel commented Apr 28, 2020

@rth @adrinjalali @NicolasHug @lorentzenchr I think this is ready for final review.

It would be great to have it in 0.23 so that the release notes can point to at least one example showing how to use the new Poisson loss of HGBRT.

@adrinjalali
Member

lol you know I'm very conflicted about this one and how I feel about an example which can be directly used to put people with disabilities, pre-existing conditions, and certain genetic backgrounds in a direct disadvantage by just changing the dataset from auto industry to healthcare.

I also know y'all have put so much effort in the Poisson loss in this release and how important it is to you. So I'll abstain on this one and probably try to replace it with a less harmful example later.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

Just changing the dataset would not magically fix the ethical issues of applying machine learning models. I agree it would be good to discuss those ethical issues either in this example or in another one and cross-link this example to it. But in any case, to discuss the fairness issues of applying machine learning to insurance pricing, marketing and risk evaluation of a portfolio, it helps to technically understand how those models are typically built, and I think this example has very strong pedagogical value in that respect.

@rth
Member

rth commented Apr 28, 2020

I agree it would be good to discuss those ethical issues either in this example or in another one and cross-link this example to it.

We opened #16715 to discuss it. As mentioned there, I don't think this discussion is specific to this example, so it shouldn't happen here. Besides, this PR is about improving the example, so if we don't merge it, as @adrinjalali seems to be suggesting, we will still have this example in the released docs, just with worse wording.

I would rather merge this, then add comments to examples in a follow up PR per #16715 where we find it necessary. Will try to re-review later today.

@ogrisel
Member Author

ogrisel commented Apr 28, 2020

But as soon as we have good documentation / examples about how to think about and treat fairness issues in scikit-learn, we should definitely update this example to cross-link to it.

Member

@NicolasHug NicolasHug left a comment

Thanks @ogrisel

A few comments but LGTM when addressed

I'm just a bit confused by the use of "calibration" which seems to mean a different thing

# ``PoissonRegressor`` thanks to the flexibility of the trees combined with
# the large number of training samples.
#
# Evaluation of the calibration of predictions
Member

Do the curves that are plotted below have a specific name?

Surely they are different from calibration curves as we have them in the UG?

Member Author

@ogrisel ogrisel Apr 28, 2020

The calibration curves currently in the user guide are only defined for classification problems. They are similar to those in this example (plotting the mean observed value per prediction bin) when using the quantile binning strategy.

It would make sense to extend the calibration_curve tool to support regression but this should be done in a dedicated PR.
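
For illustration, a regression analogue of calibration_curve with the quantile binning strategy could look like this; `regression_calibration_curve` is a hypothetical helper, not a scikit-learn API:

```python
import numpy as np

def regression_calibration_curve(y_true, y_pred, n_bins=10):
    # Bin samples by quantiles of the predictions, then compare the mean
    # observed value with the mean predicted value in each bin.
    order = np.argsort(y_pred)
    mean_pred, mean_true = [], []
    for bin_idx in np.array_split(order, n_bins):
        mean_pred.append(y_pred[bin_idx].mean())
        mean_true.append(y_true[bin_idx].mean())
    return np.array(mean_pred), np.array(mean_true)

rng = np.random.RandomState(0)
y_pred = rng.uniform(0, 1, size=1000)
y_true = y_pred + rng.normal(scale=0.1, size=1000)  # well-calibrated toy case
mean_pred, mean_true = regression_calibration_curve(y_true, y_pred)
```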

Comment on lines 427 to 428
# The dummy regression model predicts a constant frequency. This model is not
# attribute the same tied rank to all samples but is none-the-less globally
Member

should "is not" be "does not"?

Though I'm not sure I understand the plot for the dummy: if it always predicts the mean, why do we have multiple dots on the x-axis? I would have assumed the plot should just be a single dot?

Member Author

We bin by quantile, so the content of each bin is independent of y hat (since y hat is constant) and therefore uniformly random.

I agree it's an edge case. But it really highlights that probabilistic calibration and ranking power are quite independent notions.

# The sum of all predictions also confirms the calibration issue of the
# ``Ridge`` model: it under-estimates by more than 3% the total number of
# claims in the test set while the other three models can approximately recover
# the total number of claims of the test portfolio.
Member

It's curious that the HGBRT overestimates the number of claims much more than the PoissonRegressor does, while both curves look pretty similar.

Member Author

I am not sure that this difference is significant. One would need to bootstrap to get confidence intervals but this is too expensive and complicated for this example.
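
For reference, the bootstrap in question is conceptually simple even if too costly for the example itself; a sketch with made-up per-policy predictions:

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up per-policy predicted claim counts on a test set.
pred = rng.poisson(0.1, size=5000).astype(float)

# Resample the test set with replacement and recompute the total number
# of predicted claims to gauge the sampling noise on that total.
totals = np.array([
    pred[rng.randint(0, len(pred), size=len(pred))].sum()
    for _ in range(200)
])
low, high = np.percentile(totals, [2.5, 97.5])
```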

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
Member

@thomasjpfan thomasjpfan left a comment

This was a fun read. Thank you @ogrisel !

@thomasjpfan thomasjpfan changed the title [MRG] Improve claim prediction example DOC Improve claim prediction example Apr 28, 2020
@thomasjpfan thomasjpfan merged commit 2dd12af into scikit-learn:master Apr 28, 2020
@thomasjpfan
Member

tagging for inclusion #17010

adrinjalali pushed a commit that referenced this pull request Apr 30, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-Authored-By: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>