DOC add guideline for choosing a scoring function #11430


Merged · 28 commits · Nov 25, 2024

Conversation

@lorentzenchr (Member)

Solves issue #10584.
Adds the new chapter "3.3.2. Which scoring function should I use?" to the user guide.

@jnothman (Member)

jnothman commented Jul 4, 2018

I've not had time to look at this in detail. While I think it is useful, it does not yet connect to what metrics are in the library and how they are described. So in that sense it is impractical.

@lorentzenchr (Member Author)

What do you mean by "it does not connect to what metrics are in the library"? Knowing what is wrong so far would help me improve it or make it right.

@amueller (Member)

I think what @jnothman means is that you're not connecting the theory to the functions and options in the library.
I'm also not sure if strictly consistent is really the important thing to think about.
In the real world, I don't think people think about the functional for aggregation, but rather about the importance of different classes and the importance of precision and recall.

@lorentzenchr (Member Author)

lorentzenchr commented Aug 1, 2018

The point is to encourage people, in their daily work, to think about what they want to predict and the purpose of the prediction. Depending on that, the use of a strictly consistent scoring function is - from the point of view of statistical decision theory - strongly recommended because it guarantees that

> truth telling is an optimal strategy in expectation

(Probabilistic Forecasting, Annual Review of Statistics and Its Application)

Example: If the goal of a regression problem is the mean, comparing different predictions for the mean using mean absolute error would be plain wrong, in general.

I don't know if a PR is the right place for a discussion or whether or not you are interested in this topic.
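To make the mean-vs-MAE point concrete, here is a minimal numeric sketch (illustrative only, not part of the PR; it assumes only numpy): for a right-skewed target, MAE is minimized by the median while squared error is minimized by the mean, so grading mean predictions with MAE rewards the wrong predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Right-skewed target: for a lognormal, mean > median.
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

pred_mean = y.mean()        # optimal constant prediction for the mean functional
pred_median = np.median(y)  # optimal constant prediction for the median functional

mae = lambda c: np.abs(y - c).mean()
mse = lambda c: ((y - c) ** 2).mean()

# MAE (consistent for the median) prefers the median prediction...
assert mae(pred_median) < mae(pred_mean)
# ...while squared error (strictly consistent for the mean) prefers the mean.
assert mse(pred_mean) < mse(pred_median)
```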

@jjerphan (Member) left a comment

Thank you @lorentzenchr for this contribution.

It is indeed useful to have some documentation regarding those scoring functions, educating users on the most appropriate ones for their use cases (:+1: for me).

I think this PR can be further improved after #19088 gets approved.

@jjerphan (Member)

Now that #20567 got merged, I think it is worth reviewing this PR.

@jmloyola and @ArturoAmorQ: would you be interested in reviewing this contribution?
I think this can be a way to get you started with Sphinx, building the documentation and reStructured text. I can help if needed.

Would any participant here be interested as well?

@ArturoAmorQ (Member)

> @jmloyola and @ArturoAmorQ: would you be interested in reviewing this contribution? I think this can be a way to get you started with Sphinx, building the documentation and reStructured text. I can help if needed.

Sure, I have my eyes on it, thanks!

@lorentzenchr (Member Author)

lorentzenchr commented Jan 18, 2022

On one hand, it's great if this PR gets appreciated. On the other hand, 2.5 years after opening it, I might rewrite some parts of it. In particular, today I would add the connection between consistent scoring functions and calibration.
...and I won't get to it before February.

@jjerphan (Member)

jjerphan commented Apr 7, 2022

@lorentzenchr: do you still want and do you have time to work on this?

@lorentzenchr (Member Author)

I'm not so sure about the example with the apple blossoms. It's more an example to emphasize probabilistic predictions instead of showing how to select a good scoring function.

@cmarmo (Contributor) left a comment

Thanks @lorentzenchr, I have suggested some fixes for the sphinx warnings and the rendering.
Besides these details, I have a question about the use of strictly consistent in the text.
The title of the subsection 3.3.2.1 says *strictly consistent*: is *strictly* necessary, meaning, is it related to a mathematical definition? If yes, is it possible to have the definition at the beginning of the section and make the subsection title *... consistent*? Otherwise, I guess *strictly* could be dropped.
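For readers wondering about the definition alluded to here (my paraphrase of the standard one from the decision-theory literature, not text from the PR): a scoring function $S$ is *consistent* for a functional $T$ (such as the mean or a quantile) of the distribution $F$ of $Y$ if predicting the true functional value is optimal in expectation,

```latex
\mathbb{E}_{Y \sim F}\!\left[ S\bigl(T(F),\, Y\bigr) \right]
\;\le\;
\mathbb{E}_{Y \sim F}\!\left[ S(x,\, Y) \right]
\quad \text{for all predictions } x,
```

and *strictly* consistent if equality holds only for $x = T(F)$.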

@lorentzenchr (Member Author)

TBH, now that I have published https://arxiv.org/abs/2202.12780, I don't know if it's worth finishing this PR. The paper really nails down what I intended to convey with this PR. As I'm one of the authors, my point of view is certainly biased. I'd be interested in what others think on how best to proceed:

  1. Keep current status (close this PR unmerged)
  2. Finish this PR or do a similar PR
  3. Just reference published work

@jjerphan (Member)

jjerphan commented Aug 5, 2022

I would lean toward options 2. and 3., as I think most users only stick to the documentation and (unfortunately) do not read research work and publications.

In that sense, we could have a distilled version in this PR illustrated with scikit-learn, and then indicate that the foundations, motivation and practices it exposes are covered in detail in arxiv.org/abs/2202.12780.

What do you think?

@jjerphan (Member)

jjerphan commented Nov 2, 2024

@scikit-learn/core-devs and @scikit-learn/documentation-team: are some people interested in having this small methodological contribution integrated in the documentation?

I can re-dedicate some time to it with someone else interested.

@jjerphan (Member)

jjerphan commented Nov 3, 2024

I think referencing Strictly Proper Scoring Rules, Prediction, and Estimation from Tilmann Gneiting & Adrian E. Raftery (especially sections 3, 4, 7 and 9) is relevant.

@ArturoAmorQ (Member)

Hi @lorentzenchr, are you interested in finishing this PR? Back in the day @jjerphan and myself leaned towards a mix of options 2 and 3 from your comment. You also didn't react to @cmarmo's comment. Would you rather have some of us take over the PR?

@lorentzenchr (Member Author)

@StefanieSenger Does 7785fa7 make it more tangible for you?

@lorentzenchr lorentzenchr added this to the 1.6 milestone Nov 12, 2024
@lorentzenchr lorentzenchr changed the title [MRG] DOC add guideline for choosing a scoring function DOC add guideline for choosing a scoring function Nov 12, 2024
@StefanieSenger (Contributor) left a comment

@lorentzenchr I really like the new Fictitious Example paragraph! It makes everything much clearer and will also make the whole new section a more enjoyable read (it is quite entertaining anyway if you manage not to get stuck).

I have checked this on the rendered docs and run spelling correction, both of which were quite fruitful. I have also left some other comments, but nothing big, and I am really content with the current state.

I don't have the energy to write anything longer. I am content with having this small piece of it in the scikit-learn user guide. Others can later improve it as they wish.

Your work already is a great improvement and will enable many people to use scikit-learn more consciously.

BTW: I guess you might like the summary on page 4 and the Figure 1 on page 5 in https://arxiv.org/abs/2202.12780.

Hehe, now I can appreciate this summary, but be assured that a few days ago I wouldn't have known what to do with it.

This is a really productive exchange, and I'm glad I got into it.

@@ -6,6 +6,118 @@
Metrics and scoring: quantifying the quality of predictions
===========================================================

.. _which_scoring_function:

Which scoring function should I use?
Contributor comment:

That clears up this terminology for me and would also be essential to understand this new section the right way.

For classifiers, this is what :term:`predict` returns.
See also :ref:`TunedThresholdClassifierCV`.
There are many scoring functions which measure different aspects of such a
decision, most of them are covered with or derived from the :func:`confusion_matrix`.
Contributor comment:

Okay, and in our confusion_matrix docs there is a link to the Wikipedia page and in it there is this.
It's quite far and you can find it only by coincidence or if you know that such tables exist (but then you don't need to necessarily see it in order to know what is meant).
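As a minimal sketch of what "covered with or derived from the confusion matrix" means in practice (hypothetical labels, not from the PR): precision and recall fall straight out of the matrix entries.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# Rows are true classes, columns predicted classes: [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision and recall are derived directly from the matrix entries.
assert precision_score(y_true, y_pred) == tp / (tp + fp)
assert recall_score(y_true, y_pred) == tp / (tp + fn)
```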

response variable :math:`Y` (conditionally on :math:`X`).

Once that is settled, use a **strictly consistent** scoring function for that
(target) functional, see [Gneiting2009]_.
Contributor comment:

If references are used like this: "Sentence, see [Gneiting2009]_." I would expect to get a page number as well. Do we do this?

Member Author reply:

Not that I'm aware of. The page numbers may also vary with versions of papers.

@lorentzenchr (Member Author)

> I am really content with the current state.

@StefanieSenger Do you then want to approve the PR as a reviewer?
It's ready to be merged in my opinion.

@StefanieSenger (Contributor) left a comment

> @StefanieSenger Do you then want to approve the PR as a reviewer?
> It's ready to be merged in my opinion.

I've reviewed readability, not content, which I cannot judge.
Since my review doesn't count anyway, I will still do it for fun. :)

Edit: To avoid misunderstandings: what I actually wanted to say is that I would normally not approve a PR that I cannot fully check (what I had done here was a partial review of readability). Since I was asked to approve, I did, but noted verbally that it wouldn't be fully legit.

@ArturoAmorQ (Member)

> Since my review doesn't count anyway, I will still do it for fun

I would argue that the decision-making rules on documentation are a bit ambiguous in defining what is a minor or major contribution. If this contribution is something in between, maybe we can consider the two current approvals enough for a merge? (After fixing the failure in the CI, of course.)

Any of the maintainers that already participated in the discussion want to jump in?

@lorentzenchr (Member Author)

> Since my review doesn't count anyway, I will still do it for fun. :)

It counts for and matters to me.

@lorentzenchr (Member Author)

For the formalities: @jjerphan @glemaitre you did review, could you approve and merge?

@lucyleeow (Member) left a comment

This is a nice addition, just nitpicks

@jjerphan jjerphan self-requested a review November 14, 2024 08:33
@jjerphan (Member)

If I were to merge it, I would review it again first.

To be truly honest, I don't see when I would be able to get some time in the coming days since I don't have time outside working hours.

But other maintainers can go.

Stefanie: regarding your remark in #11430 (review), I value any review from non-maintainers, and people should feel free to do so. That's a necessary step towards getting further involved in the project as well.

@glemaitre (Member) left a comment

LGTM. I also think this is an improvement. I would only like a small change so that the Wikipedia article appears as a reference as well.

Comment on lines +122 to +125
.. rubric:: References

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
Member comment:

Suggested change
.. rubric:: References
.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
.. rubric:: References
.. [Wikipedia2024] Wikipedia contributors. "Scoring rule."
Wikipedia, The Free Encyclopedia.
Wikipedia, The Free Encyclopedia, 10 Nov. 2024. Web. 15 Nov. 2024.
.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`

Comment on lines +45 to +47
For classification **strictly proper scoring rules**, see
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_, coincide with strictly consistent scoring functions.
Member comment:

Suggested change
For classification **strictly proper scoring rules**, see
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_, coincide with strictly consistent scoring functions.
For classification **strictly proper scoring rules**, see
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_ [Wikipedia2024]_
and [Gneiting2007]_, coincide with strictly consistent scoring functions.
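For readers new to the term in this passage, a small numeric sketch of the "strictly proper" property (illustrative values, not from the diff; assumes only numpy): the expected log loss is uniquely minimized by reporting the true probability, i.e. honest forecasts score best.

```python
import numpy as np

# True event probability (assumed for illustration).
p_true = 0.8
candidates = np.linspace(0.05, 0.95, 19)  # forecasts 0.05, 0.10, ..., 0.95

# Expected log loss of forecasting q when the event occurs with probability p_true:
# E[-log score] = -(p_true * log(q) + (1 - p_true) * log(1 - q))
expected_log_loss = -(p_true * np.log(candidates)
                      + (1 - p_true) * np.log(1 - candidates))

# A strictly proper scoring rule is minimized exactly at the truth.
best = candidates[np.argmin(expected_log_loss)]
assert np.isclose(best, p_true)
```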

@StefanieSenger (Contributor)

Bringing something up after looking at this anew:

I think it is better if this new section follows the bird's-eye description of where scoring is used ("There are 3 different APIs for evaluating the quality of a model's prediction..."), which used to be the starting point of this chapter. And rightly so.

It's better to map out things first for people before going into the details.

@adrinjalali (Member) left a comment

This is a very good improvement already, let's have improvements on this in future PRs.

@adrinjalali adrinjalali merged commit a508dab into scikit-learn:main Nov 25, 2024
30 checks passed
@jjerphan (Member)

Thank you, everyone!

@ogrisel ogrisel deleted the Consistent_Scorer branch November 27, 2024 10:51
@lorentzenchr (Member Author)

> This is a very good improvement already, let's have improvements on this in future PRs.

@adrinjalali Thanks, really much appreciated!

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>