DOC add guideline for choosing a scoring function #11430
Conversation
I've not had time to look at this in detail. While I think it is useful, it does not yet connect to the metrics that are in the library and how they are described. So in that sense it is impractical.
What do you mean by "it does not connect to what metrics are in the library"? Knowing what's wrong so far would help me improve it or make it right.
I think what @jnothman means is that you're not connecting the theory to the functions and options in the library. |
The point is to encourage people, in their daily work, to think about what they want to predict and the purpose of the prediction. Depending on that, the use of a strictly consistent scoring function is, from the point of view of statistical decision theory, strongly recommended, because it guarantees that a model which predicts the target functional better also receives a better expected score.
Example: if the goal of a regression problem is the mean, comparing different predictions for the mean using mean absolute error would, in general, be plain wrong. I don't know if a PR is the right place for a discussion, or whether or not you are interested in this topic.
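To make that example concrete, here is a small hypothetical sketch (plain Python, not part of the PR) showing how MAE misranks predictors of the mean on skewed data: under MAE, a predictor of the median beats the honest predictor of the mean, even though the stated goal is the mean.

```python
import math
import random

random.seed(0)
# Skewed "ground truth": lognormal samples, whose mean exceeds the median.
y = [math.exp(random.gauss(0.0, 1.0)) for _ in range(100_000)]

true_mean = sum(y) / len(y)                 # roughly exp(0.5) ~ 1.65
true_median = sorted(y)[len(y) // 2]        # roughly 1.0

def mae(pred, y):
    """Mean absolute error of a constant prediction."""
    return sum(abs(pred - yi) for yi in y) / len(y)

# MAE is minimized by the median, so ranking models by MAE would pick
# the median predictor -- the "wrong" model if the goal is the mean.
print(mae(true_mean, y) > mae(true_median, y))  # True
```

This is exactly the failure mode the comment describes: the scoring function elicits a different functional than the one we care about.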
Thank you @lorentzenchr for this contribution.
It is indeed useful to have some pieces of documentation regarding those scoring functions, educating users on the most appropriate ones for their use cases (:+1: from me).
I think this PR can be further improved after #19088 gets approved.
Now that #20567 got merged, I think it is worth reviewing this PR. @jmloyola and @ArturoAmorQ: would you be interested in reviewing this contribution? Would any participant here be interested as well?
Sure, I have my eyes on it, thanks!
On one hand, it's great if this PR gets appreciated. On the other hand, 2.5 years after opening it, I might rewrite some parts of it. In particular, today I would add the connection between consistent scoring functions and calibration.
@lorentzenchr: do you still want, and do you have time, to work on this?
I'm not so sure about the example with the apple blossoms. It's more an example to emphasize probabilistic predictions than one showing how to select a good scoring function.
Thanks @lorentzenchr, I have suggested some fixes for the sphinx warnings and the rendering.
Besides these details, I have a question about the use of "strictly consistent" in the text.
The title of subsection 3.3.2.1 says "consistent": is "strictly" necessary, i.e. is it tied to a mathematical definition? If yes, is it possible to have the definition at the beginning of the section and make the subsection title "... consistent"? Otherwise, I guess "strictly" could be dropped.
TBH, now that I have published https://arxiv.org/abs/2202.12780, I don't know if it's worth finishing this PR. The paper really nails down what I intended to convey with this PR. As I'm one of the authors, my point of view is certainly biased. I'd be interested in what others think on how to best proceed:
I would lean toward options 2 and 3, as I think most users only stick to the documentation and (unfortunately) do not read research work and publications. In that sense, we could have a distilled version in this PR, illustrated with scikit-learn, and then indicate that the foundations, motivation, and practices are covered in detail in arxiv.org/abs/2202.12780. What do you think?
@scikit-learn/core-devs and @scikit-learn/documentation-team: are some people interested in having this small methodological contribution integrated in the documentation? I can re-dedicate some time to it with someone else interested.
I think referencing Strictly Proper Scoring Rules, Prediction, and Estimation by Tilmann Gneiting and Adrian E. Raftery (especially sections 3, 4, 7 and 9) is relevant.
Hi @lorentzenchr, are you interested in finishing this PR? Back in the day @jjerphan and myself leaned towards a mix of options 2 and 3 from your comment. You also didn't react to @cmarmo's comment. Would you rather have some of us take over the PR?
@StefanieSenger Does 7785fa7 make it more tangible for you?
@lorentzenchr I really like the new Fictitious Example paragraph! It makes everything much clearer and will also make the whole new section a more enjoyable read (it is quite entertaining anyway, if you manage not to get stuck).
I have checked this on the rendered docs and ran spell checking, both of which were quite fruitful. I have also left some other comments, but nothing big, and I am really content with the current state.
I don't have the energy to write anything longer. I am content with having this small piece of it in the scikit-learn user guide. Others can later improve it as they wish.
Your work already is a great improvement and will enable many people to use scikit-learn more consciously.
BTW: I guess you might like the summary on page 4 and the Figure 1 on page 5 in https://arxiv.org/abs/2202.12780.
Hehe, now I can appreciate this summary, but be assured that a few days ago I wouldn't have known what to do with it.
This is a really productive exchange, and I'm glad I got into it.
@@ -6,6 +6,118 @@
Metrics and scoring: quantifying the quality of predictions
===========================================================

.. _which_scoring_function:

Which scoring function should I use?
That clears up this terminology for me and would also be essential to understand this new section the right way.
doc/modules/model_evaluation.rst
For classifiers, this is what :term:`predict` returns.
See also :ref:`TunedThresholdClassifierCV`.
There are many scoring functions which measure different aspects of such a
decision, most of them are covered with or derived from the :func:`confusion_matrix`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, and in our confusion_matrix
docs there is a link to the Wikipedia page, and in it there is this table.
It's quite far away, and you can find it only by coincidence or if you already know that such tables exist (but then you don't necessarily need to see it in order to know what is meant).
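As a hypothetical illustration of the point under discussion (plain Python, not scikit-learn's implementation), many hard-decision scores really are just simple functions of the four confusion-matrix counts:

```python
# Toy labels and thresholded (hard) predictions; values are illustrative.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# The four confusion-matrix cells for the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

# Several familiar metrics derived from those counts alone.
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

Every metric above only sees the hard decisions, which is why the section distinguishes them from scores of probabilistic predictions.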
response variable :math:`Y` (conditionally on :math:`X`).

Once that is settled, use a **strictly consistent** scoring function for that
(target) functional, see [Gneiting2009]_.
If references are used like this: "Sentence, see [Gneiting2009]_." I would expect to get a page number as well. Do we do this?
Not that I'm aware of. The page numbers may also vary with versions of papers.
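A toy numeric sketch of what "strictly consistent for a functional" buys you (assumptions: plain Python, constant predictions scored over a grid): squared error is minimized at the sample mean, absolute error at the sample median, so each scoring function elicits a different target functional.

```python
# Toy sample with an outlier; mean = 3.6, median = 2.0.
y = [1.0, 2.0, 2.0, 3.0, 10.0]

def avg_score(pred, loss):
    """Average loss of a single constant prediction over the sample."""
    return sum(loss(pred, yi) for yi in y) / len(y)

# Grid of candidate constant predictions from 0.00 to 11.00.
candidates = [i / 100 for i in range(0, 1101)]

# Squared error "elicits" the mean; absolute error "elicits" the median.
best_sq = min(candidates, key=lambda p: avg_score(p, lambda p_, y_: (p_ - y_) ** 2))
best_abs = min(candidates, key=lambda p: avg_score(p, lambda p_, y_: abs(p_ - y_)))

print(best_sq)   # 3.6  (the sample mean)
print(best_abs)  # 2.0  (the sample median)
```

This is the argmin property the section relies on: a strictly consistent scoring function is uniquely minimized at the functional you actually want to predict.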
@StefanieSenger Do you then want to approve the PR as a reviewer?
@StefanieSenger Do you then want to approve the PR as a reviewer?
It's ready to be merged in my opinion.
I've reviewed readability, not content, which I cannot judge.
Since my review doesn't count anyway, I will still do it, for fun. :)
Edit: To avoid misunderstandings: what I actually wanted to say is that I would normally not approve a PR that I cannot fully check (what I had done here was a partial review on readability). Since I was asked to approve, I did, but marking verbally that it wouldn't be fully legit.
I would argue that the decision-making rules on documentation are a bit ambiguous in defining what is a minor or major contribution. If this contribution is something in between, maybe we can consider the two current approvals enough for a merge? (After fixing the failure in the CI, of course.) Do any of the maintainers that already participated in the discussion want to jump in?
It counts and matters to me.
For the formalities: @jjerphan @glemaitre you did review, could you approve and merge?
This is a nice addition, just nitpicks
If I were to merge it, I would review it again first. To be truly honest, I don't see when I would be able to get some time in the coming days, since I don't have time outside working hours. But other maintainers can go. Stefanie: regarding your remark in #11430 (review), I value any review from non-maintainers, and people should feel free to do so. That's a necessary step towards getting further involved in the project as well.
LGTM. I also think this is an improvement. I would only like a small change such that the Wikipedia article appears as a reference as well.
.. rubric:: References

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
   Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
Suggested change:

.. rubric:: References

.. [Wikipedia2024] Wikipedia contributors. "Scoring rule."
   Wikipedia, The Free Encyclopedia, 10 Nov. 2024. Web. 15 Nov. 2024.

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
   Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
For classification **strictly proper scoring rules**, see
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_, coincide with strictly consistent scoring functions.
Suggested change:

For classification **strictly proper scoring rules**, see
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_ [Wikipedia2024]_
and [Gneiting2007]_, coincide with strictly consistent scoring functions.
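A small sketch of what "strictly proper" means for the classification case discussed here (assumptions: plain Python, a binary event, forecasts scored with the Brier score): the expected score of a probability forecast q is uniquely minimized when q equals the true event probability p, so honest probabilities win.

```python
def expected_brier(p, q):
    """Expected Brier score of forecasting probability q for a binary
    event that occurs with true probability p."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.7  # true event probability (illustrative)
qs = [i / 100 for i in range(101)]  # candidate forecasts 0.00 .. 1.00

# The minimizer over all candidate forecasts is the truth itself.
best_q = min(qs, key=lambda q: expected_brier(p, q))
print(best_q)  # 0.7
```

This uniqueness at q = p is the "strictly" in strictly proper; a merely proper rule would allow other forecasts to tie.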
Bringing something up after looking at this anew: I think it is better if this new section follows the bird's-eye description of where scoring is used ("There are 3 different APIs for evaluating the quality of a model's predictions..."), which used to be the starting point of this chapter. And rightly so: it's better to map out things first for people before going into the details.
This is a very good improvement already, let's have improvements on this in future PRs.
Thank you, everyone!
@adrinjalali Thanks, really much appreciated!
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com> Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Solves issue #10584.
Adds the new chapter "3.3.2. Which scoring function should I use?" to the user guide.