-
-
Notifications
You must be signed in to change notification settings - Fork 26k
DOC Fix the description of some features in load_diabetes #19366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Improved description of features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overlooked the changes, I think tch
also needs a change, and ltg
is probably also not correct.
sklearn/datasets/descr/diabetes.rst
Outdated
@@ -21,11 +21,11 @@ quantitative measure of disease progression one year after baseline. | |||
- sex | |||
- bmi body mass index | |||
- bp average blood pressure | |||
- s1 tc, T-Cells (a type of white blood cells) | |||
- s1 tc, total serum cholesterol | |||
- s2 ldl, low-density lipoproteins | |||
- s3 hdl, high-density lipoproteins | |||
- s4 tch, thyroid stimulating hormone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tch is consistent with actually being total cholesterol over HDL, as my comments in issue #18940 indicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning s4 == s1 / s3
? i tried comparing the values but they're not exactly the same. some rows differ by quite a bit. i'm hoping this is due to rounding error from the original dataset.
sklearn/datasets/descr/diabetes.rst
Outdated
- s2 ldl, low-density lipoproteins | ||
- s3 hdl, high-density lipoproteins | ||
- s4 tch, thyroid stimulating hormone | ||
- s5 ltg, lamotrigine | ||
- s5 ltg, serum concentration of lamotrigine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ltg is almost certainly not anything to do with lamotrigine, an epilepsy drug. In communication with one of the original authors, Hastie, he and Efron indicated that it's probably log of triglycerides. That's a relevant quantity, although log_10 (which Efron suggested) gives unreasonable overall values (assuming the usual units). Natural log is more plausible. I'd suggest "ltg, possibly log serum triglycerides level". See my original comment in issue #18940
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
- s4 tch, thyroid stimulating hormone | ||
- s5 ltg, lamotrigine | ||
- s4 tch, total cholesterol / HDL | ||
- s5 ltg, possibly log of serum triglycerides level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- s5 ltg, possibly log of serum triglycerides level | |
- s5 ltg, log of serum triglycerides level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add a warning to state that the reliability of the meaning of each feature is not as good as we would have liked because the documentation of the source dataset is not very explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a note in the API docs would be more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a note to sklearn.datasets.load_diabetes
on how the meaning of each feature might not be clear.
sklearn/datasets/_base.py
Outdated
@@ -757,7 +757,9 @@ def load_digits(*, n_class=10, return_X_y=False, as_frame=False): | |||
|
|||
@_deprecate_positional_args | |||
def load_diabetes(*, return_X_y=False, as_frame=False): | |||
"""Load and return the diabetes dataset (regression). | |||
"""Load and return the diabetes dataset (regression). The meaning of each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put the note after the table below using sphinx:
.. note::
The meaning of each feature (i.e. `feature_names`) might be unclear (especially for `ltg`)
as the documentation of the original dataset is not explicit. We provide information that
seems correct in regard with the scientific literature in this field of research.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for showing how sphinx can be used! Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this address, I think that we can merge this PR. We will not find a better explanation for the moment.
Thanks @hongshaoyang |
…rn#19366) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…rn#19366) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Closes #18940
What does this implement/fix? Explain your changes.
Any other comments?