Skip to content

DOC Fix the description of some features in load_diabetes #19366

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 16, 2021

Conversation

hongshaoyang
Copy link
Contributor

Reference Issues/PRs

Closes #18940

What does this implement/fix? Explain your changes.

Any other comments?

@hongshaoyang
Copy link
Contributor Author

hongshaoyang commented Feb 6, 2021

Improved description of features s1 and s5

cc. @noambernstein @reubengann

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked the changes, I think tch also needs a change, and ltg is probably also not correct.

@@ -21,11 +21,11 @@ quantitative measure of disease progression one year after baseline.
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tch is consistent with actually being total cholesterol over HDL, as my comments in issue #18940 indicate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaning s4 == s1 / s3? i tried comparing the values but they're not exactly the same. some rows differ by quite a bit. i'm hoping this is due to rounding error from the original dataset.

- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s5 ltg, serum concentration of lamotrigine

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ltg is almost certainly not anything to do with lamotrigine, an epilepsy drug. In communication with one of the original authors, Hastie, he and Efron indicated that it's probably log of triglycerides. That's a relevant quantity, although log_10 (which Efron suggested) gives unreasonable overall values (assuming the usual units). Natural log is more plausible. I'd suggest "ltg, possibly log serum triglycerides level". See my original comment in issue #18940

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- s5 ltg, possibly log of serum triglycerides level
- s5 ltg, log of serum triglycerides level

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add a warning to state that the reliability of the meaning of each feature is not as good as we would have liked because the documentation of the source dataset is not very explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a note in the API docs would be more appropriate

Copy link
Contributor Author

@hongshaoyang hongshaoyang Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note to sklearn.datasets.load_diabetes on how the meaning of each feature might not be clear.

@@ -757,7 +757,9 @@ def load_digits(*, n_class=10, return_X_y=False, as_frame=False):

@_deprecate_positional_args
def load_diabetes(*, return_X_y=False, as_frame=False):
"""Load and return the diabetes dataset (regression).
"""Load and return the diabetes dataset (regression). The meaning of each
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put the note after the table below using sphinx:

.. note::
   The meaning of each feature (i.e. `feature_names`) might be unclear (especially for `ltg`)
   as the documentation of the original dataset is not explicit. We provide information that
   seems correct in regard with the scientific literature in this field of research.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for showing how sphinx can be used! Updated.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this address, I think that we can merge this PR. We will not find a better explanation for the moment.

@glemaitre glemaitre changed the title Fix Diabetes data set description DOC Fix the description of some features in load_diabetes Apr 16, 2021
@glemaitre glemaitre merged commit 90b3992 into scikit-learn:main Apr 16, 2021
@glemaitre
Copy link
Member

Thanks @hongshaoyang

@hongshaoyang hongshaoyang deleted the 18940-diabetes branch April 16, 2021 16:43
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Apr 19, 2021
…rn#19366)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@glemaitre glemaitre mentioned this pull request Apr 22, 2021
12 tasks
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Apr 22, 2021
…rn#19366)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre added a commit that referenced this pull request Apr 28, 2021
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Diabetes data set description. Possibly inaccurate?
5 participants