DOC Fix the description of some features in load_diabetes #19366

hongshaoyang · 2021-02-06T08:07:42Z

Reference Issues/PRs

Closes #18940

What does this implement/fix? Explain your changes.

Any other comments?

hongshaoyang · 2021-02-06T08:08:34Z

Improved description of features s1 and s5

cc. @noambernstein @reubengann

adrinjalali

I overlooked the changes; I think tch also needs a change, and ltg is probably not correct either.

noambernstein · 2021-02-06T16:23:31Z

sklearn/datasets/descr/diabetes.rst

@@ -21,11 +21,11 @@ quantitative measure of disease progression one year after baseline.
      - sex
      - bmi     body mass index
      - bp      average blood pressure
-      - s1      tc, T-Cells (a type of white blood cells)
+      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone


tch is consistent with actually being total cholesterol over HDL, as my comments in issue #18940 indicate.

Meaning s4 == s1 / s3? I tried comparing the values, but they are not exactly the same, and some rows differ by quite a bit. I'm hoping this is due to rounding error in the original dataset.
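One way to check how closely s4 tracks s1 / s3 is to compute the ratio per row and look at the absolute gap against the recorded value. The sketch below is only illustrative: the three rows are made-up numbers rather than values from the dataset, and both the helper name `ratio_gaps` and the idea that tch was rounded to an integer are assumptions. Note also that the columns returned by `load_diabetes` are centered and rescaled, so a comparison like this only makes sense on unscaled source values.

```python
def ratio_gaps(tc, hdl, tch):
    """Return |tc / hdl - tch| for each row."""
    return [abs(t / h - r) for t, h, r in zip(tc, hdl, tch)]


# Made-up, unscaled values for illustration only (not taken from the dataset).
tc = [160.0, 180.0, 200.0]
hdl = [40.0, 70.0, 40.0]
tch = [4.0, 3.0, 4.0]

gaps = ratio_gaps(tc, hdl, tch)

# Assumption: a gap below 0.5 could be explained by tch having been
# rounded to the nearest integer; larger gaps need another explanation.
explained_by_rounding = [gap < 0.5 for gap in gaps]
```

Rows where `explained_by_rounding` is false would be the ones worth inspecting by hand.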

noambernstein · 2021-02-06T16:26:38Z

sklearn/datasets/descr/diabetes.rst

      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
-      - s5      ltg, lamotrigine
+      - s5      ltg, serum concentration of lamotrigine


ltg almost certainly has nothing to do with lamotrigine, an epilepsy drug. When I contacted Hastie, one of the original authors, he and Efron indicated that it is probably the log of triglycerides. That is a relevant quantity, although log_10 (which Efron suggested) gives unreasonable overall values (assuming the usual units). Natural log is more plausible. I'd suggest "ltg, possibly log serum triglycerides level". See my original comment in issue #18940.
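The choice of base can be sanity-checked by inverting the logarithm under both assumptions and comparing the result with a plausible triglyceride range. In the sketch below, the value 4.86 and the window of 10 to 1000 mg/dL are illustrative assumptions, not figures from the dataset or from its authors, and `implied_level` is a hypothetical helper.

```python
import math


def implied_level(log_value, base):
    """Invert a logarithm: return base ** log_value."""
    return base ** log_value


ltg = 4.86  # illustrative unscaled value, assumed for this sketch

natural = implied_level(ltg, math.e)  # on the order of 10^2
decimal = implied_level(ltg, 10.0)    # between 10^4 and 10^5

# Assumed plausibility window in mg/dL, chosen only for illustration.
LOW, HIGH = 10.0, 1000.0
natural_plausible = LOW <= natural <= HIGH
decimal_plausible = LOW <= decimal <= HIGH
```

Under these assumptions only the natural log lands inside the window, which is in line with the reasoning in the comment above.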

glemaitre · 2021-02-08T09:11:24Z

sklearn/datasets/descr/diabetes.rst

-      - s4      tch, thyroid stimulating hormone
-      - s5      ltg, lamotrigine
+      - s4      tch, total cholesterol / HDL
+      - s5      ltg, possibly log of serum triglycerides level


Suggested change

-      - s5      ltg, possibly log of serum triglycerides level
+      - s5      ltg, log of serum triglycerides level

Maybe we should add a warning to state that the reliability of the meaning of each feature is not as good as we would have liked because the documentation of the source dataset is not very explicit.

I think a note in the API docs would be more appropriate

I added a note to sklearn.datasets.load_diabetes on how the meaning of each feature might not be clear.

glemaitre · 2021-04-16T16:20:49Z

sklearn/datasets/_base.py

@@ -757,7 +757,9 @@ def load_digits(*, n_class=10, return_X_y=False, as_frame=False):

 @_deprecate_positional_args
 def load_diabetes(*, return_X_y=False, as_frame=False):
-    """Load and return the diabetes dataset (regression).
+    """Load and return the diabetes dataset (regression). The meaning of each


Let's put the note after the table below using sphinx:

.. note:: The meaning of each feature (i.e. `feature_names`) might be unclear (especially for `ltg`) as the documentation of the original dataset is not explicit. We provide information that seems correct in regard with the scientific literature in this field of research.

Thanks for showing how sphinx can be used! Updated.
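For readers unfamiliar with the directive, the note suggested above could sit in a docstring roughly as follows. This is a minimal sketch: `load_diabetes_sketch` is a hypothetical stand-in, not the actual function in `sklearn/datasets/_base.py`, and the table that the note is meant to follow is omitted.

```python
def load_diabetes_sketch():
    """Load and return the diabetes dataset (regression).

    .. note::
       The meaning of each feature (i.e. `feature_names`) might be unclear
       (especially for `ltg`) as the documentation of the original dataset is
       not explicit. We provide information that seems correct in regard with
       the scientific literature in this field of research.
    """


# The directive is plain text inside the docstring until Sphinx renders it.
doc = load_diabetes_sketch.__doc__
```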

glemaitre

Once this is addressed, I think that we can merge this PR. We will not find a better explanation for the moment.

glemaitre · 2021-04-16T16:42:11Z

Thanks @hongshaoyang

Fix Diabetes data set description

dfda8ae

github-actions bot added the module:datasets label Feb 6, 2021

adrinjalali approved these changes Feb 6, 2021

adrinjalali requested changes Feb 6, 2021

noambernstein reviewed Feb 6, 2021

Fix Diabetes data set description

3b270c6

glemaitre reviewed Feb 8, 2021

hongshaoyang added 2 commits April 16, 2021 23:45

Merge remote-tracking branch 'upstream/main' into 18940-diabetes

bd79ffb

Add note to API docs

4cbea62

glemaitre reviewed Apr 16, 2021

glemaitre approved these changes Apr 16, 2021

glemaitre changed the title ~~Fix Diabetes data set description~~ DOC Fix the description of some features in load_diabetes Apr 16, 2021

github-actions bot added the Documentation label Apr 16, 2021

hongshaoyang and others added 2 commits April 17, 2021 00:38

Add note to API docs

720bed7

Update _base.py

b93801f

glemaitre merged commit 90b3992 into scikit-learn:main Apr 16, 2021

hongshaoyang deleted the 18940-diabetes branch April 16, 2021 16:43

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Apr 19, 2021

DOC Fix the description of some features in load_diabetes (scikit-lea…

ae3cf2a

…rn#19366) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre mentioned this pull request Apr 22, 2021

Release 0.24.2 #19954

Merged

12 tasks

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Apr 22, 2021

DOC Fix the description of some features in load_diabetes (scikit-lea…

444dc75

…rn#19366) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre added a commit that referenced this pull request Apr 28, 2021

DOC Fix the description of some features in load_diabetes (#19366)

ff6d2f0

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
