add example of a good docstring for defaults and examples #12356

adrinjalali · 2018-10-11T13:41:36Z

Although numpy's docstring standards don't require it, it seems there's a preference towards having the defaults mentioned in the docstrings. @jnothman is not really convinced on this IIRC though, but PRs that fix the mentions to defaults are generally accepted, as far as I know.

An "Examples" section is also added to most classes by now, therefore I think it's not a bad idea to refer to an example in the developers' guide which includes both the abovementioned points.

NicolasHug · 2018-10-11T21:30:25Z

+1, this is the kind of implicit stuff that deserves being made explicit in the contributing guide. New contributors don't know what's the commonly accepted way of specifying defaults (among other things), and will basically git grep patterns until they find one that has enough hits.

I can tell, that's what I did...

qinhanmin2014 · 2018-10-12T03:32:59Z

I think in #12111, we've decided not to force contributors to specify parameter defaults in docstring?
LinearSVC doesn't seem to be a good example. We don't record the default value of class_weight, and the optional label appears on only some of the optional parameters.

adrinjalali · 2018-10-12T07:14:30Z

@qinhanmin2014 my main point here is that there should be a place where we give clear examples of what scikit-learn wants the docstrings to look like. Even if we can't agree on one style, we should then point to two variants we like, and say "choose either of the two styles and be consistent".

We can of course fix the issue with LinearSVC. But I intentionally put that one here because it has a parameter with None as the default, i.e. class_weight. To me, if the default is None, it is truly optional and mentioning its default value is not necessary. But again, we should decide on that and point developers to an example or two which we agree upon.

I personally like having the default values in the docstring, cause I find it sometimes tedious to go back and forth between the beginning of the page and where the parameter is explained. But I also agree the default values should ideally be generated automatically in the docstring, in order to handle discrepancies caused by PRs changing the __init__ and not the docstring down the line.
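
A minimal sketch of how such discrepancies could be detected, assuming parameter lines of the form `name : type, default=value`. The class `ToyEstimator` and the helper `documented_defaults_mismatch` are hypothetical and exist only for this illustration; they are not part of scikit-learn or numpydoc.

```python
import inspect
import re


class ToyEstimator:
    """Toy estimator used only for illustration.

    Parameters
    ----------
    C : float, default=1.0
        Regularization strength.
    max_iter : int, default=100
        Maximum number of iterations.
    """

    # The signature was changed to 1000 but the docstring still says 100.
    def __init__(self, C=1.0, max_iter=1000):
        self.C = C
        self.max_iter = max_iter


def documented_defaults_mismatch(cls):
    """Return {param: (documented, actual)} where docstring and __init__ disagree."""
    doc = inspect.getdoc(cls) or ""
    # Match unindented lines such as "max_iter : int, default=100".
    pattern = re.compile(r"^(\w+) : .*\bdefault=(.+)$", re.MULTILINE)
    documented = {name: value.strip() for name, value in pattern.findall(doc)}
    mismatches = {}
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == "self" or param.default is inspect.Parameter.empty:
            continue
        if name in documented and documented[name] != repr(param.default):
            mismatches[name] = (documented[name], repr(param.default))
    return mismatches


print(documented_defaults_mismatch(ToyEstimator))
```

Comparing against `repr(param.default)` is a simplification: it works for plain literals such as numbers, strings, `True` and `None`, but not for defaults that are described in words.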

qinhanmin2014 · 2018-10-12T07:36:24Z

As long as you can find a good example, I'll vote +1 (still need a +2). LinearSVC seems not good.

jnothman · 2018-10-14T01:07:13Z

I think it might be better to explicitly state the kinds of things a contributor should look out for than to link to an example

adrinjalali · 2018-10-16T11:51:27Z

Sure, could do that. I was trying to keep it minimal since it would probably trigger a long discussion.

Just one question, in this section, the requirements for a "user guide" documentation and "docstring" documentation are kinda mixed. Should I separate them into two clear sections?

jnothman · 2018-10-16T21:56:08Z

I don't have the time to look into the current state of the docs in detail now, but I don't think anyone minds the contributor docs being improved from the perspective of new contributors. Another thing we have not done yet is reference the glossary, which might allow these docs to be more succinct and structured more didactically.

NicolasHug · 2018-10-17T14:39:20Z

Why not "just" fix the doc discrepancies in LinearSVC in this PR then?

I also agree that defaults would be better set automatically by some doc processing, but until then, this seems like a good option.

Moreover, is this doc processing ever going to happen? As far as I understand, those kinds of changes that impact a lot of files are frowned upon because they would conflict with too many existing PRs.

jnothman · 2018-10-18T02:17:22Z

The doc processing I was referring to would be putting a change in numpydoc, not changing the docstrings... but it's also never going to happen :) It's currently a nuisance to look up the default value. But I don't find "default=None" helpful when the semantics of "None" are unclear. I would rather a descriptive "by default, ...."

NicolasHug

that's a lot of rules. I feel like having 2-3 examples of good docstrings could properly cover all the cases and would be easier to grasp for contributors.

Here are some comments anyway

NicolasHug · 2019-07-29T19:41:25Z

doc/developers/contributing.rst

@@ -603,6 +603,24 @@ Finally, follow the formatting rules below to make it consistently good:
    SelectKBest : Select features based on the k highest scores.
    SelectFpr : Select features based on a false positive rate test.

+* When documenting the parameters and attributes, have the following in mind:
+
+    1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`


(Not sure?)

Suggested change

1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`

1. Do not use optional. Use `default=`.

NicolasHug · 2019-07-29T19:41:45Z

doc/developers/contributing.rst

+    1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
+    2. Python basic types. (`bool` instead of `boolean`)
+    3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`.
+    4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.


merge with 3?

NicolasHug · 2019-07-29T19:43:26Z

doc/developers/contributing.rst

+    4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.
+    5. str with multiple options: `input: {'log', 'squared', 'multinomial'}`
+    6. Only use `or` for separating types. `float, int or None, default=None`
+    7. Only use comma to separate types. Information such as `shape` or `string`


I'm not sure what you mean by that: you also use a comma in {'a', 'b'}?

Also this is rst, not markdown ;)

NicolasHug · 2019-07-29T19:52:02Z

doc/developers/contributing.rst

+    {'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None
+    ```
+
+    8. When supporting array-like and sparse matrix use:


Why have a special case for this?

array-like or sparse matrix shape=(n_samples, n_features)

would work too?

Is there a reason not to have any separator between the matrix type and shape?

EDIT: I see 7. now.

Since we can have dataframes as well, we could make this point more generic.

NicolasHug · 2019-07-29T19:56:30Z

I would propose something like this:

here are some examples of good docstrings:

some_param : bool, int or str {'hello', 'goodbye'}, default=True
	Parameter description goes here

Sometimes the default value is better described in plain English:

sample_weight : array-like or None
	The sample weights, ignored by default (None).
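
Placed in a complete numpydoc docstring, the two styles might look like the sketch below. The function `describe` and its return value are hypothetical and only serve to give the parameter lines a home.

```python
def describe(some_param=True, sample_weight=None):
    """Toy function illustrating the two proposed styles.

    Parameters
    ----------
    some_param : bool, int or str {'hello', 'goodbye'}, default=True
        Parameter description goes here.

    sample_weight : array-like or None
        The sample weights, ignored by default (None).

    Returns
    -------
    n_weights : int
        Number of sample weights given, 0 if ``sample_weight`` is None.
    """
    return 0 if sample_weight is None else len(sample_weight)
```

The first parameter states its default in the type line; the second leaves it out of the type line and explains the default behaviour in the description instead.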

rth · 2019-07-30T13:40:41Z

It would be nice to standardize it indeed.

Do we anticipate using type annotations (#11170) in the future, in which case one could (optimistically) imagine generating the docstrings (including type and defaults) from type annotations (numpy/numpydoc#196)? Not sure how realistic that would be, but if possible it could save us some string formatting work.

glemaitre · 2019-07-30T13:41:12Z

doc/developers/contributing.rst

+    1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
+    2. Python basic types. (`bool` instead of `boolean`)
+    3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`.
+    4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.


What about the case that the array can be both 1-D and 2-D (e.g. output y_pred)

glemaitre · 2019-07-30T13:46:39Z

doc/developers/contributing.rst

+    options are defined together. Here is an extreme example:
+
+    ```
+    {'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None


If shape= is a standard then I am fine with it. But I find the following easier to parse visually:

array-like of shape (n_samples,)
{array-like, sparse matrix or dataframe} of shape (n_samples, n_features)

I am okay with of shape. I would go with:

{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

(There is no "or" before "dataframe")

glemaitre · 2019-07-30T13:48:26Z

I think that it could be nice to define the type of array that should be mentioned:

ndarray
array-like
sparse matrix
dataframe

thomasjpfan · 2019-07-30T15:25:09Z

@rth We have so much flexibility when it comes to what we put into the "type" of a parameter. This includes metadata such as "shape" or the options of a string type. There is pep 593 that tries to place this metadata into the type itself.

In the near future, I see us continuing to use docstrings for types. Once we settle on a standard, there will be a test to make sure the standard is upheld, allowing the CI to do the string formatting check for us.
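
A rough sketch of what such a test could check, using three of the rules discussed in this thread (no `optional`, `bool` instead of `boolean`, `of shape` instead of `shape=`). The rule list and the helper `check_type_line` are hypothetical; this is not an existing scikit-learn or numpydoc check.

```python
import re

# Hypothetical rules distilled from this thread, not an official check.
RULES = [
    (re.compile(r"\boptional\b"), "use 'default=' instead of 'optional'"),
    (re.compile(r"\bboolean\b"), "use 'bool' instead of 'boolean'"),
    (re.compile(r"\bshape\s*="), "use 'of shape (...)' instead of 'shape='"),
]


def check_type_line(line):
    """Return the style complaints for one 'name : type' docstring line."""
    # Only the part after " : " is the type specification.
    _, _, type_spec = line.partition(" : ")
    return [message for pattern, message in RULES if pattern.search(type_spec)]


for line in [
    "X : array-like, shape=(n_samples, n_features)",
    "verbose : boolean, optional",
    "y : array-like of shape (n_samples,), default=None",
]:
    print(line, "->", check_type_line(line))
```

A real check would first have to locate the parameter lines inside each docstring, which is the part a shared tool such as numpydoc would be better placed to handle.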

amueller · 2019-07-30T15:29:06Z

btw @thomasjpfan is working on a formal way to describe allowed parameters but it's unclear whether this can go into sklearn. The docstrings need something simpler, though, as the docstring types don't document the interactions between the parameters.

amueller · 2019-07-30T15:30:22Z

@glemaitre do we have anything that accepts only nd-array? I don't think so. I think the three types of interest are array-like, sparse matrix and dataframe.
I think CountVectorizer and DictVectorizer might require lists or 1d-arrays?

glemaitre · 2019-07-30T15:41:59Z

@glemaitre do we have anything that accepts only nd-array?

I would not be surprised to find some public helper functions (especially wrapping some cython) working only with ndarray.

amueller · 2019-07-30T15:47:50Z

@glemaitre fair. Which reminds me of #6616, which I think we should address these days lol

thomasjpfan · 2019-07-30T17:01:32Z

btw @thomasjpfan is working on a formal way to describe allowed parameters but it's unclear whether this can go into sklearn.

That was scoped to __init__ which is slightly easier to do. There are not as many "array-like, sparse matrices" parameters in __init__.

amueller · 2019-08-02T20:42:00Z

Maybe add an example of where the default behavior is complicated and we don't document it in the type but instead in the text?

qdeffense · 2019-08-04T14:00:27Z

I think it would be nice to have easier examples too. If someone looks at this example:
some_param : {'hello', 'goodbye'}, bool or int, default=True
but has only one string, it could be
{'hello'} or bool, default=True
or
'hello' or bool, default=True

adrinjalali · 2019-08-05T12:08:06Z

Maybe add an example of where the default behavior is complicated and we don't document it in the type but instead in the text?

Modified the second example to include the default value.

adrinjalali · 2019-08-07T11:41:16Z

Is this good to go?

NicolasHug

Formatting issues, but LGTM when addressed

Thanks Adrin

NicolasHug · 2019-08-07T13:12:50Z

doc/developers/contributing.rst

+
+    1. Python basic types. (`bool` instead of `boolean`)
+    2. Use parenthesis for defining shapes: `array-like of shape (n_samples,)`
+    or `array-like of shape (n_samples, n_features)`


This list doesn't render properly, you'll need to indent the new lines (please check the rendered doc)

qinhanmin2014

LGTM, not sure who has time to correct existing doc o(╥﹏╥)o (open an issue?)

amueller · 2019-08-07T16:54:12Z

doc/developers/contributing.rst

+        literal (either `hello` or `goodbye`), a bool, or an int. The default
+        value is True.
+
+    array_parameter : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) or (n_samples,)


uhh did we have a discussion of "of shape" vs "shape=". Did someone count that?

"of shape": 513 lines; "shape=" 338 lines; "shape =" 1316 lines. "of shape =": 162 lines, "of shape=" 1 line.

These are regex matches, so non-exclusive, so "of shape" includes "of shape =".
So we have 350 "of shape (...)" and 1115 "shape = (..)"
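
The subtraction behind the first figure (513 - 162 - 1 = 350) follows from the patterns overlapping. A small sketch on made-up lines, assuming a simple per-line regex count; the sample lines and the helper `count` are hypothetical:

```python
import re

# Made-up docstring lines covering the spellings being counted.
lines = [
    "X : array-like of shape (n_samples, n_features)",
    "y : array-like of shape = (n_samples,)",
    "w : array-like, shape = (n_samples,)",
    "z : array-like, shape=(n_samples,)",
]


def count(pattern):
    """Number of lines containing at least one match of the pattern."""
    return sum(1 for line in lines if re.search(pattern, line))


of_shape = count(r"of shape")       # also matches "of shape ="
of_shape_eq = count(r"of shape =")
shape_eq = count(r"shape =")        # also matches "of shape ="

# Remove the overlap to get mutually exclusive counts.
exclusive_of_shape = of_shape - of_shape_eq
exclusive_shape_eq = shape_eq - of_shape_eq
print(exclusive_of_shape, exclusive_shape_eq)
```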

Yes shape= would result in a smaller diff. Since this PR will result in many docstring changes, we should decide based on which is subjectively better.

(I have been +0.25 for shape= because it saves a few more characters.) Do you have an opinion on this?

My preference is shape= but I don't mind that much.

we are often short on space in that line.

array-like of shape (n_samples,) vs array-like, shape=(n_samples,) saves 2 characters. Not sure if that's a good argument?

we can fix those cases that don't comply with this with a sed, so I don't think the number of instances of each case are important that much anyway. I kinda liked not having the comma cause the comma is separating other things:

int, array-like of shape (n_samples,), or None, default=None (note that we don't have the Oxford comma in our guide now, but I'd rather have it)

I don't like not having the comma :P

I prefer of shape if there is no comma. I find it more difficult to visually parse it.
I also find it closer to literal English --- even if we are writing documentation for expert computer scientists (I should not troll about Matlab post).

Anyway, I am going with the consensus (since it is merged it seems we are going for that one?)

Ok, fine with keeping it.
@adrinjalali feel free to go ahead and do the sed and then fix all the merge conflicts :P
If this is the way we want to go, we should probably do the sed.

So are we doing this?

I guess so, I'll submit a PR.
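
For illustration, the kind of substitution being discussed could be sketched in Python as below. This is an assumption about what such a rewrite might look like, not the command actually used in the follow-up PR; the pattern `SHAPE_EQ` and the helper `to_of_shape` are hypothetical.

```python
import re

# Matches an optional "," or " of" prefix, then "shape=" or "shape =",
# and captures the parenthesised shape that follows.
SHAPE_EQ = re.compile(r"(?:,|\s+of)?\s*\bshape\s*=\s*(\([^)]*\))")


def to_of_shape(line):
    """Rewrite 'shape=(...)' style type lines to 'of shape (...)'."""
    return SHAPE_EQ.sub(r" of shape \1", line)


for line in [
    "X : array-like, shape=(n_samples, n_features)",
    "y : array-like, shape = (n_samples,)",
    "w : array-like of shape = (n_samples,)",
    "z : array-like of shape (n_samples,)",
]:
    print(to_of_shape(line))
```

Lines that already use `of shape (...)` are left unchanged because the pattern requires an `=` sign. A blanket rewrite like this would still need a manual review, for example for shapes that are written without parentheses.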

glemaitre · 2019-08-08T14:09:28Z

I am missing a couple of things. I can open a PR if we decide on the style:

sometimes we were having a parameter taking a list and we were mentioning list of *** (e.g. list of str or list of (str, estimator) tuples). Shall we mention this corner case?
when one of the parameters is an estimator, we agree that we mention estimator obj?

thomasjpfan · 2019-08-08T14:21:32Z

sometimes we were having a parameter taking a list and we were mentioning list of *** (e.g. list of str or list of (str, estimator) tuples). Shall we mention this corner case?

I see list of (str, estimator) as a list of tuples and list of {str, estimator} as a list of strings or estimators.

when one of the parameters is an estimator, we agree that we mention estimator obj?

I think just having estimator should be enough.
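
As an illustration of these two conventions in a docstring, here is a small sketch; the function `step_names` and its parameters are hypothetical.

```python
def step_names(steps, final_estimator=None):
    """Toy function illustrating list and estimator type lines.

    Parameters
    ----------
    steps : list of (str, estimator) tuples
        Each element is a (name, estimator) pair.

    final_estimator : estimator, default=None
        Estimator appended under the name 'final' when given.

    Returns
    -------
    names : list of str
        The step names, in order.
    """
    names = [name for name, _ in steps]
    if final_estimator is not None:
        names.append("final")
    return names
```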

rth · 2019-10-07T15:46:57Z

I think adding a standard validation tool to numpydoc (e.g. by adapting the docstring validation scripts from pandas), see numpy/numpydoc#213, is a better long-term solution than defining our own docstring standards.

I mean our current docstring standards are fine, but for future work in that area it would be more productive to collaborate on numpydoc IMO.

NicolasHug · 2019-10-07T16:13:37Z

Agreed, I'd be very happy to see numpy/numpydoc#213 move forward

thomasjpfan · 2019-10-07T16:26:17Z

The work surrounding https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py is pretty amazing.

add example of a good docstring for defaults and examples

615824b

adrinjalali mentioned this pull request Jul 22, 2019

Standardize Docstrings for Parameters and Attributes #14404

Closed

adrinjalali added 2 commits July 28, 2019 16:21

Merge remote-tracking branch 'upstream/master' into doc/developers

620de77

trying Thomas's proposal

71cb6b3

NicolasHug reviewed Jul 29, 2019

amueller mentioned this pull request Jul 29, 2019

[WIP] DOC documentation for default values in linear_models #14505

Closed

glemaitre reviewed Jul 30, 2019

This was referenced Jul 30, 2019

[WIP] DOC Document default values for bayes.py #14518

Merged

[MRG] DOC document default values for Lars #14512

Merged

glemaitre mentioned this pull request Jul 30, 2019

DOC specify default of error_score in cross_validate #14488

Merged

further simplify and clarify instructions

1ece81a

rth mentioned this pull request Aug 2, 2019

Parameter documentation for linear models #14452

Closed

amueller reviewed Aug 2, 2019

adrinjalali added 3 commits August 5, 2019 14:09

add default and an easy example

76a7273

Merge remote-tracking branch 'upstream/master' into doc/developers

74e581f

tab -> spaces

19ba145

qdeffense approved these changes Aug 5, 2019

adrinjalali mentioned this pull request Aug 6, 2019

Fix document for sklearn.svm.libsvm predict and predict_proba. #11125

Closed

adrinjalali mentioned this pull request Aug 7, 2019

DOC Minor fixes for parameter documentation in ridge #14453

Merged

NicolasHug approved these changes Aug 7, 2019

address comments

0256ea1

qinhanmin2014 approved these changes Aug 7, 2019

qinhanmin2014 merged commit 60191ce into scikit-learn:master Aug 7, 2019

adrinjalali deleted the doc/developers branch August 7, 2019 15:53

qdeffense added a commit to qdeffense/scikit-learn that referenced this pull request Aug 7, 2019

update the array-like to respect scikit-learn#12356

562a081

amueller reviewed Aug 7, 2019

qdeffense added a commit to qdeffense/scikit-learn that referenced this pull request Aug 7, 2019

update to respect scikit-learn#12356

88d0941

rth mentioned this pull request Aug 8, 2019

[MRG] Expose random seed in Hashingvectorizer #14605

Merged

adrinjalali mentioned this pull request Aug 13, 2019

DOC docstring (shape= -> of shape) #14640

Merged

thomasjpfan mentioned this pull request Aug 15, 2019

Add more dtype details to docstring standard doc #14664

Closed

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Nov 3, 2019

DOC PEP 257 and scikit-learn#12356 in _ridge.py

bbcdd0d
