-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
add example of a good docstring for defaults and examples #12356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
+1, this is the kind of implicit stuff that deserves being made explicit in the contributing guide. New contributors don't know what's the commonly accepted way of specifying defaults (among other things), and will basically I can tell, that's what I did... |
I think in #12111, we've decided not to force contributors to specify parameter defaults in docstring? |
@qinhanmin2014 my main point here is that there should be a place where we give clear examples of how scikit-learn wants the docstrings to look like. Even if we can't agree on one style, we should then point to two variants we like, and say "choose either of the two styles and be consistent". We can fix the issue with I personally like having the default values in the docstring, cause I find it sometimes tedious to go back and forth between the beginning of the page and where the parameter is explained. But I also agree the default values should ideally be generated automatically in the docstring, in order to handle discrepancies caused by PRs changing the |
As long as you can find a good example, I'll vote +1 (still need a +2). LinearSVC seems not good. |
I think it might be better to explicitly state the kinds of things a
contributor should look out for than to link to an example
|
Sure, could do that. I was trying to keep it minimal since it would probably trigger a long discussion. Just one question, in this section, the requirements for a "user guide" documentation and "docstring" documentation are kinda mixed. Should I separate them into two clear sections? |
I don't have the time to look into the current state of the docs in detail
now, but I don't think anyone minds the contributor docs being improved
from the perspective of new contributors. Another thing we have not done
yet is referenced the glossary, which might allow these docs to be more
succinct and structured more didactically.
|
Why not "just" fix the doc discrepancies in I also agree that defaults would be better set automatically by some doc processing, but until then, this seems like a good option. Moreover, is this doc processing ever going to happen? As far as I understand, those kind of changes that impact a lot of files are frowned upon because they would conflict with too many existing PRs. |
The doc processing I was referring to would be putting a change in
numpydoc, not changing the docstrings... but it's also never going to
happen :)
It's currently a nuisance to look up the default value. But I don't find
"default=None" helpful when the semantics of "None" are unclear. I would
rather a descriptive "by default, ...."
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a lot of rules. I feel like having 2-3 examples of good docstrings could properly cover all the cases and would be easier to grasp for contributors.
Here are some comments anyway
doc/developers/contributing.rst
Outdated
@@ -603,6 +603,24 @@ Finally, follow the formatting rules below to make it consistently good: | |||
SelectKBest : Select features based on the k highest scores. | |||
SelectFpr : Select features based on a false positive rate test. | |||
|
|||
* When documenting the parameters and attributes, have the following in mind: | |||
|
|||
1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not sure?)
1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0` | |
1. Do not use optional. Use `default=`. |
doc/developers/contributing.rst
Outdated
1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0` | ||
2. Python basic types. (`bool` instead of `boolean`) | ||
3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`. | ||
4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge with 3?
doc/developers/contributing.rst
Outdated
4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`. | ||
5. str with multiple options: `input: {'log', 'squared', 'multinomial'}` | ||
6. Only use `or` for separating types. `float, int or None, default=None` | ||
7. Only use comma to separate types. Information such as `shape` or `string` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean by that: you also use a comma in {'a', 'b'}
?
Also this is rst, not markdown ;)
doc/developers/contributing.rst
Outdated
{'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None | ||
``` | ||
|
||
8. When supporting array-like and sparse matrix use: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why having a special case for this?
array-like or sparse matrix shape=(n_samples, n_features)
would work too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to not have any separator between the matrix type
and shape
.
EDIT: I see 7. now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since that we can have dataframe as well, we could make this point more generic.
I would proposed something like this here are some examples of good docstrings:
Sometimes the default value is better described in plain English:
|
It would be nice to standardize it indeed. Do we anticipate to use type annotations (#11170) in the future in which case one could (optimistically) imagine generating the docstrings (including type and defaults) from type annotations ( numpy/numpydoc#196)? Not sure how realistic that would be, bit if possible it could save us some string formatting work. |
doc/developers/contributing.rst
Outdated
1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0` | ||
2. Python basic types. (`bool` instead of `boolean`) | ||
3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`. | ||
4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the case that the array can be both 1-D and 2-D (e.g. output y_pred
)
doc/developers/contributing.rst
Outdated
options are defined together. Here is an extreme example: | ||
|
||
``` | ||
{'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If shape=
is a standard then I am fine with it. But I would rather find the following easier to parse visually.
"""
array-like of shape (n_samples,)
{array-like, sparse matrix or dataframe} of shape (n_samples, n_features)
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am okay with of shape
. I would go with:
{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
(There is no "or" before "dataframe")
I think that it could be nice to define the type of array that should be mentioned:
|
@rth We have so much flexibility when it comes to what we put into the "type" of a parameter. This includes metadata such as "shape" or the options of a string type. There is pep 593 that tries to place this metadata into the type itself. In near future, I see us continuing with using docstrings for types. Once we settle on a standard, there will be a test to make sure the standard is upheld allowing the CI to do the string formatting check for us. |
btw @thomasjpfan is working on a formal way to describe allowed parameters but it's unclear whether this can go into sklearn. The docstrings need something simpler, though, as the docstring types don't document the interactions between the parameters. |
@glemaitre do we have anything that excepts only nd-array? I don't think so. I think the three types of interest are array-like, sparse matrix and dataframe. |
I would not be surprised to find some public helper functions (especially wrapping some cython) working only with ndarray. |
@glemaitre fair. Which reminds me of #6616, which I think we should address these days lol |
That was scoped to |
Maybe add an example of where the default behavior is complicated and we don't document it in the type but instead in the text? |
I think it would be nice to have easier examples too. If someone looks at this example : |
Modified the second example to include the default value. |
Is this good to go? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting issues, but LGTM when addressed
Thanks Adrin
doc/developers/contributing.rst
Outdated
|
||
1. Python basic types. (`bool` instead of `boolean`) | ||
2. Use parenthesis for defining shapes: `array-like of shape (n_samples,)` | ||
or `array-like of shape (n_samples, n_features)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list doesn't render properly, you'll need to indent the new lines (please checked rendered doc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, not sure who has time to correct existing doc o(╥﹏╥)o (open an issue?)
literal (either `hello` or `goodbye`), a bool, or an int. The default | ||
value is True. | ||
|
||
array_parameter : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) or (n_samples,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhh did we have a discussion of "of shape" vs "shape=". Did someone count that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"of shape": 513 lines; "shape=" 338 lines; "shape =" 1316 lines. "of shape =": 162 lines, "of shape=" 1 line.
These are regex matches, so non-exclusive, so "of shape" includes "of shape =".
So we have 350 "of shape (...)" and 1115 "shape = (..)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes shape=
would result in a smaller diff. Since this PR will result in many docstring changes, we should decide based on which is subjectively better.
(I have been +0.25 for shape=
because it saves a few more characters.) Do you have a opinion on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is shape= but I don't mind that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are often short on space in that line.
array-like of shape (n_samples,)
vs array-like, shape=(n_samples,)
saves 2 characters. Not sure if that's a good argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can fix those cases that don't comply with this with a sed
, so I don't think the number of instances of each case are important that much anyway. I kinda liked not having the comma cause the comma is separating other things:
int, array-like of shape (n_samples,), or None, default=None
(note that we don't have the oxford comma in our guide now, but I rather have it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like not having the comma :P
I prefer of shape
if there is no comma. I find it more difficult to visually parse it.
I also find it closer to literal English --- even if we are writing documentation for expert computer scientists (I should not troll about Matlab post).
Anyway, I am going with the consensus (since it is merged it seems we are going for that one?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fine with keeping it.
@adrinjalali feel free to go ahead and do the sed and then fix all the merge conflicts :P
If whis is the way we want to go, we should probably do the sed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess so, I'll submit a PR.
I am a missing couple of thing. I can open a PR if we decide on the style:
|
I see
I think just having |
I think adding standard validation tool to numpydocs (e.g. by adapting the docstring validation scripts from pandas) see numpy/numpydoc#213 is a better long term solution that defining our own docstring standards. I mean our current docstring standards are fine, but for future work in that area it would be more productive to collaborate on numpydoc IMO. |
Agreed, I'd be very happy to see numpy/numpydoc#213 move forward |
The work surrounding https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py is pretty amazing. |
Although numpy's docstring standards don't require it, it seems there's a preference towards having the defaults mentioned in the docstrings. @jnothman is not really convinced on this IIRC though, but PRs that fix the mentions to defaults are generally accepted, as far as I know.
An "Examples" section is also added to most classes by now, therefore I think it's not a bad idea to refer to an example in the developers' guide which includes both the abovementioned points.