Skip to content

add example of a good docstring for defaults and examples #12356

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Aug 7, 2019

Conversation

adrinjalali
Copy link
Member

Although numpy's docstring standards don't require it, it seems there's a preference towards having the defaults mentioned in the docstrings. @jnothman is not really convinced on this IIRC though, but PRs that fix the mentions to defaults are generally accepted, as far as I know.

An "Examples" section is also added to most classes by now, therefore I think it's not a bad idea to refer to an example in the developers' guide which includes both the abovementioned points.

@NicolasHug
Copy link
Member

+1, this is the kind of implicit stuff that deserves being made explicit in the contributing guide. New contributors don't know what's the commonly accepted way of specifying defaults (among other things), and will basically git grep patterns until they find one that has enough hits.

I can tell, that's what I did...

@qinhanmin2014
Copy link
Member

I think in #12111, we've decided not to force contributors to specify parameter defaults in docstring?
LinearSVC doesn't seems to be a good example. We don't record the default value of class_weight and the optional label only appears in some optional parameters.

@adrinjalali
Copy link
Member Author

@qinhanmin2014 my main point here is that there should be a place where we give clear examples of how scikit-learn wants the docstrings to look like. Even if we can't agree on one style, we should then point to two variants we like, and say "choose either of the two styles and be consistent".

We can fix the issue with LinearSVC of course, and fix it. But I intentionally put that one here because it has a parameter with None as the default, i.e. class_weight. To me, if the default is None, it is truly optional and mentioning its default value is not necessary. But again, we should decide on that and point developers to an example or two which we agree upon.

I personally like having the default values in the docstring, cause I find it sometimes tedious to go back and forth between the beginning of the page and where the parameter is explained. But I also agree the default values should ideally be generated automatically in the docstring, in order to handle discrepancies caused by PRs changing the __init__ and not the docstring down the line.

@qinhanmin2014
Copy link
Member

As long as you can find a good example, I'll vote +1 (still need a +2). LinearSVC seems not good.

@jnothman
Copy link
Member

jnothman commented Oct 14, 2018 via email

@adrinjalali
Copy link
Member Author

Sure, could do that. I was trying to keep it minimal since it would probably trigger a long discussion.

Just one question, in this section, the requirements for a "user guide" documentation and "docstring" documentation are kinda mixed. Should I separate them into two clear sections?

@jnothman
Copy link
Member

jnothman commented Oct 16, 2018 via email

@NicolasHug
Copy link
Member

Why not "just" fix the doc discrepancies in LinearSVC in this PR then?

I also agree that defaults would be better set automatically by some doc processing, but until then, this seems like a good option.

Moreover, is this doc processing ever going to happen? As far as I understand, those kind of changes that impact a lot of files are frowned upon because they would conflict with too many existing PRs.

@jnothman
Copy link
Member

jnothman commented Oct 18, 2018 via email

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a lot of rules. I feel like having 2-3 examples of good docstrings could properly cover all the cases and would be easier to grasp for contributors.

Here are some comments anyway

@@ -603,6 +603,24 @@ Finally, follow the formatting rules below to make it consistently good:
SelectKBest : Select features based on the k highest scores.
SelectFpr : Select features based on a false positive rate test.

* When documenting the parameters and attributes, have the following in mind:

1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not sure?)

Suggested change
1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
1. Do not use optional. Use `default=`.

1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
2. Python basic types. (`bool` instead of `boolean`)
3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`.
4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge with 3?

4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.
5. str with multiple options: `input: {'log', 'squared', 'multinomial'}`
6. Only use `or` for separating types. `float, int or None, default=None`
7. Only use comma to separate types. Information such as `shape` or `string`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by that: you also use a comma in {'a', 'b'}?

Also this is rst, not markdown ;)

{'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None
```

8. When supporting array-like and sparse matrix use:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why having a special case for this?

array-like or sparse matrix shape=(n_samples, n_features)

would work too?

Copy link
Member

@glemaitre glemaitre Jul 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not have any separator between the matrix type and shape.

EDIT: I see 7. now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since that we can have dataframe as well, we could make this point more generic.

@NicolasHug
Copy link
Member

I would proposed something like this


here are some examples of good docstrings:

some_param : bool, int or str {'hello', 'goodbye'}, default=True
	Parameter description goes here

Sometimes the default value is better described in plain English:

sample_weight : array-like or None
	The sample weights, ignored by default (None).

@rth
Copy link
Member

rth commented Jul 30, 2019

It would be nice to standardize it indeed.

Do we anticipate to use type annotations (#11170) in the future in which case one could (optimistically) imagine generating the docstrings (including type and defaults) from type annotations ( numpy/numpydoc#196)? Not sure how realistic that would be, bit if possible it could save us some string formatting work.

1. Do not use optional. Use `default=`. `str {'a', 'b'} or float, default=1.0`
2. Python basic types. (`bool` instead of `boolean`)
3. When defining 1-D shape, use parenthesis. `array-like shape=(n_samples,), None, default=None`.
4. When defining 2-D shape, use parenthesis. `array-like shape=(n_samples, n_features)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case that the array can be both 1-D and 2-D (e.g. output y_pred)

options are defined together. Here is an extreme example:

```
{'a', 'b'}, float, int, array-like shape=(n_samples,) or None, default=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If shape= is a standard then I am fine with it. But I would rather find the following easier to parse visually.

"""
array-like of shape (n_samples,)
{array-like, sparse matrix or dataframe} of shape (n_samples, n_features)
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with of shape. I would go with:

{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

(There is no "or" before "dataframe")

@glemaitre
Copy link
Member

I think that it could be nice to define the type of array that should be mentioned:

  • ndarray
  • array-like
  • sparse matrix
  • dataframe

@thomasjpfan
Copy link
Member

@rth We have so much flexibility when it comes to what we put into the "type" of a parameter. This includes metadata such as "shape" or the options of a string type. There is pep 593 that tries to place this metadata into the type itself.

In near future, I see us continuing with using docstrings for types. Once we settle on a standard, there will be a test to make sure the standard is upheld allowing the CI to do the string formatting check for us.

@amueller
Copy link
Member

btw @thomasjpfan is working on a formal way to describe allowed parameters but it's unclear whether this can go into sklearn. The docstrings need something simpler, though, as the docstring types don't document the interactions between the parameters.

@amueller
Copy link
Member

@glemaitre do we have anything that excepts only nd-array? I don't think so. I think the three types of interest are array-like, sparse matrix and dataframe.
I think CountVectorizer and DictVectorizer might require lists or 1d-arrays?

@glemaitre
Copy link
Member

glemaitre commented Jul 30, 2019

@glemaitre do we have anything that excepts only nd-array?

I would not be surprised to find some public helper functions (especially wrapping some cython) working only with ndarray.

@amueller
Copy link
Member

@glemaitre fair. Which reminds me of #6616, which I think we should address these days lol

@thomasjpfan
Copy link
Member

btw @thomasjpfan is working on a formal way to describe allowed parameters but it's unclear whether this can go into sklearn.

That was scoped to __init__ which is slightly easier to do. There are not as many "array-like, sparse matrices" parameters in __init__.

@amueller
Copy link
Member

amueller commented Aug 2, 2019

Maybe add an example of where the default behavior is complicated and we don't document it in the type but instead in the text?

@qdeffense
Copy link
Contributor

I think it would be nice to have easier examples too. If someone looks at this example :
some_param : {'hello', 'goodbye'}, bool or int, default=True
But have only one string, it could be
{'hello'} or bool, default=True
or
'hello' or bool, default=True

@adrinjalali
Copy link
Member Author

Maybe add an example of where the default behavior is complicated and we don't document it in the type but instead in the text?

Modified the second example to include the default value.

@adrinjalali
Copy link
Member Author

Is this good to go?

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting issues, but LGTM when addressed

Thanks Adrin


1. Python basic types. (`bool` instead of `boolean`)
2. Use parenthesis for defining shapes: `array-like of shape (n_samples,)`
or `array-like of shape (n_samples, n_features)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list doesn't render properly, you'll need to indent the new lines (please checked rendered doc)

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, not sure who has time to correct existing doc o(╥﹏╥)o (open an issue?)

@qinhanmin2014 qinhanmin2014 merged commit 60191ce into scikit-learn:master Aug 7, 2019
@adrinjalali adrinjalali deleted the doc/developers branch August 7, 2019 15:53
qdeffense added a commit to qdeffense/scikit-learn that referenced this pull request Aug 7, 2019
literal (either `hello` or `goodbye`), a bool, or an int. The default
value is True.

array_parameter : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) or (n_samples,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhh did we have a discussion of "of shape" vs "shape=". Did someone count that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"of shape": 513 lines; "shape=" 338 lines; "shape =" 1316 lines. "of shape =": 162 lines, "of shape=" 1 line.

These are regex matches, so non-exclusive, so "of shape" includes "of shape =".
So we have 350 "of shape (...)" and 1115 "shape = (..)"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes shape= would result in a smaller diff. Since this PR will result in many docstring changes, we should decide based on which is subjectively better.

(I have been +0.25 for shape= because it saves a few more characters.) Do you have a opinion on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is shape= but I don't mind that much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are often short on space in that line.

array-like of shape (n_samples,) vs array-like, shape=(n_samples,) saves 2 characters. Not sure if that's a good argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can fix those cases that don't comply with this with a sed, so I don't think the number of instances of each case are important that much anyway. I kinda liked not having the comma cause the comma is separating other things:

int, array-like of shape (n_samples,), or None, default=None (note that we don't have the oxford comma in our guide now, but I rather have it)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like not having the comma :P

I prefer of shape if there is no comma. I find it more difficult to visually parse it.
I also find it closer to literal English --- even if we are writing documentation for expert computer scientists (I should not troll about Matlab post).

Anyway, I am going with the consensus (since it is merged it seems we are going for that one?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fine with keeping it.
@adrinjalali feel free to go ahead and do the sed and then fix all the merge conflicts :P
If whis is the way we want to go, we should probably do the sed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are we doing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so, I'll submit a PR.

qdeffense added a commit to qdeffense/scikit-learn that referenced this pull request Aug 7, 2019
@glemaitre
Copy link
Member

I am a missing couple of thing. I can open a PR if we decide on the style:

  • sometimes we were having a parameter taking a list and we were mentioning list of *** (e.g. list of str or list of (str, estimator) tuples. Shall we mention this corner case?
  • when one of the parameters is an estimator, we agree that we mention estimator obj?

@thomasjpfan
Copy link
Member

sometimes we were having a parameter taking a list and we were mentioning list of *** (e.g. list of str or list of (str, estimator) tuples. Shall we mention this corner case?

I see list of (str, estimator) as a list of tuples and list of {str, estimator} as a list of strings or estimators.

when one of the parameters is an estimator, we agree that we mention estimator obj?

I think just having estimator should be enough.

@rth
Copy link
Member

rth commented Oct 7, 2019

I think adding standard validation tool to numpydocs (e.g. by adapting the docstring validation scripts from pandas) see numpy/numpydoc#213 is a better long term solution that defining our own docstring standards.

I mean our current docstring standards are fine, but for future work in that area it would be more productive to collaborate on numpydoc IMO.

@NicolasHug
Copy link
Member

Agreed, I'd be very happy to see numpy/numpydoc#213 move forward

@thomasjpfan
Copy link
Member

The work surrounding https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py is pretty amazing.

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Nov 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants