ENH Adds typing support to LogisticRegression #17799

Open · wants to merge 53 commits into base: main

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 1, 2020

Reference Issues/PRs

Related to #16705
Related to #11170

What does this implement/fix? Explain your changes.

This PR experiments with static typing by:

  1. Adding typing to parameters and attributes of LogisticRegression.
  2. Using Annotated from PEP 593 to provide metadata about the shape of an ndarray.
  3. Adding an inject_docstring keyword to __init_subclass__ to inject the type into the docstring. This allows us to not depend on numpydoc. We also have custom formatting when we define types in our docstrings.
  4. Another option is to have the type only in the generated docs, so we would not need to inject the types at runtime.
  5. This will help standardize the types in our docstrings since they are automatically generated.
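As a rough illustration of point 3, the injection could be driven by __init_subclass__ along these lines. This is a minimal sketch, not this PR's implementation: only the inject_docstring keyword comes from the PR; the rendering helper and both classes are invented here.

```python
import inspect
from typing import Literal, get_args, get_origin, get_type_hints


def _type_to_str(tp) -> str:
    # Hypothetical formatter: render a Literal as numpydoc-style braces,
    # otherwise fall back to the type's name.
    if get_origin(tp) is Literal:
        return "{%s}" % ", ".join(repr(a) for a in get_args(tp))
    return getattr(tp, "__name__", str(tp))


class InjectsDocstring:
    def __init_subclass__(cls, inject_docstring=False, **kwargs):
        super().__init_subclass__(**kwargs)
        if not inject_docstring:
            return
        # Build a numpydoc Parameters section from the __init__ annotations
        # and append it to the class docstring at class-creation time.
        sig = inspect.signature(cls.__init__)
        hints = get_type_hints(cls.__init__)
        lines = ["Parameters", "----------"]
        for name, param in sig.parameters.items():
            if name not in hints:
                continue
            default = ("" if param.default is param.empty
                       else f", default={param.default!r}")
            lines.append(f"{name} : {_type_to_str(hints[name])}{default}")
        cls.__doc__ = (cls.__doc__ or "") + "\n\n" + "\n".join(lines)


class Model(InjectsDocstring, inject_docstring=True):
    """A toy estimator."""

    def __init__(self, penalty: Literal['l1', 'l2'] = 'l2', tol: float = 1e-4):
        self.penalty = penalty
        self.tol = tol
```

Model.__doc__ then ends with lines such as `penalty : {'l1', 'l2'}, default='l2'`, which is the numpydoc convention the docstrings already use.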

Any other comments?

This PR is for adding simple types:

    def __init__(self,
                  penalty: Literal['l1', 'l2', 'elasticnet', 'none'] = 'l2',
                  *,
                  dual: bool = False,
                  tol: float = 1e-4,
                  C: float = 1.0,
                  fit_intercept: bool = True,
                  intercept_scaling: float = 1,
                  class_weight: Union[dict, Literal['balanced']] = None)

Not part of this PR, but in the future I can see this Annotated being useful with ranges:

max_features: Union[Annotated[float, Range(0, 1)],
                    Annotated[int, Range(1, upper=None)],
                    Literal['auto', 'sqrt', 'log2'], None]
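The Range metadata is hypothetical even in the PR; here is a sketch of how such Annotated metadata could later be consumed for runtime validation. All names below (Range, validate_params, set_max_features) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_args, get_type_hints


@dataclass(frozen=True)
class Range:
    # Hypothetical metadata carrier; None means unbounded on that side.
    lower: Optional[float] = None
    upper: Optional[float] = None

    def check(self, name, value):
        if self.lower is not None and value < self.lower:
            raise ValueError(f"{name}={value} is below the lower bound {self.lower}")
        if self.upper is not None and value > self.upper:
            raise ValueError(f"{name}={value} is above the upper bound {self.upper}")


def validate_params(func, **kwargs):
    # Pull Annotated metadata off the signature and run every Range check.
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        hint = hints.get(name)
        if hint is None:
            continue
        for meta in get_args(hint)[1:]:
            if isinstance(meta, Range):
                meta.check(name, value)


def set_max_features(max_features: Annotated[float, Range(0, 1)] = 0.5):
    return max_features
```

validate_params(set_max_features, max_features=1.5) raises a ValueError, while max_features=0.5 passes silently.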

CC @rth

@rth
Member

rth commented Jul 1, 2020

Very interesting! So it seems to work for the docstring in https://110636-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.linear_model.LogisticRegression.html

The __init__ becomes somewhat difficult to read, but maybe that's something one can get used to.

It would be interesting to benchmark how long it takes to init this class before/after.

@amueller
Member

amueller commented Jul 1, 2020

Honestly I like the type annotations in the init. It just makes the insanity that is our parameters more visible ;)
So is the goal of inject_docstring to have this info available in jupyter? I think this is not something that should be fixed in sklearn. How are other projects handling this?

Interesting discussion here: scikit-image/scikit-image#3153

@@ -1036,7 +1042,7 @@ class LogisticRegression(BaseEstimator, LinearClassifierMixin,

Parameters
----------
penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
penalty :
Contributor

I do not think the space before : is necessary anymore, no?

Member Author

I think the space is a part of the rst definition list spec

Optional classifiers may follow the term on the same line, each after an inline " : " (space, colon, space).

Now that I look at the spec, technically we do not need the : at all.

Contributor

No, that space is part of the numpydoc style guide. Google style, for example, is without it; see the comparison.

Because, to my understanding, the internal Sphinx syntax expected is:

:param path: The path of the file to wrap
:type path: str
:param field_storage: The :class:`FileStorage` instance to wrap
:type field_storage: FileStorage
:param temporary: Whether or not to delete the file when the File
   instance is destructed
:type temporary: bool
:returns: A buffered writable file descriptor
:rtype: BufferedFileStorage

So I am guessing sklearn is using napoleon to convert the numpy docstring to the internal Sphinx one. One should then check how napoleon converts when there is no space before the :?

(This is all me guessing. I do not know those details of sklearn.)

Member Author

So I am guessing sklearn is using napoleon to convert numpy docstring to internal Sphinx one.

We are currently using numpydoc.

The space is picked up as a part of the parameter name:

no space

[screenshot: rendered docstring without the space]

with space

[screenshot: rendered docstring with the space]

("n_classes", "n_features"))] # noqa
intercept_: Annotated[np.ndarray, Shape((1,), # noqa
("n_classes",))] # noqa
n_iter_: Annotated[np.ndarray, Shape(("n_classes",), (1,))] # noqa
Contributor

Awesome!

C: float = 1.0,
fit_intercept: bool = True,
intercept_scaling: float = 1,
class_weight: Union[dict, Literal['balanced']] = None,
Contributor

You might want to specify dict more precisely.

from ..utils._typing import Annotated
from ..utils._typing import Shape
from typing import Union
from typing import Optional
Contributor

Any reason why these are all on a separate line?

Member Author

It's a personal style choice so diffs are easier to look at. I also like:

from typing import (
	Union,
	Optional,
)

Contributor

True as well.

Unrelated: does sklearn use black?

Member Author

We are slowly working our way on deciding this: #11336

Contributor

I think this comment is relevant here: #11336 (comment)

If we go with adding typing information, then black is reasonable for its style choices.

@mitar
Contributor

mitar commented Jul 2, 2020

Honestly I like the type annotations in the init. It just makes the insanity that is our parameters more visible ;)

Maybe this could be improved with a style change to how this is shown, with every parameter on its own line or something, becoming something like a TOC you can click on to get to its documentation.

@mitar
Contributor

mitar commented Jul 2, 2020

I like inject_docstring, but maybe performance implications should be checked?

@thomasjpfan
Member Author

I like inject_docstring, but maybe performance implications should be checked?

Yeah, I am not too happy with doing any string parsing at runtime.

After speaking with @amueller, I would be okay with two sources of truth and a linter that makes sure the docstring type is consistent with the typing annotation.

@mitar
Contributor

mitar commented Jul 3, 2020

After speaking with @amueller, I would be okay with two sources of truth and a linter that makes sure the docstring type is consistent with the typing annotation.

So then I would say: still one source of truth, only that docstrings are compiled and committed into the repository. You can then use CI to check that the author of a PR did not forget to do so.

But the downside here is that while for core sklearn you are able to verify this, 3rd party estimators might become out of sync.

One option could be to have an env variable which you can set to get docstrings dynamically updated at runtime as well. Do note that this updating happens only at module load time, only once per process run. So I would really first check this performance hit, because it might be premature optimization to try to have compiled docstrings in the repository; it adds complexity. So if loading sklearn or its estimators before and after this change is not significantly different, maybe it is not a problem.
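Such an opt-in switch could be as small as this sketch; the environment variable name, the decorator, and the generator stub are all hypothetical.

```python
import os


def docstring_from_annotations(cls):
    # Stand-in for a real generator that renders annotations into the docstring.
    return (cls.__doc__ or "") + "\n(generated parameter types would go here)"


def maybe_inject(cls):
    # Hypothetical decorator: regenerate the docstring only when explicitly
    # opted in, so committed docstrings stay the default zero-cost path.
    if os.environ.get("SKLEARN_DYNAMIC_DOCSTRINGS") == "1":
        cls.__doc__ = docstring_from_annotations(cls)
    return cls


@maybe_inject
class Estimator:
    """Estimator docs."""
```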

So let's not make any decision here on one approach or the other until this is profiled?

@thomasjpfan
Member Author

But the downside here is that while for core sklearn you are able to verify this, 3rd party estimators might become out of sync.

In all cases we will recommend that third party estimators have type annotations. If we go this route, we can provide a tool and/or function to help them make sure their docstrings are consistent.

So let's not make any decision here on one approach or the other until this is profiled?

Agreed, I will profile the current injection solution before making any changes.

@rth
Member

rth commented Jul 4, 2020

The idea of not dynamically generating docstrings, but only validating them sounds good and is likely more reliable.

we can provide a tool

I would strongly push to get this tool included in some specialized docstring package. It could be numpydoc (numpy/numpydoc#196), or it looks like https://github.com/terrencepreilly/darglint does some of it. I understand the style of what is put in docstring types can be customized, but if we rely on such a tool, it would be nice to try standardizing this in the scientific python community while we are at it.

@thomasjpfan
Member Author

So let's not make any decision here on one approach or the other until this is profiled?

The overhead of injection comes from module import time, so it will make the first import slower:

Without injection

%time from sklearn.linear_model import LogisticRegression
# CPU times: user 8 µs, sys: 1e+03 ns, total: 9 µs
# Wall time: 11.9 µs

With injection

%time from sklearn.linear_model import LogisticRegression
# CPU times: user 694 ms, sys: 161 ms, total: 854 ms
# Wall time: 854 ms

Given this, I would prefer the checking. The other benefit of checking is that it makes it easier to check the docstrings of functions as well. To be specific, at first the check will have only a Python interface that can be called with pytest. Next we can create a simple Python script to check specific classes or functions and generate the correct docstring types so a developer can copy and paste.
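A sketch of what that linter core could look like; the parsing and rendering below are guesses for illustration, not the PR's implementation.

```python
import re
from typing import Literal, get_args, get_origin, get_type_hints


def render_type(tp):
    # Assumed numpydoc-flavoured rendering of an annotation.
    if get_origin(tp) is Literal:
        return "{%s}" % ", ".join(repr(a) for a in get_args(tp))
    return getattr(tp, "__name__", str(tp))


def docstring_types(doc):
    # Collect "name : type[, default=...]" entries under a Parameters heading.
    types, in_params = {}, False
    for raw in doc.splitlines():
        line = raw.strip()
        if line == "Parameters":
            in_params = True
        elif in_params:
            m = re.match(r"^(\w+) : (.+?)(?:, default=.*)?$", line)
            if m:
                types[m.group(1)] = m.group(2)
    return types


def inconsistent_params(cls):
    # Report parameters whose documented type disagrees with the annotation.
    hints = get_type_hints(cls.__init__)
    documented = docstring_types(cls.__doc__ or "")
    return [name for name, tp in hints.items()
            if name in documented and documented[name] != render_type(tp)]


class Demo:
    """Demo estimator.

    Parameters
    ----------
    penalty : {'l1', 'l2'}, default='l2'
    tol : int, default=1e-4
    """

    def __init__(self, penalty: Literal['l1', 'l2'] = 'l2', tol: float = 1e-4):
        pass
```

For the Demo class, inconsistent_params(Demo) reports ['tol'] because the docstring says int while the annotation is float.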

It could be numpydoc (numpy/numpydoc#196), or it looks like terrencepreilly/darglint does some of it.

I'll take a look to see if we can take advantage of darglint.

I understand the style of what is put in docstring types can be customized, but if we rely on such a tool, it would be nice to try standardizing this in the scientific python community while we are at it.

I agree standardizing would be nice, but I assume it will take a long time. I thought of doing this on a step-by-step basis:

  1. Complete the typing + checks for scikit-learn.
  2. Use 1. to have an implementation and to demonstrate value to the community.
  3. Discuss with overall community on how to standardize the docstring. (I assume this will take the most time)
  4. We may need to change the docstring types in scikit-learn, but this should be easy since they can be automatically generated.

@rth
Member

rth commented Aug 29, 2020

This looks promising indeed. One concern is the addition of the typing_extensions dependency. Vendoring typing_extensions.Literal doesn't look too straightforward. What we could do is at least define, in utils.fixes:

try:
    from typing import Literal
except ImportError:
    # Python < 3.8 (Literal was added to typing in 3.8, PEP 586)
    from typing_extensions import Literal

and only add it conditionally as a dependency for Python < 3.8, to indicate that we won't need it in 4-5 years.
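Declared with a PEP 508 environment marker, the conditional dependency could look like the following setup.py fragment; the version pin is illustrative only.

```python
# The marker keeps typing_extensions out of environments whose stdlib
# typing already provides Literal (Python >= 3.8).
install_requires = [
    "typing_extensions>=3.7.4; python_version < '3.8'",
]
```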

About whether this conversion is sufficient for our needs, I suppose we'll see once we use it on more classes.

I am still a bit bothered that this "type -> str" conversion is done in scikit-learn as opposed to some specialized library, but I imagine we could also contribute it somewhere once we have a solution that we are happy with.

COPYING Outdated
@@ -1,32 +1,29 @@
New BSD License
Member

That was merged in a different PR. Maybe merge master in?

@thomasjpfan
Member Author

thomasjpfan commented Aug 29, 2020

It would not be hard to vendor all of typing_extensions since it is all one file: https://github.com/python/typing/blob/master/typing_extensions/src_py3/typing_extensions.py

I would need to check if all the nice typing hints still work if we vendor this.

Edit: I think since the typing_extensions module is GPL we would not be able to vendor it.

@rth
Member

rth commented Aug 30, 2020

I think since the typing_extensions module is GPL we would not be able to vendor it.

It's compatible with GPL but it's not GPL (it's the same license as CPython source code) as far as I can tell. So copyleft should not be an issue I think.

@rth rth mentioned this pull request Aug 31, 2020
Base automatically changed from master to main January 22, 2021 10:52
@rth
Member

rth commented Jun 11, 2021

So I think the blocker in this PR is the addition of typing_extensions as a hard dependency for Python < 3.8. I'm not necessarily opposed to it, but it would need more discussion.

Could we, for now, replace Literal with str in this first iteration? That would make it much easier to merge.

@rth
Member

rth commented Jun 11, 2021

If #16705 (comment) works for Literal that would be even better.

@thomasjpfan I think if we want to move forward on this, it is better to keep it simple.

  1. Add types to __init__, in this example for LogisticRegression.
  2. In a separate PR, propose the check for docstrings.

@thomasjpfan
Member Author

thomasjpfan commented Jun 11, 2021

If #16705 (comment) works for Literal that would be even better.

Now that we are 3.7+, from __future__ import annotations works. :)
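With PEP 563, annotations are stored as plain strings at runtime, so Literal does not even need to be importable outside of type checking. A small illustration of this future-import approach (the function is invented here):

```python
from __future__ import annotations  # PEP 563: annotations stored as strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only the type checker ever resolves this name.
    from typing import Literal


def fit(penalty: Literal['l1', 'l2'] = 'l2') -> None:
    pass


# At runtime the annotation is just the raw, unevaluated source text.
annotation = fit.__annotations__['penalty']
```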

@thomasjpfan thomasjpfan changed the title ENH Adds typing support with linting to check docstring ENH Adds typing support to LogisticRegression Jun 11, 2021
@rth
Member

rth commented Jun 12, 2021

What was the consensus on type annotations in the generated documentation? It's true that all those Literal are maybe not very readable there, at least with the current formatting.

@thomasjpfan thomasjpfan reopened this Apr 24, 2022
C: float = 1.0,
fit_intercept: bool = True,
intercept_scaling: float = 1,
class_weight: Union[dict, Literal["balanced"]] = None,
Contributor

As annotations are not evaluated with from __future__ import annotations, we can directly use the future | syntax inside them. It is already supported by mypy.

Also, don't use the dict type unless you want to allow insertion operations, and specify the known types for keys and values.

In addition, the allowed value None was missing.

Suggested change
class_weight: Union[dict, Literal["balanced"]] = None,
class_weight: Mapping[str | float, float] | Literal["balanced"] | None = None,

fit_intercept: bool = True,
intercept_scaling: float = 1,
class_weight: Union[dict, Literal["balanced"]] = None,
random_state: Union[int, np.random.RandomState, None] = None,
Contributor

It would probably be useful to define a RandomLike type.

Suggested change
random_state: Union[int, np.random.RandomState, None] = None,
random_state: int | np.random.RandomState | None = None,

multi_class: Literal["auto", "ovr", "multinomial"] = "auto",
verbose: int = 0,
warm_start: bool = False,
n_jobs: Optional[int] = None,
Contributor

Suggested change
n_jobs: Optional[int] = None,
n_jobs: int | None = None,

verbose: int = 0,
warm_start: bool = False,
n_jobs: Optional[int] = None,
l1_ratio: Optional[float] = None,
Contributor

Suggested change
l1_ratio: Optional[float] = None,
l1_ratio: float | None = None,

Comment on lines +20 to +21
from typing import Union
from typing import Optional
Contributor

Suggested change
from typing import Union
from typing import Optional
from typing import Mapping

7 participants