-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Adds typing support to LogisticRegression #17799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Very interesting! So it seem to work for the docstring in https://110636-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.linear_model.LogisticRegression.html The init becomes somewhat difficult to read, but maybe that's something one can get used to. It would be interesting to benchmark how it takes to init this class before/after. |
Honestly I like the type annotations in the init. It just makes the insanity that is our parameters more visible ;) Interesting discussion here: scikit-image/scikit-image#3153 |
sklearn/linear_model/_logistic.py
Outdated
@@ -1036,7 +1042,7 @@ class LogisticRegression(BaseEstimator, LinearClassifierMixin, | |||
|
|||
Parameters | |||
---------- | |||
penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2' | |||
penalty : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think space before :
is necessary then anymore, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the space is a part of the rst definition list spec
Optional classifiers may follow the term on the same line, each after an inline " : " (space, colon, space).
Now that I look at the spec, technically we do not need the :
at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that space is part of numpydoc style guide. Google style is for example without it, See comparison.
Because to my understanding, the internal Sphinx syntax expected is:
:param path: The path of the file to wrap
:type path: str
:param field_storage: The :class:`FileStorage` instance to wrap
:type field_storage: FileStorage
:param temporary: Whether or not to delete the file when the File
instance is destructed
:type temporary: bool
:returns: A buffered writable file descriptor
:rtype: BufferedFileStorage
So I am guessing sklearn is using napoleon to convert numpy docstring to internal Sphinx one. So one should check how napoleon converts when there is no space before :
?
(This is all me guessing. I do not know those details of sklearn.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sklearn/linear_model/_logistic.py
Outdated
("n_classes", "n_features"))] # noqa | ||
intercept_: Annotated[np.ndarray, Shape((1,), # noqa | ||
("n_classes",))] # noqa | ||
n_iter_: Annotated[np.ndarray, Shape(("n_classes",), (1,))] # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
sklearn/linear_model/_logistic.py
Outdated
C: float = 1.0, | ||
fit_intercept: bool = True, | ||
intercept_scaling: float = 1, | ||
class_weight: Union[dict, Literal['balanced']] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to specify dict
more precisely.
from ..utils._typing import Annotated | ||
from ..utils._typing import Shape | ||
from typing import Union | ||
from typing import Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why these are all on a separate line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a personal style choice so diffs are easier to look at. I also like:
from typing import (
Union,
Optional,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True as well.
Unrelated: Does sklearn uses black?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are slowly working our way on deciding this: #11336
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is relevant here: #11336 (comment)
If we go with adding typing information, then black is reasonable for its style choices.
Maybe this could be improved with style change to how this is shown, with every parameter being in one line or something. Becoming something like TOC you can click on and get to its documentation. |
I like |
Yea I am not too happy with doing any string parsing during runtime. After speaking with @amueller , I would be okay with two sources of truth and have a linter than makes sure that docstring type is consistent with the typing annotation. |
So then I would say: still one source of truth, only that docstrings are compiled and committed into the repository. You can use CI then to check that author of PR did not forget to do so. But downside here i that while for core sklearn you are able to verify this, 3rd party estimators might become out of sync. One option could be is to have an env variable which you can set to get docstrings to be dynamically updated at runtime as well. Do note that this updating happens only at module load time, only once per process run. So I would really first check this performance hit because it might be premature optimization to try to have compiled docstrings in the repository. It adds complexity. So if loading sklearn or its estimators before and after this change is not significantly different, maybe it is not a problem. So let's not do any decision here of one or the other approach until this is profiled? |
In all cases we will recommend third party estimators have type annotations. If we go this route, we can provide a tool and/or function to help them make sure their docstring are consistent.
Agreed, I will profile the current injection solution before making any changes. |
The idea of not dynamically generating docstrings, but only validating them sounds good and is likely more reliable.
I would strongly push to get this took included in some specialized docstring packages. It could be numpydoc numpy/numpydoc#196 or it looks like https://github.com/terrencepreilly/darglint does some of it. I understand the style of what is put in doctring types can be customized but if we rely on such a tool, it would be nice to try standardizing this in the scientific python community while we are at it. |
The overhead of injection comes from module import time, so it will make the first import slower: Without injection%time from sklearn.linear_model import LogisticRegression
# CPU times: user 8 µs, sys: 1e+03 ns, total: 9 µs
# Wall time: 11.9 µs With injection%time from sklearn.linear_model import LogisticRegression
# CPU times: user 694 ms, sys: 161 ms, total: 854 ms
# Wall time: 854 ms Given this, I would prefer the checking. The other benefit with checking is this makes it easier to check the docstrings for functions as well. To be specific, at first the check will only have only a python interface that can be called with pytest. Next we can create a simple python script to check specific classes or functions and generate the correct docstring types so a developer can copy and paste.
I'll take a look at to see if we can take advantage of
I agree standardizing will be nice, but I assume it will take a long time. I thought of doing this in a step by step basis:
|
This looks promising indeed. Once concern is the addition of the try:
from typing import Litteral
except:
# Python <3.9
from typing_extension import Litteral and only add it conditionally as a dependency for Python <3.9, to indicate that we won't need it in 4-5 years. About whether this conversion is sufficient for our needs, I suppose we'll see once we use it on more classes. I am still a bit bothered that this "type -> str" conversion is done in scikit-learn as opposed to some specialized library, but I imagine we could also contribute it somewhere once we have a solution that we are happy with. |
COPYING
Outdated
@@ -1,32 +1,29 @@ | |||
New BSD License |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was merged in a different PR. Maybe merge master in?
Edit: I think since the typing_extensions module is GPL we would not be able to vendor it. |
It's compatible with GPL but it's not GPL (it's the same license as CPython source code) as far as I can tell. So copyleft should not be an issue I think. |
So I think the blocker in this PR is the addition of Could we for now replace |
If #16705 (comment) works for Litteral that would be even better. @thomasjpfan I think if we want to move forward on this better to keep it simple.
|
Now that we are 3.7+, using |
What was the consensus on types annotations in the generated documentation? It's true that all those Litteral are maybe not very readable there, at least with the current formatting. |
C: float = 1.0, | ||
fit_intercept: bool = True, | ||
intercept_scaling: float = 1, | ||
class_weight: Union[dict, Literal["balanced"]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As annotations are not evaluated with from __future__ import annotations
, we can use directly the future |
syntax inside them. It is already supported by Mypy.
Also, don't use the dict
type unless you want to allow inserting operations, and specify the known types for keys and values.
In addition, the allowed value None
was missing.
class_weight: Union[dict, Literal["balanced"]] = None, | |
class_weight: Mapping[str | float, float] | Literal["balanced"] | None = None, |
fit_intercept: bool = True, | ||
intercept_scaling: float = 1, | ||
class_weight: Union[dict, Literal["balanced"]] = None, | ||
random_state: Union[int, np.random.RandomState, None] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be useful to define a RandomLike
type.
random_state: Union[int, np.random.RandomState, None] = None, | |
random_state: int | np.random.RandomState | None = None, |
multi_class: Literal["auto", "ovr", "multinomial"] = "auto", | ||
verbose: int = 0, | ||
warm_start: bool = False, | ||
n_jobs: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n_jobs: Optional[int] = None, | |
n_jobs: int | None = None, |
verbose: int = 0, | ||
warm_start: bool = False, | ||
n_jobs: Optional[int] = None, | ||
l1_ratio: Optional[float] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l1_ratio: Optional[float] = None, | |
l1_ratio: float | None = None, |
from typing import Union | ||
from typing import Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from typing import Union | |
from typing import Optional | |
from typing import Mapping |
Reference Issues/PRs
Related to #16705
Related to #11170
What does this implement/fix? Explain your changes.
This PR experiments with static typing by:
LogisticRegression
.Annotated
from PEP 593 to provide metadata about the shape of andarray
.Adds ainject_docstring
keyword to__init_subclass__
to inject the type into the docstring. This allows us to not depend on numpydoc. We also have custom formating when we define types in our docstrings.Any other comments?
This PR is for adding simple types:
Not part of this PR and the future I can see this
Annotated
be useful with ranges:CC @rth