ENH Adds typing support to LogisticRegression #17799

thomasjpfan · 2020-07-01T02:56:31Z

Reference Issues/PRs

Related to #16705
Related to #11170

What does this implement/fix? Explain your changes.

This PR experiments with static typing by:

Adding typing to parameters and attributes of LogisticRegression.
Uses Annotated from PEP 593 to provide metadata about the shape of a ndarray.
Adds a inject_docstring keyword to __init_subclass__ to inject the type into the docstring. This allows us to not depend on numpydoc. We also have custom formating when we define types in our docstrings.
Another option is to have the type only in the generated docs so we would not need to inject the types during runtime.
This will help standardize the types in our docstrings since they are automatically generated.

Any other comments?

This PR is for adding simple types:

    def __init__(self,
                  penalty: Literal['l1', 'l2', 'elasticnet', 'none'] = 'l2',
                  *,
                  dual: bool = False,
                  tol: float = 1e-4,
                  C: float = 1.0,
                  fit_intercept: bool = True,
                  intercept_scaling: float = 1,
                  class_weight: Union[dict, Literal['balanced']] = None)

Not part of this PR and the future I can see this Annotated be useful with ranges:

max_features: Union[Annotated[float, Range(0, 1)], Annotated[int, Range(1, upper=None)]],
                    Literal['auto', 'sqrt', 'log2'], None]

CC @rth

rth · 2020-07-01T11:31:55Z

Very interesting! So it seem to work for the docstring in https://110636-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.linear_model.LogisticRegression.html

The init becomes somewhat difficult to read, but maybe that's something one can get used to.

It would be interesting to benchmark how it takes to init this class before/after.

amueller · 2020-07-01T17:07:54Z

Honestly I like the type annotations in the init. It just makes the insanity that is our parameters more visible ;)
So is the goal of inject_docstring to have this info available in jupyter? I think this is not something that should be fixed in sklearn. How are other projects handling this?

Interesting discussion here: scikit-image/scikit-image#3153

mitar · 2020-07-02T07:14:23Z

sklearn/linear_model/_logistic.py

@@ -1036,7 +1042,7 @@ class LogisticRegression(BaseEstimator, LinearClassifierMixin,

    Parameters
    ----------
-    penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
+    penalty :


I do not think space before : is necessary then anymore, no?

I think the space is a part of the rst definition list spec

Optional classifiers may follow the term on the same line, each after an inline " : " (space, colon, space).

Now that I look at the spec, technically we do not need the : at all.

No, that space is part of numpydoc style guide. Google style is for example without it, See comparison.

Because to my understanding, the internal Sphinx syntax expected is:

:param path: The path of the file to wrap :type path: str :param field_storage: The :class:`FileStorage` instance to wrap :type field_storage: FileStorage :param temporary: Whether or not to delete the file when the File instance is destructed :type temporary: bool :returns: A buffered writable file descriptor :rtype: BufferedFileStorage

So I am guessing sklearn is using napoleon to convert numpy docstring to internal Sphinx one. So one should check how napoleon converts when there is no space before :?

(This is all me guessing. I do not know those details of sklearn.)

So I am guessing sklearn is using napoleon to convert numpy docstring to internal Sphinx one.

We are currently using numpydoc.

The space is picked up as a part of the parameter name:

no space

with space

mitar · 2020-07-02T07:15:16Z

sklearn/linear_model/_logistic.py

+                                       ("n_classes", "n_features"))]  # noqa
+    intercept_: Annotated[np.ndarray, Shape((1,),  # noqa
+                                            ("n_classes",))]  # noqa
+    n_iter_: Annotated[np.ndarray, Shape(("n_classes",), (1,))]  # noqa


mitar · 2020-07-02T07:16:52Z

sklearn/linear_model/_logistic.py

+                 C: float = 1.0,
+                 fit_intercept: bool = True,
+                 intercept_scaling: float = 1,
+                 class_weight: Union[dict, Literal['balanced']] = None,


You might want to specify dict more precisely.

mitar · 2020-07-02T07:17:42Z

sklearn/linear_model/_logistic.py

+from ..utils._typing import Annotated
+from ..utils._typing import Shape
+from typing import Union
+from typing import Optional


Any reason why these are all on a separate line?

It's a personal style choice so diffs are easier to look at. I also like:

from typing import ( Union, Optional, )

True as well.

Unrelated: Does sklearn uses black?

We are slowly working our way on deciding this: #11336

I think this comment is relevant here: #11336 (comment)

If we go with adding typing information, then black is reasonable for its style choices.

mitar · 2020-07-02T07:20:30Z

Honestly I like the type annotations in the init. It just makes the insanity that is our parameters more visible ;)

Maybe this could be improved with style change to how this is shown, with every parameter being in one line or something. Becoming something like TOC you can click on and get to its documentation.

mitar · 2020-07-02T07:21:49Z

I like inject_docstring, but maybe performance implications should be checked?

thomasjpfan · 2020-07-03T17:48:28Z

I like inject_docstring, but maybe performance implications should be checked?

Yea I am not too happy with doing any string parsing during runtime.

After speaking with @amueller , I would be okay with two sources of truth and have a linter than makes sure that docstring type is consistent with the typing annotation.

mitar · 2020-07-03T19:22:09Z

After speaking with @amueller , I would be okay with two sources of truth and have a linter than makes sure that docstring type is consistent with the typing annotation.

So then I would say: still one source of truth, only that docstrings are compiled and committed into the repository. You can use CI then to check that author of PR did not forget to do so.

But downside here i that while for core sklearn you are able to verify this, 3rd party estimators might become out of sync.

One option could be is to have an env variable which you can set to get docstrings to be dynamically updated at runtime as well. Do note that this updating happens only at module load time, only once per process run. So I would really first check this performance hit because it might be premature optimization to try to have compiled docstrings in the repository. It adds complexity. So if loading sklearn or its estimators before and after this change is not significantly different, maybe it is not a problem.

So let's not do any decision here of one or the other approach until this is profiled?

thomasjpfan · 2020-07-03T20:29:27Z

But downside here i that while for core sklearn you are able to verify this, 3rd party estimators might become out of sync.

In all cases we will recommend third party estimators have type annotations. If we go this route, we can provide a tool and/or function to help them make sure their docstring are consistent.

So let's not do any decision here of one or the other approach until this is profiled?

Agreed, I will profile the current injection solution before making any changes.

rth · 2020-07-04T09:05:11Z

The idea of not dynamically generating docstrings, but only validating them sounds good and is likely more reliable.

we can provide a tool

I would strongly push to get this took included in some specialized docstring packages. It could be numpydoc numpy/numpydoc#196 or it looks like https://github.com/terrencepreilly/darglint does some of it. I understand the style of what is put in doctring types can be customized but if we rely on such a tool, it would be nice to try standardizing this in the scientific python community while we are at it.

thomasjpfan · 2020-07-06T16:44:53Z

So let's not do any decision here of one or the other approach until this is profiled?

The overhead of injection comes from module import time, so it will make the first import slower:

Without injection

%time from sklearn.linear_model import LogisticRegression
# CPU times: user 8 µs, sys: 1e+03 ns, total: 9 µs
# Wall time: 11.9 µs

With injection

%time from sklearn.linear_model import LogisticRegression
# CPU times: user 694 ms, sys: 161 ms, total: 854 ms
# Wall time: 854 ms

Given this, I would prefer the checking. The other benefit with checking is this makes it easier to check the docstrings for functions as well. To be specific, at first the check will only have only a python interface that can be called with pytest. Next we can create a simple python script to check specific classes or functions and generate the correct docstring types so a developer can copy and paste.

It could be numpydoc numpy/numpydoc#196 or it looks like terrencepreilly/darglint does some of it.

I'll take a look at to see if we can take advantage of Darglint.

I understand the style of what is put in doctring types can be customized but if we rely on such a tool, it would be nice to try standardizing this in the scientific python community while we are at it.

I agree standardizing will be nice, but I assume it will take a long time. I thought of doing this in a step by step basis:

Complete the typing + checks for scikit-learn.
Use 1. to have an implementation and to demonstrate value to community.
Discuss with overall community on how to standardize the docstring. (I assume this will take the most time)
We may need to change the docstring types in scikit-learn, but this should be easy since they can be automatically generated.

rth · 2020-08-29T10:04:50Z

This looks promising indeed. Once concern is the addition of the typing_extension dependency. Vendoring typining_extension.Litteral doesn't look too straightforward. What we could to at least define, in utils.fixes

try:
	from typing import Litteral
except:
    # Python <3.9
    from typing_extension import Litteral

and only add it conditionally as a dependency for Python <3.9, to indicate that we won't need it in 4-5 years.

About whether this conversion is sufficient for our needs, I suppose we'll see once we use it on more classes.

I am still a bit bothered that this "type -> str" conversion is done in scikit-learn as opposed to some specialized library, but I imagine we could also contribute it somewhere once we have a solution that we are happy with.

rth · 2020-08-29T09:44:02Z

COPYING

@@ -1,32 +1,29 @@
-New BSD License


That was merged in a different PR. Maybe merge master in?

thomasjpfan · 2020-08-29T17:28:32Z

~~It would not be hard to vendor all of typing extensions since it is all one file: https://github.com/python/typing/blob/master/typing_extensions/src_py3/typing_extensions.py~~

~~I would need to check if all the nice typing hints still work if we vendor this.~~

Edit: I think since the typing_extensions module is GPL we would not be able to vendor it.

rth · 2020-08-30T12:16:48Z

I think since the typing_extensions module is GPL we would not be able to vendor it.

It's compatible with GPL but it's not GPL (it's the same license as CPython source code) as far as I can tell. So copyleft should not be an issue I think.

rth · 2021-06-11T07:50:29Z

So I think the blocker in this PR is the addition of typing_extension as a hard dependency for Python 3.9. I'm not necessarily opposed to it, but it would need more discussion.

Could we for now replace Litteral with str, in this first iteration? That would make it much easier to merge.

rth · 2021-06-11T07:54:27Z

If #16705 (comment) works for Litteral that would be even better.

@thomasjpfan I think if we want to move forward on this better to keep it simple.

Add types to init, in this example to LogisticRegression
In a separate PR propose the check for docstring.

thomasjpfan · 2021-06-11T13:55:44Z

If #16705 (comment) works for Litteral that would be even better.

Now that we are 3.7+, using from __future__ import annotations now works. :)

rth · 2021-06-12T14:47:03Z

What was the consensus on types annotations in the generated documentation? It's true that all those Litteral are maybe not very readable there, at least with the current formatting.

vnmabus · 2022-06-30T12:09:53Z

sklearn/linear_model/_logistic.py

+        C: float = 1.0,
+        fit_intercept: bool = True,
+        intercept_scaling: float = 1,
+        class_weight: Union[dict, Literal["balanced"]] = None,


As annotations are not evaluated with from __future__ import annotations, we can use directly the future | syntax inside them. It is already supported by Mypy.

Also, don't use the dict type unless you want to allow inserting operations, and specify the known types for keys and values.

In addition, the allowed value None was missing.

Suggested change

class_weight: Union[dict, Literal["balanced"]] = None,

class_weight: Mapping[str | float, float] | Literal["balanced"] | None = None,

vnmabus · 2022-06-30T12:11:00Z

sklearn/linear_model/_logistic.py

+        fit_intercept: bool = True,
+        intercept_scaling: float = 1,
+        class_weight: Union[dict, Literal["balanced"]] = None,
+        random_state: Union[int, np.random.RandomState, None] = None,


It would probably be useful to define a RandomLike type.

Suggested change

random_state: Union[int, np.random.RandomState, None] = None,

random_state: int | np.random.RandomState | None = None,

vnmabus · 2022-06-30T12:15:36Z

sklearn/linear_model/_logistic.py

+        multi_class: Literal["auto", "ovr", "multinomial"] = "auto",
+        verbose: int = 0,
+        warm_start: bool = False,
+        n_jobs: Optional[int] = None,


Suggested change

n_jobs: Optional[int] = None,

n_jobs: int | None = None,

vnmabus · 2022-06-30T12:15:51Z

sklearn/linear_model/_logistic.py

+        verbose: int = 0,
+        warm_start: bool = False,
+        n_jobs: Optional[int] = None,
+        l1_ratio: Optional[float] = None,


Suggested change

l1_ratio: Optional[float] = None,

l1_ratio: float | None = None,

vnmabus · 2022-06-30T12:18:00Z

sklearn/linear_model/_logistic.py

+from typing import Union
+from typing import Optional


Suggested change

from typing import Union

from typing import Optional

from typing import Mapping

WIP Adds typing support for automatic docstring generation

075d7e3

github-actions bot added module:linear_model module:utils labels Jul 1, 2020

thomasjpfan added 4 commits June 30, 2020 23:13

WIP Temporary fix

6b35ee1

WIP Temporary fix

e158440

WIP Install typing extensions for linting

4a925e1

WIP Try again

6cb1e79

amueller mentioned this pull request Jul 1, 2020

add type annotations in some modules in skimage.segmentation scikit-image/scikit-image#4794

Draft

4 tasks

mitar reviewed Jul 2, 2020

View reviewed changes

mitar mentioned this pull request Jul 2, 2020

API split "model" parameters from "config" params. #17566

Open

Merge remote-tracking branch 'upstream/master' into typing

2930743

thomasjpfan added 7 commits July 8, 2020 00:19

ENH Adds tests with docstring type generator tool

87f6b07

STY Lint error

4c47776

Merge remote-tracking branch 'upstream/master' into typing

7578b63

MNT Adds F821 to ignore flag

d8c85d0

BLD Try to fix typing-extensions

2e67bc4

BUG Fix link

cf9ff0c

CI Fixes version?

2535e9f

rth reviewed Aug 29, 2020

View reviewed changes

COPYING Outdated

@@ -1,32 +1,29 @@

New BSD License

Copy link

Member

rth Aug 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was merged in a different PR. Maybe merge master in?

thomasjpfan added 2 commits August 29, 2020 13:14

Merge remote-tracking branch 'upstream/master' into typing

923ac16

CLN Move Literal into fixes

8d2aa0e

FIX Fixes spelling of typing_extensions

2378769

rth mentioned this pull request Aug 31, 2020

Support typing #16705

Closed

thomasjpfan added 3 commits August 31, 2020 18:53

MNT Soft dependency on typing_extensions

67c4574

MNT Soft dependency on typing_extensions

f05e47c

Merge remote-tracking branch 'upstream/master' into typing

9f258ad

Base automatically changed from master to main January 22, 2021 10:52

thomasjpfan added 6 commits January 23, 2021 21:28

Merge remote-tracking branch 'upstream/main' into typing

301eee1

ENH Simplify handling of typing

e2334ec

Merge remote-tracking branch 'upstream/main' into typing

7bbcec7

ENH Better diffs

459c0bf

ENH Simplify to only hyper-parameters

39d3fea

ENH Adds cvsplit

b43dd9c

jjerphan mentioned this pull request Jun 9, 2021

Make BaseShuffleSplit public #20056

Merged

thomasjpfan added 2 commits June 11, 2021 09:19

Merge remote-tracking branch 'upstream/main' into typing

63c0591

CLN Smaller diff

09cad30

thomasjpfan changed the title ~~ENH Adds typing support with linting to check docstring~~ ENH Adds typing support to LogisticRegression Jun 11, 2021

thomasjpfan closed this Apr 24, 2022

thomasjpfan reopened this Apr 24, 2022

Merge remote-tracking branch 'upstream/main' into typing

64079ea

vnmabus reviewed Jun 30, 2022

View reviewed changes

	class_weight: Union[dict, Literal["balanced"]] = None,
	class_weight: Mapping[str \| float, float] \| Literal["balanced"] \| None = None,

	random_state: Union[int, np.random.RandomState, None] = None,
	random_state: int \| np.random.RandomState \| None = None,

	l1_ratio: Optional[float] = None,
	l1_ratio: float \| None = None,

	from typing import Union
	from typing import Optional
	from typing import Mapping

Uh oh!

ENH Adds typing support to LogisticRegression #17799

Are you sure you want to change the base?

ENH Adds typing support to LogisticRegression #17799

Uh oh!

Conversation

thomasjpfan commented Jul 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

rth commented Jul 1, 2020

Uh oh!

amueller commented Jul 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

no space

with space

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mitar commented Jul 2, 2020

Uh oh!

mitar commented Jul 2, 2020

Uh oh!

thomasjpfan commented Jul 3, 2020

Uh oh!

mitar commented Jul 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasjpfan commented Jul 3, 2020

Uh oh!

rth commented Jul 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasjpfan commented Jul 6, 2020

Without injection

With injection

Uh oh!

rth commented Aug 29, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thomasjpfan commented Aug 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rth commented Aug 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rth commented Jun 11, 2021

Uh oh!

rth commented Jun 11, 2021

Uh oh!

thomasjpfan commented Jun 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rth commented Jun 12, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thomasjpfan commented Jul 1, 2020 •

edited

Loading

amueller commented Jul 1, 2020 •

edited

Loading

mitar commented Jul 3, 2020 •

edited

Loading

rth commented Jul 4, 2020 •

edited

Loading

thomasjpfan commented Aug 29, 2020 •

edited

Loading

rth commented Aug 30, 2020 •

edited

Loading

thomasjpfan commented Jun 11, 2021 •

edited

Loading