DOC Ensures that LatentDirichletAllocation passes numpydoc validation #20402


Closed
wants to merge 1 commit
1 change: 0 additions & 1 deletion maint_tools/test_docstrings.py
@@ -93,7 +93,6 @@
"LassoLars",
"LassoLarsCV",
"LassoLarsIC",
"LatentDirichletAllocation",
"LedoitWolf",
"LinearDiscriminantAnalysis",
"LinearRegression",
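For context on what this one-line deletion does: classes named in `DOCSTRING_IGNORE_LIST` are skipped by the docstring test, so removing `"LatentDirichletAllocation"` opts it into numpydoc validation. The sketch below is illustrative, not scikit-learn's actual test code; the summary-period rule (numpydoc check SS03) stands in for the full validation suite:

```python
# Minimal sketch of an ignore-list driven docstring check (hypothetical
# helper names; the real test lives in maint_tools/test_docstrings.py and
# calls numpydoc's validator).
DOCSTRING_IGNORE_LIST = ["LassoLars", "LedoitWolf"]  # class under review removed

def validate_summary(name, docstring):
    """Return a list of numpydoc-style error strings (empty = passes)."""
    if name in DOCSTRING_IGNORE_LIST:
        return []  # still opted out of validation
    errors = []
    summary = docstring.strip().splitlines()[0].strip()
    if not summary.endswith("."):
        errors.append("SS03 Summary does not end with a period")
    return errors

# The one-character docstring fix in this diff clears exactly this check:
assert validate_summary(
    "LatentDirichletAllocation",
    "Latent Dirichlet Allocation with online variational Bayes algorithm",
) == ["SS03 Summary does not end with a period"]
assert validate_summary(
    "LatentDirichletAllocation",
    "Latent Dirichlet Allocation with online variational Bayes algorithm.",
) == []
```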
42 changes: 26 additions & 16 deletions sklearn/decomposition/_lda.py
@@ -139,7 +139,7 @@ def _update_doc_distribution(


class LatentDirichletAllocation(TransformerMixin, BaseEstimator):
"""Latent Dirichlet Allocation with online variational Bayes algorithm
"""Latent Dirichlet Allocation with online variational Bayes algorithm.

.. versionadded:: 0.17

@@ -188,7 +188,7 @@ class LatentDirichletAllocation(TransformerMixin, BaseEstimator):
``n_samples``, the update method is same as batch learning. In the
literature, this is called kappa.

learning_offset : float, default=10.
learning_offset : float, default=10.0
A (positive) parameter that downweights early iterations in online
learning. It should be greater than 1.0. In the literature, this is
called tau_0.
@@ -278,6 +278,23 @@ class LatentDirichletAllocation(TransformerMixin, BaseEstimator):
Prior of topic word distribution `beta`. If the value is None, it is
`1 / n_components`.

See Also
--------
sklearn.discriminant_analysis.LinearDiscriminantAnalysis:
A classifier with a linear decision boundary, generated by fitting
class conditional densities to the data and using Bayes’ rule.

References
----------
.. [1] "Online Learning for Latent Dirichlet Allocation", Matthew D.
Hoffman, David M. Blei, Francis Bach, 2010

[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei,
Chong Wang, John Paisley, 2013

[3] Matthew D. Hoffman's onlineldavb code. Link:
https://github.com/blei-lab/onlineldavb

Comment on lines +292 to +295

Member

I think,

Suggested change
[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei,
Chong Wang, John Paisley, 2013
[3] Matthew D. Hoffman's onlineldavb code. Link:
.. [2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei,
Chong Wang, John Paisley, 2013
.. [3] Matthew D. Hoffman's onlineldavb code. Link:

but need to check the rendering (the ci/circleci doc artifact CI job); currently it doesn't look ideal: https://142780-843222-gh.circle-artifacts.com/0/doc/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Member

It might raise an error if the article is not linked (somewhere we expect [2]_) in the docstring. If this is not the case, we could either link it where it is meaningful or remove the reference.
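The reviewers' concern hinges on reST citation syntax: a `.. [n]` directive declares a citation target, and it is only resolved when some `[n]_` reference cites it in the body text. As a small stdlib illustration (the docstring here is a made-up sample, and the regexes are a rough approximation of what Sphinx actually resolves):

```python
import re

docstring = """Summary line.

Online variational inference for LDA [1]_.

References
----------
.. [1] "Online Learning for Latent Dirichlet Allocation", Matthew D.
   Hoffman, David M. Blei, Francis Bach, 2010
"""

# Citation targets are declared with ".. [n]" at line start and cited
# inline as "[n]_".
targets = set(re.findall(r"^\.\. \[(\d+)\]", docstring, flags=re.M))
uses = set(re.findall(r"\[(\d+)\]_", docstring))

# A declared-but-never-cited target is exactly what the reviewer warns
# may raise an error during the docs build.
unreferenced = targets - uses
assert targets == {"1"} and unreferenced == set()
```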

Examples
--------
>>> from sklearn.decomposition import LatentDirichletAllocation
@@ -293,18 +310,6 @@ class LatentDirichletAllocation(TransformerMixin, BaseEstimator):
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
[0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586 ]])

References
----------
.. [1] "Online Learning for Latent Dirichlet Allocation", Matthew D.
Hoffman, David M. Blei, Francis Bach, 2010

[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei,
Chong Wang, John Paisley, 2013

[3] Matthew D. Hoffman's onlineldavb code. Link:
https://github.com/blei-lab/onlineldavb

"""

def __init__(
@@ -539,10 +544,12 @@ def partial_fit(self, X, y=None):
Document word matrix.

y : Ignored
Not used, present here for API consistency by convention.

Returns
-------
self
self:
Member

Suggested change
self:
self

Member

self : object for consistency with other places in the package

Member

Yes, but I don't think it's correct. The base class for all estimators is BaseEstimator, not object, so I think there is no point in indicating the object dtype.

Member

ok

Partially fitted estimator.
"""
self._check_params()
first_time = not hasattr(self, "components_")
@@ -587,10 +594,12 @@ def fit(self, X, y=None):
Document word matrix.

y : Ignored
Not used, present here for API consistency by convention.

Returns
-------
self
self:
Member

Suggested change
self:
self

Member

self : object

Member
@rth rth Jun 28, 2021

The initial version was correct for cases where there are no types, and it's a bit pointless to indicate the object dtype for estimators, which is also not accurate: https://numpydoc.readthedocs.io/en/latest/format.html#parameters

cc @glemaitre

Member

> it's a bit pointless to indicate object dtype for estimators which is also not accurate

I agree. However, we have this way of documenting everywhere. I would prefer to keep this inaccurate convention until we do a find/replace regex in another PR. Would it not be easier to make the change?

Member
@rth rth Jun 28, 2021

Right, but my point is that there is no issue with this line of docstring. It should pass the numpydoc validation, I think? And if so, I would rather we didn't change it instead of introducing unnecessary code churn with a solution we know is not helpful / correct.

`self : object` is used elsewhere, but 27% of estimators still use the current convention:

$ rg "^\s+self$" | grep self | wc -l
38
$ rg "^\s+self\s*:$" | grep self | wc -l
2
$ rg "^\s+self\s*:\s*object$" | grep self | wc -l
103
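The three rg patterns above distinguish a bare `self`, a trailing-colon `self:`, and the `self : object` form. They can be reproduced with Python's re module on sample docstring lines (the lines below are illustrative, not the repository's actual counts):

```python
import re

# One sample line per convention found by the rg commands above.
lines = [
    "        self",            # bare convention (matches ^\s+self$)
    "        self:",           # variant introduced in this PR
    "        self : object",   # majority convention in the codebase
]

# re.fullmatch anchors both ends, mirroring rg's ^...$ patterns.
bare = [l for l in lines if re.fullmatch(r"\s+self", l)]
colon = [l for l in lines if re.fullmatch(r"\s+self\s*:", l)]
obj = [l for l in lines if re.fullmatch(r"\s+self\s*:\s*object", l)]

# Each sample line matches exactly one pattern; the patterns are disjoint.
assert (len(bare), len(colon), len(obj)) == (1, 1, 1)
```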

Contributor Author

@rth Thank you so much for reviewing my PR. The numpydoc validation requires a description, and that's the reason I added a tautology. I'll be more than happy to make any changes according to your guidance.

Member

I think the description is ok. The point of @rth is about `self : object` vs `self` alone on the current line, not the description below.

Fitted estimator.
"""
self._check_params()
X = self._check_non_neg_array(
@@ -799,6 +808,7 @@ def score(self, X, y=None):
Document word matrix.

y : Ignored
Not used, present here for API consistency by convention.

Returns
-------