
Tests for attribute documentation #13385


Closed
wants to merge 8 commits

Conversation

amueller
Member

@amueller amueller commented Mar 4, 2019

Not sure if this has been started before?
This adds a test to check that all public attributes are documented.
I also added some missing attributes. It might actually be tricky to make this pass, since some attributes are conditional on init parameters (maybe this could be figured out by parsing the docstring for "only present" or something like that?), and some are deprecated and not documented.

I think we should at least try to fix the errors. I can separate this into two PRs.
There's still 99 estimators with at least one undocumented attribute (but LogisticRegression ain't one).

Some are a bit suspicious, for example the ExtraTreeRegressor (and some other tree-based regressors) have classes_.

Would be nice if someone ran the test and fixed these up. Current output is here:
https://pastebin.com/FtAECdE6
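
For anyone who wants to reproduce this locally, here is a minimal, illustrative sketch of a fit-and-check test of this kind (not the code in this PR; it assumes numpydoc is installed and uses LogisticRegression with a toy dataset purely as an example):

import numpy as np
from numpydoc.docscrape import ClassDoc
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).rand(20, 3)
y = np.array([0, 1] * 10)
est = LogisticRegression().fit(X, y)

# Public fitted attributes: names ending in an underscore, not starting with one.
fitted_attrs = {a for a in vars(est) if a.endswith("_") and not a.startswith("_")}

# Attributes listed in the "Attributes" section of the class docstring (via numpydoc).
documented = {name for name, _, _ in ClassDoc(type(est))["Attributes"]}

undocumented = fitted_attrs - documented
assert not undocumented, f"Undocumented attributes: {undocumented}"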

@agramfort
Member

+1000 on this one

@jnothman
Member

jnothman commented Mar 5, 2019 via email

Member

@jnothman jnothman left a comment


Note that we otherwise set up docstring checking in sklearn/tests/test_docstring_parameters.py, and run it only in some test runs. Was that for speed, or only to avoid a hard dependency on numpydoc?

@amueller
Member Author

amueller commented Mar 5, 2019

@jnothman Didn't you implement that other test? This test is quite different from the existing tests as it needs to instantiate estimators. @thomasjpfan suggested we could avoid that by parsing the source but I think it might be fine to do it the way it's done now.

Right now I'm worried more about false positives than false negatives, because they make merging this hard/impossible. But I might be able to find a way around that by adding enough exceptions?

@thomasjpfan
Member

thomasjpfan commented Mar 5, 2019

Here is an approach to parse the source to get the attributes:

import re
from inspect import getsource, getmro

from sklearn.base import BaseEstimator  # needed for the MRO stopping condition

# Match assignments like `self.coef_ = ...`, skipping private names such as `self._tree`.
att_match = re.compile(r'self\.(?!_)(\w+_)\W')

def get_attributes(target_cls):
    attributes = set()
    # Walk the MRO so attributes assigned in intermediate base classes are included,
    # stopping before BaseEstimator itself.
    for current_cls in getmro(target_cls):
        if current_cls is BaseEstimator:
            break
        attributes |= set(att_match.findall(getsource(current_cls)))
    return attributes

from sklearn.ensemble import BaggingClassifier
get_attributes(BaggingClassifier)

This returns:

{'base_estimator_',
 'classes_',
 'estimators_',
 'estimators_features_',
 'estimators_samples_',
 'n_classes_',
 'n_features_',
 'oob_decision_function_',
 'oob_score_'}

@amueller
Member Author

amueller commented Mar 5, 2019

thanks @TomDLT !

@jnothman
Member

jnothman commented Mar 7, 2019 via email

@thomasjpfan
Member

This PR's approach calls fit and then looks at the attributes of the fitted estimator, checking whether their names end in _.

My regex approach looks at the source code of the class and finds all strings that match 'self\.(\w+_)\W' (starts with self., ends with _).

The major advantage of parsing the source code is that it can find all the attributes, while calling fit may not find them all. For example, the attribute explained_variance_ratio_ in LinearDiscriminantAnalysis is only defined for certain solvers.
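
For illustration (an editorial sketch, not code from this PR; solver behaviour may differ between scikit-learn versions):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.RandomState(0).rand(20, 4)
y = np.array([0, 1] * 10)

# With solver='lsqr' the attribute is never set, so a fit-based scan misses it.
lsqr = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)
print(hasattr(lsqr, "explained_variance_ratio_"))   # False

# With solver='eigen' it is set, so the same scan would report it.
eigen = LinearDiscriminantAnalysis(solver="eigen").fit(X, y)
print(hasattr(eigen, "explained_variance_ratio_"))  # True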

@jnothman
Member

Sorry for being unclear. I didn't mean to ask what the difference in methodology is... I hoped you could just produce a list of the differences in extracted attributes.

@thomasjpfan
Member

thomasjpfan commented Mar 11, 2019

Here are the attributes found by the regex approach on the HEAD of this PR: https://pastebin.com/iT4m7pUV

Here is what check_attribute_docstrings looks like with the regex approach.

@jnothman
Member

jnothman commented Mar 12, 2019

Well the approaches are obviously complementary. > means found by the regex method only; < means found by the fit-and-check method only. I've not yet analysed for false positives (which should only occur for the regex method).

> ARDRegression: 'X_offset_'
> ARDRegression: 'X_scale_'
< BaggingClassifier: 'oob_decision_function_'
< BaggingClassifier: 'oob_score_'
< BaggingRegressor: 'oob_prediction_'
< BaggingRegressor: 'oob_score_'
< BernoulliNB: 'intercept_'
> BernoulliRBM: 'random_state_'
> CheckingClassifier: 'classes_'
< ComplementNB: 'intercept_'
> CountVectorizer: 'fixed_vocabulary_'
< ExtraTreesClassifier: 'oob_decision_function_'
< ExtraTreesClassifier: 'oob_score_'
< ExtraTreesRegressor: 'oob_prediction_'
< ExtraTreesRegressor: 'oob_score_'
> ExtraTreesRegressor: 'classes_'
> ExtraTreesRegressor: 'n_classes_'
> GaussianRandomProjection: 'n_component_'
> GaussianRandomProjection: 'n_components_'
> GradientBoostingClassifier: 'base_estimator_'
< GradientBoostingClassifier: 'oob_improvement_'
> GradientBoostingRegressor: 'base_estimator_'
> GradientBoostingRegressor: 'classes_'
< GradientBoostingRegressor: 'oob_improvement_'
> IsolationForest: 'oob_score_'
> IsotonicRegression: 'increasing_'
< KNeighborsClassifier: 'classes_'
< KNeighborsClassifier: 'effective_metric_'
< KNeighborsClassifier: 'effective_metric_params_'
< KNeighborsClassifier: 'outputs_2d_'
< KNeighborsRegressor: 'effective_metric_'
< KNeighborsRegressor: 'effective_metric_params_'
< KernelPCA: 'X_transformed_fit_'
< KernelPCA: 'dual_coef_'
> LassoCV: 'l1_ratio_'
> LassoLarsIC: 'active_'
> LassoLarsIC: 'coef_path_'
< LinearDiscriminantAnalysis: 'covariance_'
> MLPClassifier: 'best_validation_score_'
> MLPClassifier: 'validation_scores_'
> MLPRegressor: 'best_validation_score_'
< MeanShift Parameter 'labels_ :' has an empty type spec. Remove the colon
> MLPRegressor: 'validation_scores_'
< MiniBatchKMeans Parameter 'labels_ :' has an empty type spec. Remove the colon
> MiniBatchDictionaryLearning: 'random_state_'
< MiniBatchSparsePCA: 'mean_'
> MiniBatchKMeans: 'random_state_'
> MiniBatchSparsePCA: 'error_'
< NMF: 'n_components_'
< NearestNeighbors: 'effective_metric_'
< NearestNeighbors: 'effective_metric_params_'
< NeighborhoodComponentsAnalysis: 'random_state_'
> MultiTaskLassoCV: 'l1_ratio_'
> MultinomialNB: 'intercept_'
> NuSVR: 'classes_'
> OneClassSVM: 'classes_'
< OneHotEncoder: 'drop_idx_'
> OneVsOneClassifier: 'n_classes_'
> OneVsOneClassifier: 'pairwise_indices_'
> PassiveAggressiveClassifier: 'average_coef_'
> PassiveAggressiveClassifier: 'average_intercept_'
> PassiveAggressiveClassifier: 'standard_coef_'
> PassiveAggressiveClassifier: 'standard_intercept_'
> PassiveAggressiveRegressor: 'average_coef_'
> PassiveAggressiveRegressor: 'average_intercept_'
> PassiveAggressiveRegressor: 'standard_coef_'
> PassiveAggressiveRegressor: 'standard_intercept_'
> Perceptron: 'average_coef_'
> Perceptron: 'average_intercept_'
< QuadraticDiscriminantAnalysis: 'covariance_'
> Perceptron: 'standard_coef_'
> Perceptron: 'standard_intercept_'
> RFE: 'scores_'
< RadiusNeighborsClassifier: 'effective_metric_'
< RadiusNeighborsClassifier: 'effective_metric_params_'
< RadiusNeighborsClassifier: 'outputs_2d_'
< RadiusNeighborsRegressor: 'effective_metric_'
< RadiusNeighborsRegressor: 'effective_metric_params_'
> RFECV: 'scores_'
< RandomForestClassifier: 'oob_decision_function_'
< RandomForestClassifier: 'oob_score_'
< RandomForestRegressor: 'oob_prediction_'
< RandomForestRegressor: 'oob_score_'
> RandomForestRegressor: 'classes_'
> RandomForestRegressor: 'n_classes_'
> RandomTreesEmbedding: 'classes_'
> RandomTreesEmbedding: 'n_classes_'
< RidgeCV: 'cv_values_'
< RidgeClassifierCV: 'cv_values_'
< SGDRegressor: 'average_coef_'
< SGDRegressor: 'average_intercept_'
> SGDClassifier: 'average_coef_'
> SGDClassifier: 'average_intercept_'
> SGDClassifier: 'standard_coef_'
> SGDClassifier: 'standard_intercept_'
> SGDRegressor: 'standard_coef_'
> SGDRegressor: 'standard_intercept_'
> SVR: 'classes_'
> SimpleImputer: 'statistcs_'
< SparsePCA: 'mean_'
< SpectralCoclustering: 'biclusters_'
< SpectralEmbedding: 'n_neighbors_'
> SparseRandomProjection: 'n_component_'
> SparseRandomProjection: 'n_components_'
> SpectralEmbedding: 'gamma_'
> TfidfVectorizer: 'fixed_vocabulary_'

@jnothman
Member

One of the problems/features of the regex approach, if I'm not much mistaken, is that it sometimes checks for the attribute's presence in the docstring even if the class never sets that attribute.

@jnothman
Member

In cases of inheritance it often finds classes_ for non-classifiers.
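
For example (illustrative, reusing the get_attributes sketch from the earlier comment; the result matches the "> RandomForestRegressor: 'classes_'" line in the diff above):

from sklearn.ensemble import RandomForestRegressor

# The MRO walk includes forest base classes shared with the classifiers,
# whose source assigns self.classes_, so the regex reports classes_
# even though the regressor never exposes it.
print('classes_' in get_attributes(RandomForestRegressor))  # True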

@jnothman
Member

I think we should take Andy's approach in the code, but review and resolve the additions from the regex-based solution.

@thomasjpfan
Member

Currently the regex approach looks for text that looks like self.attribute_ or 'def attribute_'. If the source includes these patterns, it assumes it's an attribute.

In the case of SVR it finds classes_, because it is in the BaseLibSVM source, but it is only set for classifiers.

Given this, I think a mixture of the two approaches would cover most use cases: use Andy's approach, and only look at the source for self.attribute_ in the final estimator (don't parse parent classes). Andy's approach will already find the 'def attribute_' properties.
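
A rough sketch of what that combination could look like (names and details here are illustrative, not the PR's code):

import re
from inspect import getsource

# Assignments such as `self.coef_ = ...`; properties are left to the fit-based pass.
SELF_ATTR_RE = re.compile(r'self\.(?!_)(\w+_)\W')

def candidate_attributes(estimator, X, y):
    """Union of attributes seen after fit and regex matches in the final class only."""
    fitted = estimator.fit(X, y)
    # dir() also picks up `def attribute_` properties defined on the class.
    from_fit = {a for a in dir(fitted) if a.endswith('_') and not a.startswith('_')}
    # Only the estimator's own source, deliberately skipping parent classes,
    # so inherited-but-never-set attributes (e.g. classes_ on regressors) are ignored.
    from_source = set(SELF_ATTR_RE.findall(getsource(type(fitted))))
    return from_fit | from_source

Properties defined as def attribute_ are caught by dir() on the fitted instance, so the regex only needs to cover explicit self.attribute_ assignments, as proposed above.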

amueller added 4 commits July 24, 2019 11:25
# Conflicts:
#	sklearn/decomposition/fastica_.py
#	sklearn/neighbors/classification.py
#	sklearn/neighbors/regression.py
#	sklearn/svm/classes.py
@amueller
Member Author

amueller commented Mar 5, 2020

fixed in #16286

@amueller amueller closed this Mar 5, 2020