
Tests for attribute documentation #13385


Closed
wants to merge 8 commits

Conversation

amueller
Member

@amueller amueller commented Mar 4, 2019

Not sure if this has been started before?
This adds a test to check that all public attributes are documented.
I also added some missing attributes. It might actually be tricky to make this pass, since some attributes are conditional on init parameters (maybe this could be figured out by parsing the docstring for "only present" or something like that?), and some are deprecated and not documented.

I think we should at least try to fix the errors. I can separate this into two PRs.
There's still 99 estimators with at least one undocumented attribute (but LogisticRegression ain't one).

Some are a bit suspicious, for example the ExtraTreeRegressor (and some other tree-based regressors) have classes_.

Would be nice if someone ran the test and fixed these up. Current output is here:
https://pastebin.com/FtAECdE6
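
For anyone who wants to reproduce this locally, here is a minimal, illustrative sketch of a fit-and-check test of this kind (not the code in this PR; it assumes numpydoc is installed and uses LogisticRegression with a toy dataset purely as an example):

import numpy as np
from numpydoc.docscrape import ClassDoc
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).rand(20, 3)
y = np.array([0, 1] * 10)
est = LogisticRegression().fit(X, y)

# Public fitted attributes: names ending in an underscore, not starting with one.
fitted_attrs = {a for a in vars(est) if a.endswith("_") and not a.startswith("_")}

# Attributes listed in the "Attributes" section of the class docstring (via numpydoc).
documented = {name for name, _, _ in ClassDoc(type(est))["Attributes"]}

undocumented = fitted_attrs - documented
assert not undocumented, f"Undocumented attributes: {undocumented}"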

@agramfort
Member

+1000 on this one

@jnothman
Member

jnothman commented Mar 5, 2019 via email

Member

@jnothman jnothman left a comment


Note that we otherwise set up docstring checking in sklearn/tests/test_docstring_parameters.py, and run it only in some test runs. Was that for speed, or only to avoid a hard dependency on numpydoc?

@amueller
Member Author

amueller commented Mar 5, 2019

@jnothman Didn't you implement that other test? This test is quite different from the existing tests as it needs to instantiate estimators. @thomasjpfan suggested we could avoid that by parsing the source but I think it might be fine to do it the way it's done now.

Right now I'm worried more about false positives than false negatives, because they make merging this hard/impossible. But I might be able to find a way around that by adding enough exceptions?

@thomasjpfan
Member

thomasjpfan commented Mar 5, 2019

Here is an approach to parse the source to get the attributes:

import re
from inspect import getsource, getmro

from sklearn.base import BaseEstimator  # needed for the MRO stopping condition

# Match assignments like `self.coef_ = ...`, skipping private names such as `self._tree`.
att_match = re.compile(r'self\.(?!_)(\w+_)\W')

def get_attributes(target_cls):
    attributes = set()
    # Walk the MRO so attributes assigned in intermediate base classes are included,
    # stopping before BaseEstimator itself.
    for current_cls in getmro(target_cls):
        if current_cls is BaseEstimator:
            break
        attributes |= set(att_match.findall(getsource(current_cls)))
    return attributes

from sklearn.ensemble import BaggingClassifier
get_attributes(BaggingClassifier)

This returns:

{'base_estimator_',
 'classes_',
 'estimators_',
 'estimators_features_',
 'estimators_samples_',
 'n_classes_',
 'n_features_',
 'oob_decision_function_',
 'oob_score_'}

@amueller
Member Author

amueller commented Mar 5, 2019

thanks @TomDLT !

@jnothman
Member

jnothman commented Mar 7, 2019 via email

@thomasjpfan
Member

This PR's approach calls fit and then looks at the attributes of the fitted estimator, checking whether their names end in _.

My regex approach looks at the source code of the class and finds all strings that match 'self\.(\w+_)\W' (starts with self., ends with _).

The major advantage of parsing the source code is that it can find all the attributes, while calling fit may not find them all. For example, the attribute explained_variance_ratio_ in LinearDiscriminantAnalysis is only defined for certain solvers.
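
For illustration (an editorial sketch, not code from this PR; solver behaviour may differ between scikit-learn versions):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.RandomState(0).rand(20, 4)
y = np.array([0, 1] * 10)

# With solver='lsqr' the attribute is never set, so a fit-based scan misses it.
lsqr = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)
print(hasattr(lsqr, "explained_variance_ratio_"))   # False

# With solver='eigen' it is set, so the same scan would report it.
eigen = LinearDiscriminantAnalysis(solver="eigen").fit(X, y)
print(hasattr(eigen, "explained_variance_ratio_"))  # True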

@jnothman
Member

Sorry for being unclear. I didn't mean to ask what the difference in methodology is... I hoped you could just produce a list of the differences in extracted attributes.

@thomasjpfan
Member

thomasjpfan commented Mar 11, 2019

Here are the attributes found by the regex approach on the HEAD of this PR: https://pastebin.com/iT4m7pUV

Here is what check_attribute_docstrings looks like with the regex approach.

@jnothman
Member

jnothman commented Mar 12, 2019

Well the approaches are obviously complementary. > means found by the regex method only; < means found by the fit-and-check method only. I've not yet analysed for false positives (which should only occur for the regex method).

> ARDRegression: 'X_offset_'
> ARDRegression: 'X_scale_'
< BaggingClassifier: 'oob_decision_function_'
< BaggingClassifier: 'oob_score_'
< BaggingRegressor: 'oob_prediction_'
< BaggingRegressor: 'oob_score_'
< BernoulliNB: 'intercept_'
> BernoulliRBM: 'random_state_'
> CheckingClassifier: 'classes_'
< ComplementNB: 'intercept_'
> CountVectorizer: 'fixed_vocabulary_'
< ExtraTreesClassifier: 'oob_decision_function_'
< ExtraTreesClassifier: 'oob_score_'
< ExtraTreesRegressor: 'oob_prediction_'
< ExtraTreesRegressor: 'oob_score_'
> ExtraTreesRegressor: 'classes_'
> ExtraTreesRegressor: 'n_classes_'
> GaussianRandomProjection: 'n_component_'
> GaussianRandomProjection: 'n_components_'
> GradientBoostingClassifier: 'base_estimator_'
< GradientBoostingClassifier: 'oob_improvement_'
> GradientBoostingRegressor: 'base_estimator_'
> GradientBoostingRegressor: 'classes_'
< GradientBoostingRegressor: 'oob_improvement_'
> IsolationForest: 'oob_score_'
> IsotonicRegression: 'increasing_'
< KNeighborsClassifier: 'classes_'
< KNeighborsClassifier: 'effective_metric_'
< KNeighborsClassifier: 'effective_metric_params_'
< KNeighborsClassifier: 'outputs_2d_'
< KNeighborsRegressor: 'effective_metric_'
< KNeighborsRegressor: 'effective_metric_params_'
< KernelPCA: 'X_transformed_fit_'
< KernelPCA: 'dual_coef_'
> LassoCV: 'l1_ratio_'
> LassoLarsIC: 'active_'
> LassoLarsIC: 'coef_path_'
< LinearDiscriminantAnalysis: 'covariance_'
> MLPClassifier: 'best_validation_score_'
> MLPClassifier: 'validation_scores_'
> MLPRegressor: 'best_validation_score_'
< MeanShift Parameter 'labels_ :' has an empty type spec. Remove the colon
> MLPRegressor: 'validation_scores_'
< MiniBatchKMeans Parameter 'labels_ :' has an empty type spec. Remove the colon
> MiniBatchDictionaryLearning: 'random_state_'
< MiniBatchSparsePCA: 'mean_'
> MiniBatchKMeans: 'random_state_'
> MiniBatchSparsePCA: 'error_'
< NMF: 'n_components_'
< NearestNeighbors: 'effective_metric_'
< NearestNeighbors: 'effective_metric_params_'
< NeighborhoodComponentsAnalysis: 'random_state_'
> MultiTaskLassoCV: 'l1_ratio_'
> MultinomialNB: 'intercept_'
> NuSVR: 'classes_'
> OneClassSVM: 'classes_'
< OneHotEncoder: 'drop_idx_'
> OneVsOneClassifier: 'n_classes_'
> OneVsOneClassifier: 'pairwise_indices_'
> PassiveAggressiveClassifier: 'average_coef_'
> PassiveAggressiveClassifier: 'average_intercept_'
> PassiveAggressiveClassifier: 'standard_coef_'
> PassiveAggressiveClassifier: 'standard_intercept_'
> PassiveAggressiveRegressor: 'average_coef_'
> PassiveAggressiveRegressor: 'average_intercept_'
> PassiveAggressiveRegressor: 'standard_coef_'
> PassiveAggressiveRegressor: 'standard_intercept_'
> Perceptron: 'average_coef_'
> Perceptron: 'average_intercept_'
< QuadraticDiscriminantAnalysis: 'covariance_'
> Perceptron: 'standard_coef_'
> Perceptron: 'standard_intercept_'
> RFE: 'scores_'
< RadiusNeighborsClassifier: 'effective_metric_'
< RadiusNeighborsClassifier: 'effective_metric_params_'
< RadiusNeighborsClassifier: 'outputs_2d_'
< RadiusNeighborsRegressor: 'effective_metric_'
< RadiusNeighborsRegressor: 'effective_metric_params_'
> RFECV: 'scores_'
< RandomForestClassifier: 'oob_decision_function_'
< RandomForestClassifier: 'oob_score_'
< RandomForestRegressor: 'oob_prediction_'
< RandomForestRegressor: 'oob_score_'
> RandomForestRegressor: 'classes_'
> RandomForestRegressor: 'n_classes_'
> RandomTreesEmbedding: 'classes_'
> RandomTreesEmbedding: 'n_classes_'
< RidgeCV: 'cv_values_'
< RidgeClassifierCV: 'cv_values_'
< SGDRegressor: 'average_coef_'
< SGDRegressor: 'average_intercept_'
> SGDClassifier: 'average_coef_'
> SGDClassifier: 'average_intercept_'
> SGDClassifier: 'standard_coef_'
> SGDClassifier: 'standard_intercept_'
> SGDRegressor: 'standard_coef_'
> SGDRegressor: 'standard_intercept_'
> SVR: 'classes_'
> SimpleImputer: 'statistcs_'
< SparsePCA: 'mean_'
< SpectralCoclustering: 'biclusters_'
< SpectralEmbedding: 'n_neighbors_'
> SparseRandomProjection: 'n_component_'
> SparseRandomProjection: 'n_components_'
> SpectralEmbedding: 'gamma_'
> TfidfVectorizer: 'fixed_vocabulary_'

@jnothman
Member

One of the problems/features of the regex approach, if I'm not much mistaken, is that it sometimes checks for the attribute's presence in the docstring even if the class never sets that attribute.

@jnothman
Member

In cases of inheritance it often finds classes_ for non-classifiers.
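
For example (illustrative, reusing the get_attributes sketch from the earlier comment; the result matches the "> RandomForestRegressor: 'classes_'" line in the diff above):

from sklearn.ensemble import RandomForestRegressor

# The MRO walk includes forest base classes shared with the classifiers,
# whose source assigns self.classes_, so the regex reports classes_
# even though the regressor never exposes it.
print('classes_' in get_attributes(RandomForestRegressor))  # True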

@jnothman
Member

I think we should take Andy's approach in the code, but review and resolve the additions from the regex-based solution.

@thomasjpfan
Member

Currently the regex approach looks for text that looks like self.attribute_ or 'def attribute_'. If the source includes these patterns, it assumes it's an attribute.

In the case of SVR it finds classes_, because it is in the BaseLibSVM source, but it is only set for classifiers.

Given this, I think a mixture of the two approaches would cover most use cases: use Andy's approach, and only look at the source for self.attribute_ in the final estimator (don't parse parent classes). Andy's approach will already find the 'def attribute_' properties.
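
A rough sketch of what that combination could look like (names and details here are illustrative, not the PR's code):

import re
from inspect import getsource

# Assignments such as `self.coef_ = ...`; properties are left to the fit-based pass.
SELF_ATTR_RE = re.compile(r'self\.(?!_)(\w+_)\W')

def candidate_attributes(estimator, X, y):
    """Union of attributes seen after fit and regex matches in the final class only."""
    fitted = estimator.fit(X, y)
    # dir() also picks up `def attribute_` properties defined on the class.
    from_fit = {a for a in dir(fitted) if a.endswith('_') and not a.startswith('_')}
    # Only the estimator's own source, deliberately skipping parent classes,
    # so inherited-but-never-set attributes (e.g. classes_ on regressors) are ignored.
    from_source = set(SELF_ATTR_RE.findall(getsource(type(fitted))))
    return from_fit | from_source

Properties defined as def attribute_ are caught by dir() on the fitted instance, so the regex only needs to cover explicit self.attribute_ assignments, as proposed above.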

amueller added 4 commits July 24, 2019 11:25
# Conflicts:
#	sklearn/decomposition/fastica_.py
#	sklearn/neighbors/classification.py
#	sklearn/neighbors/regression.py
#	sklearn/svm/classes.py
@amueller
Member Author

amueller commented Mar 5, 2020

fixed in #16286

@amueller amueller closed this Mar 5, 2020