-
-
Notifications
You must be signed in to change notification settings - Fork 26k
Tests for attribute documentation #13385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
+1000 on this one |
Nice! I've considered this, but never gone ahead and implemented. It's
obviously going to have false negatives, but very helpful!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we otherwise set up docstring checking in sklearn/tests/test_docstring_parameters.py and then only in some test runs. Was that for speed, or only to avoid a hard dependency on numpydoc?
@jnothman Didn't you implement that other test? This test is quite different from the existing tests as it needs to instantiate estimators. @thomasjpfan suggested we could avoid that by parsing the source but I think it might be fine to do it the way it's done now. Right now I'm worried more about false positives than false negatives, because it makes merging this hard/impossible. But I might be able to find a way around with adding enough exceptions? |
Here is an approach to parse the source to get the attributes: import re
from inspect import getsource, getmro
att_match = re.compile(r'self\.(?!_)(\w+_)\W')
def get_attributes(target_cls):
attributes = set()
for current_cls in getmro(target_cls):
if current_cls == BaseEstimator:
break
attributes |= set(att_match.findall(getsource(current_cls)))
return attributes
from sklearn.ensemble import BaggingClassifier
get_attributes(BaggingClassifier) This returns: {'base_estimator_',
'classes_',
'estimators_',
'estimators_features_',
'estimators_samples_',
'n_classes_',
'n_features_',
'oob_decision_function_',
'oob_score_'} |
thanks @TomDLT ! |
@thomasjpfan what's the difference between the two approaches?
|
This PR's approach calls My regex approach, looks at the source code of the class, and finds all strings that matches The major advantage to the parsing the source code is that it can find all the attributes while calling |
Sorry for being unclear. I didn't mean what's the difference in methodology.... I hoped you could just produce a list of the differences in extracted attributes. |
Here are the attributes found by using regex approach on the HEAD of this PR: https://pastebin.com/iT4m7pUV Here is what |
Well the approaches are obviously complementary.
|
One of the problems/features with the regex approach, if I'm not much mistaken, is it's sometimes checking the presence in the docstring, even if the class never sets that attribute. |
In cases of inheritance it's often finding |
I think we should take Andy's approach in the code, but review and resolve the additions from the regex-based solution. |
Currently the regex approach looking for text that looks like In the case of SVR it finds Given this, I think a mixture of the two approaches would cover most use cases. Use Andy’s approach and only look at the source for |
# Conflicts: # sklearn/decomposition/fastica_.py # sklearn/neighbors/classification.py # sklearn/neighbors/regression.py # sklearn/svm/classes.py
fixed in #16286 |
Not sure if this has been started before?
This adds a test to check that all public attributes are documented.
I also added some missing attributes. It might actually be tricky to make this pass, since some attributes are conditional on init parameter (maybe this could be figured out by parsing the docstring for "only present" or something like that?), and some are deprecated and not documented.
I think we should at least try to fix the errors. I can separate this into two PRs.
There's still 99 estimators with at least one undocumented attribute (but LogisticRegression ain't one).
Some are a bit suspicious, for example the ExtraTreeRegressor (and some other tree-based regressors) have
classes_
.Would be nice if someone ran the test and fixed these up. Current output is here:
https://pastebin.com/FtAECdE6