[MRG + 1] Add check for estimator: parameters not modified by fit
#7846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if hasattr(estimator, "n_clusters"):
    estimator.n_clusters = 1

set_random_state(estimator, 1)
this part seems very repetitive, so maybe it's possible to refactor / extract a method here... not sure though
sounds good
def check_fit2d_predict1d(name, Estimator):
-    # check by fitting a 2d array and prediting with a 1d array
+    # check by fitting a 2d array and predicting with a 1d array
not relevant to the current commit, just caught my eye. Hope it's innocent enough!
Sure that's fine :)
there are some estimators which don't follow this rule from the beginning; that's the reason for the failing tests
for val in substracted_dicts:
    assert_true(val.startswith('_') or val.endswith('_'),
                ('Estimator sets invalid attributes during the fit method; '
                 'they should either start with an _ or end with a _'))
Could you please list the attributes that are set inappropriately here? Then the test logs will be more informative.
estimator.fit(X, y)

dict_after_fit = estimator.__dict__
# leave only attributes which have been set by fit
I think we'd also like to check that attributes set in __init__ were not modified by fit.
okay, that uncovers my total lack of knowledge of how things work 😕 . I wonder how this sentence and this comment: #7553 (comment) work together?
😿
why the tears? Sorry response times are slow atm... I'm watching, but everything is just queued up for the moment!
haha I understand! Just trying to figure out how to make this run, and for me your sentence saying "attributes set in __init__ were not modified by fit" and @amueller's comment from some time ago saying "We certainly change the __dict__ during fit" are conflicting in my mind, so I can't really continue here without resolving this conflict.
So the real question is: should we or shouldn't we modify __dict__ during fit for any given estimator?
Do I understand correctly that we can add new attributes to __dict__ during fit, but we can't modify the old ones (set by __init__)?
In fit, we should usually only be modifying (usually adding, but not necessarily) attributes that start or end with _.
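As a minimal illustration of this convention (a made-up DummyEstimator, not scikit-learn code): __init__ only stores constructor parameters verbatim, and fit adds estimator state only via underscore-prefixed or underscore-suffixed attributes:

```python
class DummyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha              # constructor parameter: untouched by fit

    def fit(self, X, y):
        self._rng_state = 42            # private helper: leading underscore
        self.coef_ = [0.0] * len(X[0])  # learned state: trailing underscore
        return self

est = DummyEstimator(alpha=0.5)
before = set(est.__dict__)
est.fit([[1, 2], [3, 4]], [0, 1])
added = set(est.__dict__) - before
assert all(k.startswith('_') or k.endswith('_') for k in added)
assert est.alpha == 0.5  # parameters set in __init__ are unchanged
```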
Made the error message more informative, also added one more test. A lot of estimators violate this rule, though.
the trailing underscore is not about public or private; it's a scikit-learn convention.
This looks pretty good already :)
dict_after_fit = estimator.__dict__
# leave only attributes which have been set by fit
substracted_dicts_keys = [k for k in dict_after_fit.keys()
I think you should check that they are the same before and after.
yes, we check this in the previous test here
substracted_dicts_keys = [k for k in dict_after_fit.keys()
                          if k not in dict_before_fit.keys()]

for val in substracted_dicts_keys:
This is the right test :)
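To make the shape of this check concrete, here is a condensed standalone sketch; the helper name check_private_attrs_only and the toy GoodEstimator are made up for illustration, and the real check lives in scikit-learn's estimator_checks harness:

```python
def check_private_attrs_only(estimator, X, y):
    """Sketch of the check discussed above: fit may add
    underscore-prefixed/suffixed attributes, but must not add
    or modify public ones."""
    before = dict(estimator.__dict__)
    estimator.fit(X, y)
    after = estimator.__dict__

    # new attributes must be private-by-convention
    added = [k for k in after if k not in before]
    bad = [k for k in added if not (k.startswith('_') or k.endswith('_'))]
    assert not bad, 'fit added public attributes: %r' % bad

    # attributes set in __init__ must be unmodified
    for k, v in before.items():
        if not (k.startswith('_') or k.endswith('_')):
            assert after[k] == v, 'fit modified public attribute %r' % k


class GoodEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.coef_ = [0.0]
        return self

check_private_attrs_only(GoodEstimator(), [[1.0]], [0])  # passes silently
```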
                if not (attr.startswith('_') or attr.endswith('_'))]

for attr in public_attrs:
    assert_equal(dict_before_fit[attr], dict_after_fit[attr],
why don't you just do this as part of the test above, instead of removing the entries in the dict? There is a lot of code duplication otherwise as you saw.
okay, we actually don't need the second test, as you pointed out before, because we already have the test that fit doesn't change __dict__. So it looks like I misled myself here :)
def check_fit_changes_private_attributes_only(name, Estimator):
    if name in ['GaussianProcess', 'GaussianProcessRegressor',
This is pretty bad. Can you remove this line so that we can see the errors in continuous integration? These seem problematic and we might want to fix them.
removed
I uncommented the passing of the estimators, so now the ones which do not follow the rule "do not change public attributes" are failing in CI.
Should I try to do something with these estimators?
These appear to be here:
Thanks! That is a very useful summary. Will look into it soon.
This is great, thanks @kiote, although I can tell that it's only erroring for one attribute even if multiple are set. It would be a good idea to list all added attributes. In my opinion:
Individual PRs welcome. If you don't want to make that effort, please post this as one or more new issues so we can seek contributors.
Thanks Joel! To make this issue bounded somehow, I can suggest:
What do you think?
I'd like to try and fix the issues as far as possible before merging this PR. It shouldn't be too hard to do most of those changes.
okay, then I'm on that
Maybe skip classes that are deprecated? (you can find out if something is deprecated if
If only we had some way to easily rename attributes *cough* futurepast *cough*. Maybe I should finish that at some point?
After skipping deprecated estimators and printing all attributes we have the same result as before but without GaussianProcess and with an "Estimators are only allowed to add private attributes either started with _ or ended with _ but rng, y_train_mean added" error for GaussianProcessRegressor. I can follow Joel's recommendations or just stop here and wait for Andreas.
y_train_mean can become _y_train_mean. y_train_ is already stored.
@@ -564,6 +565,12 @@ def __init__(self, alphas=4, n_refinements=4, cv=None, tol=1e-4,
         # The base class needs this for the score method
         self.store_precision = True

+    @property
+    @deprecated("Attribute grid_scores was deprecated in version 0.__ and "
0.19 and 0.21
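The pattern applied in these diffs can be sketched as follows. Note the deprecated decorator below is a simplified hand-rolled stand-in for scikit-learn's real one, and Model/rng are illustrative names, not the actual estimator:

```python
import warnings

def deprecated(message):
    """Simplified stand-in for scikit-learn's `deprecated` decorator:
    warn, then delegate to the wrapped function."""
    def wrap(func):
        def inner(*args, **kwargs):
            warnings.warn(message, DeprecationWarning)
            return func(*args, **kwargs)
        return inner
    return wrap

class Model:
    def fit(self):
        self._rng = "rng-state"   # renamed from the public `rng`
        return self

    @property
    @deprecated("Attribute rng was deprecated in version 0.19 and "
                "will be removed in 0.21.")
    def rng(self):
        # the old public name still works, but warns and
        # proxies the new private attribute
        return self._rng

m = Model().fit()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = m.rng
assert value == "rng-state" and caught[0].category is DeprecationWarning
```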
@@ -930,6 +931,12 @@ def _check_initialized(self):
             raise NotFittedError("Estimator not fitted, call `fit`"
                                  " before making predictions`.")

+    @property
+    @deprecated("Attribute n_features was deprecated in version 0.__ and "
0.19 and 0.21
        return self._rng

    @property
    @deprecated("Attribute y_train_mean was deprecated in version 0.__ and "
as above
@@ -337,6 +337,12 @@ def __init__(self, loss="hinge", penalty='l2', alpha=0.0001, l1_ratio=0.15,
         self.classes_ = None
         self.n_jobs = int(n_jobs)

+    @property
+    @deprecated("Attribute loss_function was deprecated in version 0.__ and "
again
@@ -788,6 +789,12 @@ def _fit(self, X, skip_num_points=0):
                       neighbors=neighbors_nn,
                       skip_num_points=skip_num_points)

+    @property
+    @deprecated("Attribute n_iter_final was deprecated in version 0.__ and "
again
Set the right versions in the deprecation messages.
@@ -850,25 +850,26 @@ def _check_params(self):
         if self.max_features == "auto":
             # if is_classification
             if self.n_classes_ > 1:
-                max_features = max(1, int(np.sqrt(self.n_features)))
+                max_features = max(1, int(np.sqrt(self._n_features)))
I'm not an expert on that code, but it is important to keep n_features in memory.
Can we change the signature of _check_params(self) to _check_params(self, X, y)?
X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) can also be done in _check_params.
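A rough standalone sketch of the suggested signature change (a hypothetical Estimator class, much simplified from the gradient-boosting code under discussion; only the n_features derivation is shown):

```python
import numpy as np

class Estimator:
    def __init__(self, max_features="auto"):
        self.max_features = max_features

    def _check_params(self, X, y):
        # n_features is derived from X passed in here, instead of
        # being stored as a public attribute on the instance
        n_features = X.shape[1]
        if self.max_features == "auto":
            # trailing underscore: fit-time state, per the convention
            self.max_features_ = max(1, int(np.sqrt(n_features)))
        return X, y

    def fit(self, X, y):
        X, y = self._check_params(np.asarray(X), np.asarray(y))
        return self

est = Estimator().fit([[1, 2, 3, 4]], [0])
assert est.max_features_ == 2  # sqrt(4 features) == 2
```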
@@ -587,14 +587,15 @@ class Lars(LinearModel, RegressorMixin):
     sklearn.decomposition.sparse_encode

     """
+    method = 'lar'
Is self.method used in any place other than l678 and l699?
I don't like it defined like that.
I know it is done in other parts of the code, but is it absolutely necessary to put it here?
we were discussing this change before with @jnothman, I suppose; it was his idea to move it here, so we'll be able to make new tests. He also pointed out that it's better to define it this way, so I'm a bit confused, honestly.
I didn't understand what l678 and l699 mean; I thought it means "line 678", but I'm not sure what I should see on those lines then 😕
Sorry for that, I've missed the discussion indeed. Forget my comment :)
It's not wonderful, but I don't think it's this PR's job to fix it. The problem derives from the use of inheritance, but non-use of super(...).__init__. In the current code, method is set twice, like this, as a class attribute (in the CV objects), and twice as an instance attribute. It might make more sense to have it as a private attribute, but this approach at least ensures a consistency that is not in master, and that the instance's __dict__ only has true parameters.
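The class-attribute arrangement described above can be illustrated with a toy hierarchy (Base/Child are illustrative stand-ins for the Lars/LassoLars classes):

```python
class Base:
    method = 'lar'       # class attribute: not stored in the instance __dict__

class Child(Base):
    method = 'lasso'     # subclass shadows the class attribute with its own

b, c = Base(), Child()
assert (b.method, c.method) == ('lar', 'lasso')

# Neither instance carries `method` in its own __dict__, so the
# instance __dict__ holds only true constructor parameters.
assert 'method' not in b.__dict__ and 'method' not in c.__dict__
```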
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see. Thanks for the explanation @jnothman.
@@ -828,6 +829,7 @@ class LassoLars(Lars):
     sklearn.decomposition.sparse_encode

     """
+    method = 'lasso'
I don't like it.
Sorry for that, I've missed the discussion indeed. Forget my comment :)
Ensure that estimators only add private attributes and attributes with a trailing _. In cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow it. See #7763.
Thanks for your review, but honestly I was trying to keep this PR as specific as possible, so with the changes you've pointed to (like changing the signature of _check_params) it would be much harder 😟
@kiote We can change that in another PR indeed. Thanks for your answers and your work. LGTM
@tguillemot, did you happen to check that deprecated attributes etc. are no longer in use in e.g. examples? I can't remember if I did that check, but it was @amueller's last stated concern since he gave his premature +1.
I have missed one.
Can you replace it @kiote?
okay, I also noticed that I promised to rename
Thanks a lot, @kiote!
Thx @kiote
…ikit-learn#7846) ensure that estimators only add private attributes and attributes with trailing _ in cases when existing estimators don't follow this new rule, we deprecate the attributes and make them follow this rule
Reference Issue
Fixes #7763
What does this implement/fix? Explain your changes.
Add a simple test to estimator_tests, which checks that __dict__ does not have non-private attributes after fit
Any other comments?
didn't add any documentation for that, do I need to?