[MRG + 1] Add check for estimator: parameters not modified by fit #7846

Merged
merged 4 commits into from
Jan 20, 2017

Conversation

kiote
Contributor

@kiote kiote commented Nov 9, 2016

Reference Issue

Fixes #7763

What does this implement/fix? Explain your changes.

Adds a simple test to estimator_tests which checks that __dict__ does not gain any new non-private attributes after fit

Any other comments?

I didn't add any documentation for this; do I need to?
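
The check described above can be sketched as follows. This is a minimal, self-contained illustration with made-up names (`ToyEstimator`, `check_fit_adds_only_private_attributes` are not the actual scikit-learn helpers): snapshot `__dict__` before fit, then require that everything fit added is "private" by the project's convention (leading or trailing underscore).

```python
# Illustrative sketch of the proposed estimator check; the real check
# lives in scikit-learn's estimator test module and differs in detail.

class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha       # parameter set in __init__, no underscores

    def fit(self, X, y=None):
        self.coef_ = [0.0]       # fitted result: trailing underscore
        self._cache = {}         # internal state: leading underscore
        return self

def check_fit_adds_only_private_attributes(estimator, X, y=None):
    keys_before = set(estimator.__dict__)
    estimator.fit(X, y)
    added = set(estimator.__dict__) - keys_before
    for attr in added:
        assert attr.startswith('_') or attr.endswith('_'), (
            "Estimator adds public attribute %r during fit; attributes "
            "set by fit should start or end with '_'" % attr)

check_fit_adds_only_private_attributes(ToyEstimator(), X=[[0.0]], y=[0.0])
```

An estimator that sets a plain attribute such as `self.n_features` inside fit would fail this assertion.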

if hasattr(estimator, "n_clusters"):
    estimator.n_clusters = 1

set_random_state(estimator, 1)
Contributor Author

this part seems to be very repetitive, so maybe it's possible to refactor / extract method here.. not sure though

Member

sounds good

def check_fit2d_predict1d(name, Estimator):
-    # check by fitting a 2d array and prediting with a 1d array
+    # check by fitting a 2d array and predicting with a 1d array
Contributor Author

not relevant to the current commit, just caught my eye. Hope it's innocent enough!

Member

Sure that's fine :)

@kiote kiote changed the title Add check for estimator [WIP] Add check for estimator Nov 10, 2016
@kiote
Contributor Author

kiote commented Nov 10, 2016

there are a number of estimators which have not followed this rule from the beginning; that's the reason for the failing tests

for val in substracted_dicts:
    assert_true(val.startswith('_') or val.endswith('_'),
                ('Estimator sets invalid attributes during the fit method; '
                 'should either start with an _ or end with a _'))
Member

Could you please list the attributes that are set inappropriately here? Then the test logs will be more informative.

@jnothman jnothman changed the title [WIP] Add check for estimator [WIP] Add check for estimator: parameters not modified by fit Nov 10, 2016
estimator.fit(X, y)

dict_after_fit = estimator.__dict__
# leave only attributes which have been set by fit
Member

I think we'd also like to check that attributes set in __init__ were not modified by fit

Contributor Author

okay, that uncovers my total lack of knowledge of how things work 😕. I wonder how this sentence and this comment, #7553 (comment), work together?

Contributor Author

😿

Member

why the tears? Sorry response times are slow atm... I'm watching, but everything is just queued up for the moment!

Contributor Author

haha I understand! I'm just trying to figure out how to make this run. Your sentence saying "attributes set in __init__ were not modified by fit" and @amueller's comment from some time ago saying "We certainly change the __dict__ during fit" are conflicting in my mind, so I can't really continue here without resolving this conflict.

So the real question is: should we or shouldn't we modify __dict__ during fit for any given estimator?

Do I understand correctly, that we can add new attributes to __dict__ during fit, but we can't modify the old one (set by __init__)?

Member

In fit, we should only usually be modifying (usually adding, but not necessarily) attributes that start or end with _.

@kiote
Contributor Author

kiote commented Nov 18, 2016

Made the error message more informative, and also added one more test, checking that fit does not change non-private attributes. So with this we have two new tests: fit does not add new non-private attributes, and fit does not change any non-private attributes added before.

A lot of estimators violate these rules, though.

@amueller
Member

the trailing underscore is not about public or private; it's a scikit-learn convention.
Leading underscores indicate private attributes.
In __init__ we should only add attributes that have neither a leading nor a trailing underscore. In fit we should not change any of these, but we can add (or change, if we call fit repeatedly) attributes with a trailing underscore. Adding or changing private attributes is also fine.
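
This convention can be illustrated with a toy estimator (the class and names are made up, not from scikit-learn): `__init__` stores parameters verbatim with no underscores; fit never changes them, but may add trailing-underscore results and leading-underscore private state.

```python
# Illustrative sketch of the attribute-naming convention described above.

class ConventionalEstimator:
    def __init__(self, n_components=5):
        self.n_components = n_components     # plain parameter, untouched by fit

    def fit(self, X):
        # Do NOT clamp self.n_components in place; derive a new
        # trailing-underscore attribute instead.
        self.n_components_ = min(self.n_components, len(X[0]))
        self._fitted = True                  # private helper state is fine
        return self

est = ConventionalEstimator(n_components=5).fit([[1, 2, 3]])
assert est.n_components == 5    # parameter unchanged by fit
assert est.n_components_ == 3   # learned value carries the trailing _
```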

Member

@amueller amueller left a comment

This looks pretty good already :)


dict_after_fit = estimator.__dict__
# leave only attributes which have been set by fit
substracted_dicts_keys = [k for k in dict_after_fit.keys()
Member

I think you should check that they are the same before and after.

Contributor Author

yes, we check this in the previous test here

substracted_dicts_keys = [k for k in dict_after_fit.keys()
                          if k not in dict_before_fit.keys()]

for val in substracted_dicts_keys:
Member

This is the right test :)

def check_fit2d_predict1d(name, Estimator):
-    # check by fitting a 2d array and prediting with a 1d array
+    # check by fitting a 2d array and predicting with a 1d array
Member

Sure that's fine :)

        if not (attr.startswith('_') or attr.endswith('_'))]

for attr in public_attrs:
    assert_equal(dict_before_fit[attr], dict_after_fit[attr],
Member

why don't you just do this as part of the test above, instead of removing the entries in the dict? There is a lot of code duplication otherwise as you saw.

Contributor Author

@kiote kiote Nov 21, 2016

okay, we actually don't need the second test, as you pointed out before, because we already have the test that fit doesn't change __dict__. So it looks like I misled myself here :)



def check_fit_changes_private_attributes_only(name, Estimator):
    if name in ['GaussianProcess', 'GaussianProcessRegressor',
Member

This is pretty bad. Can you remove this line so that we can see the errors in continuous integration? These seem problematic and we might want to fix them.

Contributor Author

removed

@kiote
Contributor Author

kiote commented Nov 22, 2016

I stopped skipping those estimators. So now the ones which do not follow the rule "do not change public attributes" are failing in the CI.

@kiote
Contributor Author

kiote commented Nov 23, 2016

Should I try to do something with these estimators?

@jnothman
Member

These appear to be KeyErrors, not AssertionErrors. Could you please start by making the error message more informative as to what needs to be fixed? Then please list the errors in a comment here so that we can review if they are straightforward to fix.

@kiote
Contributor Author

kiote commented Nov 24, 2016

here:

| Estimator | Error |
| --- | --- |
| GaussianProcess | Estimator adds public attribute during the fit method. Estimators are only allowed to add private attributes either started with _ or ended with _ but X_mean added |
| GaussianProcessRegressor | ... rng added |
| GradientBoostingClassifier | ... n_features added |
| GradientBoostingRegressor | ... n_features added |
| GraphLassoCV | ... grid_scores added |
| LarsCV | ... fit_path added |
| LassoLarsCV | ... fit_path added |
| LassoLarsIC | ... fit_path added |
| PassiveAggressiveClassifier | ... loss_function added |
| Perceptron | ... loss_function added |
| SGDClassifier | ... loss_function added |
| TSNE | ... n_iter_final added |

@jnothman
Member

Thanks! That is a very useful summary. Will look into it soon.

@jnothman
Member

This is great, thanks @kiote, although I can tell that it's only erroring for one attribute even if multiple are set. Would be a good idea to list all added.

In my opinion:

| Estimator | Attribute | Solution |
| --- | --- | --- |
| GaussianProcess | X_mean | we could deprecate this and other similar attributes (X_std etc.), so that the tests pass, but the class is deprecated in its entirety. |
| GaussianProcessRegressor | rng | Deprecate for removal to a local variable, I think |
| GradientBoostingClassifier | n_features | deprecate and rename to _n_features; or abolish it and use len(self.feature_importances_) instead at prediction time. |
| GradientBoostingRegressor | n_features | " |
| GraphLassoCV | grid_scores | Deprecate and rename to grid_scores_. We may replace this with cv_results_ some time soon, anyway. |
| LarsCV | fit_path | Move initialisation to __init__, I think |
| LassoLarsCV | fit_path | " |
| LassoLarsIC | fit_path | " |
| PassiveAggressiveClassifier | loss_function | Deprecate and rename to loss_function_ |
| Perceptron | loss_function | " |
| SGDClassifier | loss_function | " |
| TSNE | n_iter_final | deprecate and rename to n_iter_ |

Individual PRs welcome. If you don't want to make that effort, please post this as one or more new issues so we can seek contributors.

@kiote
Contributor Author

kiote commented Nov 30, 2016

Thanks Joel!

To make this issue bounded somehow, I can suggest:

  1. As you said, add all attributes to error message instead of just one;
  2. New input may be needed after that;
  3. Make a deprecation message instead of error messages while testing, to be able to merge this PR;
  4. Create new issue saying "replace deprecation message with failing tests" or something like that and fix all estimators during that new issue (hopefully I'll be able to do that as well).

What do you think?

@jnothman
Member

I'd like to try and fix the issues as far as possible before merging this PR. It shouldn't be too hard to do most of those changes.

@kiote
Contributor Author

kiote commented Nov 30, 2016

okay then, I'm on it

@amueller
Member

amueller commented Nov 30, 2016

Maybe skip classes that are deprecated? (You can find out if something is deprecated by checking whether object.__init__ has an attribute _deprecated_original.) That would get rid of the GaussianProcess failure.
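
The suggested skip can be sketched as follows. scikit-learn's `@deprecated` decorator is emulated here so the example is self-contained; the `_deprecated_original` marker name is taken from the comment above (check sklearn.utils.deprecation in your scikit-learn version for the real behaviour).

```python
import warnings

# Minimal stand-in for scikit-learn's deprecated decorator: it wraps
# the callable and stashes the original on the wrapper.
def deprecated(fn):
    def wrapper(*args, **kwargs):
        warnings.warn("deprecated", DeprecationWarning)
        return fn(*args, **kwargs)
    wrapper._deprecated_original = fn
    return wrapper

class OldEstimator:
    @deprecated
    def __init__(self):
        pass

class CurrentEstimator:
    def __init__(self):
        pass

def is_deprecated(Estimator):
    # The marker attribute lets a test loop skip deprecated classes.
    return hasattr(Estimator.__init__, '_deprecated_original')

assert is_deprecated(OldEstimator)
assert not is_deprecated(CurrentEstimator)
```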

@amueller
Member

If only we had some way to easily rename attributes... *cough* futurepast *cough*. Maybe I should finish that at some point?

@kiote
Contributor Author

kiote commented Dec 1, 2016

After skipping deprecated estimators and printing all attributes, we have the same result as before, but without GaussianProcess and with

Estimators are only allowed to add private attributes either started with _ or ended with _ but rng, y_train_mean added

error for GaussianProcessRegressor. I can follow Joel's recommendations or just stop here and wait for Andreas :suspect:

@jnothman
Member

jnothman commented Dec 2, 2016 via email

@@ -564,6 +565,12 @@ def __init__(self, alphas=4, n_refinements=4, cv=None, tol=1e-4,
# The base class needs this for the score method
self.store_precision = True

@property
@deprecated("Attribute grid_scores was deprecated in version 0.__ and "
Member

@amueller amueller Dec 6, 2016

0.19 and 0.21
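
The renaming pattern used throughout this PR can be sketched as below, with the versions from the review comment (deprecated in 0.19, removed in 0.21). The real code applies scikit-learn's `@deprecated` decorator to a property; a plain property with an explicit warning is used here to keep the example self-contained, and the `GridModel` class is hypothetical.

```python
import warnings

class GridModel:
    def fit(self):
        self.grid_scores_ = [0.1, 0.2]   # new, convention-following name
        return self

    @property
    def grid_scores(self):
        # Old public name kept as a deprecated read-only alias.
        warnings.warn(
            "Attribute grid_scores was deprecated in version 0.19 and "
            "will be removed in 0.21. Use grid_scores_ instead.",
            DeprecationWarning)
        return self.grid_scores_

m = GridModel().fit()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert m.grid_scores == m.grid_scores_   # old name still works
assert caught and issubclass(caught[0].category, DeprecationWarning)
```

The same shape applies to `n_features`, `y_train_mean`, `loss_function`, and `n_iter_final` in the other hunks.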

@@ -930,6 +931,12 @@ def _check_initialized(self):
raise NotFittedError("Estimator not fitted, call `fit`"
" before making predictions`.")

@property
@deprecated("Attribute n_features was deprecated in version 0.__ and "
Member

0.19 and 0.21

return self._rng

@property
@deprecated("Attribute y_train_mean was deprecated in version 0.__ and "
Member

as above

@@ -337,6 +337,12 @@ def __init__(self, loss="hinge", penalty='l2', alpha=0.0001, l1_ratio=0.15,
self.classes_ = None
self.n_jobs = int(n_jobs)

@property
@deprecated("Attribute loss_function was deprecated in version 0.__ and "
Member

again

@@ -788,6 +789,12 @@ def _fit(self, X, skip_num_points=0):
neighbors=neighbors_nn,
skip_num_points=skip_num_points)

@property
@deprecated("Attribute n_iter_final was deprecated in version 0.__ and "
Member

again

@kiote
Contributor Author

kiote commented Dec 7, 2016

Set the right versions in the deprecation messages

@@ -850,25 +850,26 @@ def _check_params(self):
     if self.max_features == "auto":
         # if is_classification
         if self.n_classes_ > 1:
-            max_features = max(1, int(np.sqrt(self.n_features)))
+            max_features = max(1, int(np.sqrt(self._n_features)))
Contributor

I'm not an expert on that code, but it is important to keep n_features in memory.
Can we change the signature from _check_params(self) to _check_params(self, X, y)?

X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) can also be done in _check_params.
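
The suggested refactor can be sketched as follows (the `Booster` class and its members are hypothetical, not the actual gradient boosting code): `_check_params` receives X, so the feature count stays a local value instead of being stored as a public attribute during fit.

```python
import math

class Booster:
    def __init__(self, max_features="auto"):
        self.max_features = max_features

    def _check_params(self, X):
        # n_features is derived locally from X; no self.n_features is set.
        n_features = len(X[0])
        if self.max_features == "auto":
            return max(1, int(math.sqrt(n_features)))
        return self.max_features

assert Booster()._check_params([[0] * 9]) == 3
```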

@@ -587,14 +587,15 @@ class Lars(LinearModel, RegressorMixin):
sklearn.decomposition.sparse_encode

"""
method = 'lar'
Contributor

Is self.method used anywhere other than l678 and l699?
I don't like it defined like that.

Contributor

I know it is done in other part of the code but is it absolutely necessary to put it here ?

Contributor Author

we were discussing this change before with @jnothman, I suppose; it was his idea to move it here so we'd be able to add the new tests. He also pointed out that it's better to define it this way, so I'm a bit confused, honestly.

I didn't understand what l678 and l699 mean. I thought it means "line 678", but I'm not sure what I should see on those lines then 😕

Contributor

Sorry for that, I've missed the discussion indeed. Forget my comment :)

Member

It's not wonderful, but I don't think it's this PR's job to fix it. The problem derives from the use of inheritance, but non-use of super...__init__. In the current code, method is twice set, like this, as a class attribute (in the CV objects), and twice as an instance attribute. It might make more sense to have it as a private attribute, but this approach at least ensures a consistency that is not in master, and that the instance's __dict__ only has true parameters.
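
The pattern described above can be sketched like this (class names are illustrative, mirroring the Lars hierarchy under discussion): `method` lives as a class attribute, overridden in the subclass, so the instance `__dict__` holds only true parameters.

```python
class BaseLars:
    method = 'lar'               # class attribute, not per-instance state

    def __init__(self, alpha=1.0):
        self.alpha = alpha       # the only thing in the instance __dict__

class BaseLassoLars(BaseLars):
    method = 'lasso'             # subclass override, still class-level

est = BaseLassoLars()
assert est.method == 'lasso'
assert 'method' not in est.__dict__      # __dict__ has only true parameters
assert est.__dict__ == {'alpha': 1.0}
```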

Contributor

Ok I see. Thanks for the explanation @jnothman.

@@ -828,6 +829,7 @@ class LassoLars(Lars):
sklearn.decomposition.sparse_encode

"""
method = 'lasso'
Contributor

I don't like it.

Contributor

Sorry for that, I've missed the discussion indeed. Forget my comment :)

kiote added 2 commits January 19, 2017 10:25
  ensure that estimators only add private attributes and attributes with
  trailing _

  in cases when existing estimators don't follow this new rule, we deprecate the
  attributes and make them follow this rule

  see #7763
@kiote
Contributor Author

kiote commented Jan 19, 2017

Thanks for your review, but honestly I was trying to keep this PR as specific as possible, so with the changes you've pointed to (like changing the signature of _check_params) it would become much harder 😟

@tguillemot
Contributor

@kiote We can change that in another PR indeed. Thanks for your answers and your work.

LGTM

@jnothman
Member

jnothman commented Jan 19, 2017

@tguillemot, did you happen to check that deprecated attributes etc are no longer in use in e.g. examples? I can't remember if I did that check, but it was @amueller's last stated concern since he gave his premature +1.

@tguillemot
Contributor

I have missed one n_features from partial_dependence.py -- line 261

    if gbrt.n_features != X.shape[1]:
        raise ValueError('X.shape[1] does not match gbrt.n_features')

Can you replace it @kiote ?

@kiote
Contributor Author

kiote commented Jan 19, 2017

okay, I also noticed that I promised @amueller to rename _n_features to n_features_, but didn't! Sorry, will fix that too.

@jnothman jnothman merged commit be305ce into scikit-learn:master Jan 20, 2017
@jnothman
Member

Thanks a lot, @kiote!

@tguillemot
Contributor

Thx @kiote

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jul 26, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Aug 12, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 16, 2017
sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 16, 2017
sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 30, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
sebp added a commit to sebp/scikit-survival that referenced this pull request Nov 18, 2017
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request May 25, 2018

Successfully merging this pull request may close these issues.

ensure that estimators only add private attributes and attributes with trailing _
4 participants