Skip to content

[MRG+1] Ensure delegated ducktyping in MetaEstimators #2854

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

jnothman
Copy link
Member

Supersedes #2019 to fix #1805, with a more readable (?) reimplementation, and an extension of the test to fix #2853.

This patch ensures that:

  • GridSearchCV
  • RandomizedSearchCV
  • Pipeline
  • RFE and
  • RFECV

have hasattr(metaest, method) == True iff the sub-estimator does for the set of standard methods:

  • inverse_transform
  • transform
  • predict
  • predict_proba
  • predict_log_proba
  • decision_function
  • score

(with some exceptions where delegation doesn't apply).

To fix #2853, hasattr must be True before the metaestimator is fit, and if the delegating method is called before fit, an exception will be raised, as with other un-fit estimators.

Some alternatives are posed here and below:

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 81096ef on jnothman:meta-ducktyping-new into 998a57a on scikit-learn:master.

@agramfort
Copy link
Member

isn't there simple and stupid solution to this problem? meta-ducktyping and decorators are a huge code complexity overhead.

@jnothman
Copy link
Member Author

Yes: don't use hasattr, and specify capabilities some more explicit way,
with a method to list capabilities, or an __instancecheck__.

On 14 February 2014 03:14, Alexandre Gramfort notifications@github.comwrote:

isn't there simple and stupid solution to this problem? meta-ducktyping
and decorators are a huge code complexity overhead.

Reply to this email directly or view it on GitHubhttps://github.com//pull/2854#issuecomment-34994426
.

@jnothman
Copy link
Member Author

And it's possible to implement without a custom descriptor, just using @property as I did in #2019. I thought giving it a name, as I've done here, and allowing the methods to be implemented as methods (not getters), documented the code better.

A final alternative that avoids decorators is implementing __hasattribute__ but I think that's worst of all.

@agramfort
Copy link
Member

I would not have expected that the GridSearch I initially wrote in 2010 would become such as beast (monster?). I am really -1 on this PR. There must be a simple solution. Why is using hasattr so bad for you?

@GaelVaroquaux
Copy link
Member

I would not have expected that the GridSearch I initially wrote in 2010 would
become such as beast (monster?).

It's really too big. It's a problem. The code is hard to follow and bugs
keep poping up.

I think that their is a general problem of scope and organization in the
model selection and grid search code. It is growing organically and it is
going to collapse under its own weight.

@jnothman
Copy link
Member Author

Why is using hasattr so bad for you?

hasattr only works with this sort of patch, sorry to say. Hasattr is all fine when the functional attributes of a class do not change depending on its instance. See #1805.

The code is hard to follow and bugs keep poping up.

Not bugs so much as edge cases.

I think that their is a general problem of scope and organization in the model selection and grid search code.

I think some recent refactoring has benefited, but there is certainly room for improvement. But things like scorers which are really there to patch over issues in API design certainly help to make appearances of a mess.

@jnothman
Copy link
Member Author

I would not have expected that the GridSearch I initially wrote in 2010 would become such as beast (monster?).

Your original implementation used:

        if hasattr(best_estimator, 'predict'):
            self.predict = best_estimator.predict
        if hasattr(best_estimator, 'score'):
            self.score = best_estimator.score

Good luck pickling that object. Identify a better serialisation method and the code can be cleaner.

@jnothman
Copy link
Member Author

This is a bug I've been considering how to fix for most of a year. If there "must be a simple solution", @agramfort, please help me find it, rather than rejecting for cosmetic/sentimental reasons.

@agramfort
Copy link
Member

rather than having GridSearchCV adding constraints on the estimators implementations, which ultimately creates frameworkish code, I would patch GridSearchCV with properties. It might be more verbose / less elegant but your solution, although pythonic, has the downside that new comers who want to contribute need to be python experts. scikit-learn needs more people able to write algorithms and boost training speed than python experts.

so I would have done:

@ property
def predict(self, X):
      return self.best_estimator_.predict(X)

by reading this I know what it's doing immediately.

@jnothman
Copy link
Member Author

Thanks. I wrote this descriptor because it made the property's purpose clearer by giving it a name. I'm happy to remove it, but your solution doesn't work.

I think what you meant to write is:

@property
def predict(self, X):
      return self.best_estimator_.predict

This is what we have in master (and which I wrote when I thought it was sufficient, but still unclear). But #2853 is broken still: hasattr(gscv, 'predict') must return True as long as gscv.estimator, not gscv.best_estimator_, has predict.

So a minimal fix with property is:

@property
def predict(self):
    self.estimator.predict  # determine hasattr response
    return lambda X: self.best_estimator_.predict(X)

Although explicit, I find this exceedingly verbose and hard to understand both its effect and its motivation at a glance. But it certainly works, and I'm happy to use this solution if there's some consensus that it's the better choice.

@mblondel
Copy link
Member

self.estimator.predict # determine hasattr response

I'm not sure if this line is gonna work. Python may optimize it as a no-op.

@jnothman
Copy link
Member Author

Is there anywhere you know of that suggests attribute lookup side-effects shouldn't be taken for granted?

I don't have a problem with the following equivalent:

@property
def predict(self):
    if hasattr(self.estimator, 'predict'):
        return lambda X: self.best_estimator_.predict(X)
    else:
        raise AttributeError('estimator cannot predict')

If that suits @agramfort, I'm happy to implement it, but it is somewhat repetitious.

@jnothman
Copy link
Member Author

The main problem with this approach is it seems an awkward way to write it. A custom decorator makes it look a bit magic, but it seems to me a bit more straightforward to read.

@larsmans
Copy link
Member

@mblondel It isn't a no-op since it can raise an exception, so Python cannot optimize it away without breaking its own semantics (this is why writing optimizing Python compilers is hard). I actually find that solution quite readable.

@jnothman
Copy link
Member Author

Which one is "that one"? I guess I need to number them now...

On 18 February 2014 20:49, Lars Buitinck notifications@github.com wrote:

@mblondel https://github.com/mblondel It isn't a no-op since it can
raise an exception, so Python cannot optimize it away without breaking its
own semantics (this is why writing optimizing Python compilers is hard). I
actually find that solution quite readable.

Reply to this email directly or view it on GitHubhttps://github.com//pull/2854#issuecomment-35367876
.

@larsmans
Copy link
Member

This one.

@property
def predict(self):
    self.estimator.predict  # determine hasattr response
    return lambda X: self.best_estimator_.predict(X)

Maybe change the comment to "trigger AttributeError before call"

@jnothman
Copy link
Member Author

I also should note that a solution like that's not entirely appropriate in Pipeline where lambda can't be used.

And I don't think "trigger AttributeError before call" is sufficient for people who don't realise triggering AttributeError will affect the hasattr response for a property... which is most Python devs, I would guess.

@agramfort
Copy link
Member

awkward but explicit and simple :) magic often ends up bringing more problems than expected

@jnothman
Copy link
Member Author

I think that depends on what you mean by explicit. I think we're talking
about a confusing property of Python attribute access, as a result of which
the behaviour is most explicit by naming a decorator. But that's my
opinion, which seems in the minority.

On 18 February 2014 21:04, Alexandre Gramfort notifications@github.comwrote:

awkward but explicit and simple :) magic often ends up bringing more
problems than expected

Reply to this email directly or view it on GitHubhttps://github.com//pull/2854#issuecomment-35368948
.

@mblondel
Copy link
Member

It seems like @agramfort and @larsmans like the solution you suggested in #2854 (comment). Instead of abstracting things away, perhaps you should use this solution wherever needed (at the cost of some code duplication).

I also should note that a solution like that's not entirely appropriate in Pipeline where lambda can't be used.

Can't you use private methods in this case?

@jnothman
Copy link
Member Author

Can't you use private methods in this case?

Yes, in one of my previous attempts to fix this bug in Pipeline I did exactly that, using a helper method, and received a critique along the lines of "flat is better than nested".

I think it has to be admitted that someone will dislike the solution to this because it can't be done super-neatly. Either we:

  • give up support for ducktyping and indicate an estimator's traits some other way (bad in terms of compatibility/extensibility)
  • use property as above (and as in my previous PR), the main cosmetic disadvantages being:
    • the reason for requiring a property (rather than what was used in the original implementation of GridSearchCV et al) remains implicit, or else needs to be documented in each application;
    • by returning a method rather than implementing the method directly, it looks like an unidiomatic method definition.
  • use a custom descriptor that can be named to make its purpose clear, the main disadvantage of which is that the precise operation of that descriptor is locally unclear (but documented elsewhere).

Ultimately I think we need to fix this bug, and not worry too much about the cosmetics until someone can come up with a much better solution.

@jnothman jnothman force-pushed the meta-ducktyping-new branch from 81096ef to 52f2812 Compare November 13, 2014 10:34
@jnothman
Copy link
Member Author

(rebased)

@larsmans larsmans changed the title [MRG] Ensure delegated ducktyping in MetaEstimators [MRG+1] Ensure delegated ducktyping in MetaEstimators Dec 15, 2014
@larsmans
Copy link
Member

+1 for merge. I just got bit by this again, and it needs fixing.

from .externals.six import iteritems

__all__ = ['Pipeline', 'FeatureUnion']


# One round of beers on me if someone finds out why the backslash
# is needed in the Attributes section so as not to upset sphinx.
iff_final_estimator_has_method = make_delegation_decorator('_final_estimator')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to give this name instead of just using the function?

@amueller
Copy link
Member

I am also in favor of the general approach, but I am not 100% sure I understand why the implementation looks like it does.

class _IffHasAttrDescriptor(object):
def __init__(self, fn, attr):
self.fn = fn
self.get_attr = attrgetter(attr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the benefit of this way compared to storing attr and calling getattr(obj, attr)?

@amueller
Copy link
Member

It seems to me you can simplify _iff_attr_has_method to

def _iff_attr_has_method(prefix, fn):
    attr = '%s.%s' % (prefix, fn.__name__)
    out = _IffHasAttrDescriptor(fn, attr)
    update_wrapper(out, fn)
    return out

Other than that 👍 for merge from me, too.
I find it a bit odd that we need to call update_wrapper twice, but I don't see away around it.
Maybe say that _IffHasAttrDescriptor implements the descriptor protocol.

@amueller
Copy link
Member

Actually, the update_wrapper in _if_attr_has_method is not necessary, is it?

@amueller
Copy link
Member

Doing

return lambda fn: _IffHasAttrDescriptor(fn, '%s.%s' % (prefix, fn.__name__))

in make_delegation_decorator and dropping _iff_attr_has_method seems more readable to me.

@amueller
Copy link
Member

@jnothman what do you think about that simplification?
I can do it if you like.

return out


def _iff_attr_has_method(prefix, fn, name=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that this function is used only once. Thus maybe not all the genericity is required (for instance, name is always None, so it can be removed). I think that by focusing on little things like this, the code can be made simpler.

@GaelVaroquaux
Copy link
Member

Lots of great ideas here, and impressive feat of design.

However, I must say that I am worried about the code of the PR: it really is too much meta-programming code, that is hard to understand (I spotted in the discussion at least one person who said he was not following well).

I think that going along this direction is dangerous, and I am -1 on merging. This is the kind of code that we wrote a lot in mayavi2. The core devs (2 people, including me) were really enjoying it. We had a lot of features from such a meta-programming model. That got us a lot of users. But nobody converted to being a core dev. Such type of code is not good for including new people.

To give specific general advice, I would strive to: avoid decorators as much as possible (there will be some, in the final solution), avoid 'partial', and functools as much as possible.

More specifically, I have the feeling that some indirection can be avoided.

For instance, a line like 'iff_estimator_has_method = make_delegation_decorator('estimator')' that actually defines a a decorator using a factory, which is in a different file, is very hard to follow. Multiple indirections are just something that the human brain is not good at.

_IffHasAttrDescriptor is a class, that is used only once, inside _iff_attr_has_method, that is itself used only once, inside make_delegation_decorator (@amueller made a suggestion that removes an indirection, actually).

Here is a high-level suggestion to make the design easier: instead of decorating existing method, rename them: predict_proba would become _predict_proba. Then you can write a simpler property, because it has less work to do. Maybe it is enough to write it explicitely:

    @property
    def predict_proba(self):
            self.final_estimator.predict_proba
            return self._predict_proba
    def _predict_proba(self, ...)

I can see that the argument against this is to avoid repeating oneself: variants of these lines of code will be in many places in the code base. But, to replace these 4 lines of fairly simple code by 1 line (thus gaining only 3 lines), this PR adds a very non trivial file with 133. Thus, in terms of mere lines of code, the pattern should occur 44 times for the meta-programming to be worth it. This count is actually probably dishonest, because the code above needs at least one line of comment. However, I think that even if it happened more, I would still prefer the low-tech solution, because it is easier to understand and maintain.

@amueller
Copy link
Member

@GaelVaroquaux are you talking about the tests?
The implementation is currently 60 lines, which are mostly documentation. with my suggestion the implementation boils down to 10 lines of actual implementation.

@GaelVaroquaux
Copy link
Member

Doh! You are write, I read the wrong file. Those tests are actually super useful, they need to be kept.

Maybe I'd like to see you suggestion coded. I don't have an exact feeling of how much it clears things up.

@ogrisel
Copy link
Member

ogrisel commented Dec 18, 2014

I am also undecided on the current solution. The code is compact, well tested but hard to follow.

Using @GaelVaroquaux's explicit (property + private method) pairs on the other hand, while explicit introduce many redundant lines. I am also curious to see how the simplifications suggested by @amueller would look.

Also I wonder if introducing a generic fixed name such as @iff_delegate_has_method or even just @if_delegate_has_method would not make the name-based introspection magic go away to make this wrapper code easier to follow while the name of the decorator would still be explicit enough.

I agree that in the case of the pipeline, the fact that delegate is the final estimator of the chain is not trivial, but this can be made explicit just by adding an inline comment.

Edit: sorry the above does not make sense.

@amueller
Copy link
Member

See #3982.
I haven't understood why we need attr_getter yet :-/
ok never mind, it is simply for resolving the .

@jnothman
Copy link
Member Author

We don't need attrgetter. We should probably just store attr on the descriptor, and then do getattr(obj, attr). Its purpose is to raise an appropriate AttributeError. It could be written more explicitly (but perhaps not passing on as much information):

if not hasattr(obj, self.attr):
    raise AttributeError

@amueller
Copy link
Member

As far as I know, that doesn't work.
attr is something like estimator.decision_function, and hasattr and getattr don't work with nested names. You would need to split on the dot and do two calls, as far as I can see.

@amueller
Copy link
Member

Curiously the decision_function of pipelines is documented without call signature, while the decision_function of GridSearchCV has decision_function(*args, **kwargs)
The link to the source points to https://github.com/scikit-learn/scikit-learn/blob/a1510a1/sklearn/utils/metaestimators.py#L37 for both, unfortunately. I'm not sure why update_wrapper didn't take care of this.

@jnothman
Copy link
Member Author

Oh of course. The attrgetter was intended to handle nested names. Forgot.

And yes, this sphinx thing is has the potential to get messy :(

On 20 December 2014 at 07:56, Andreas Mueller notifications@github.com
wrote:

Curiously the decision_function of pipelines is documented without call
signature, while the decision_function of GridSearchCV has decision_function(_args,
*_kwargs)
The link to the source points to
https://github.com/scikit-learn/scikit-learn/blob/a1510a1/sklearn/utils/metaestimators.py#L37
for both, unfortunately. I'm not sure why update_wrapper didn't take care
of this.


Reply to this email directly or view it on GitHub
#2854 (comment)
.

@larsmans
Copy link
Member

Superseded by #3982, which was just merged.

@larsmans larsmans closed this Dec 27, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: check_scoring fails for GridSearchCV Problem with ducktyping and meta-estimators
8 participants