[MRG+1] Ensure delegated ducktyping in MetaEstimators #2854

jnothman · 2014-02-12T10:02:00Z

Supersedes #2019 to fix #1805, with a more readable (?) reimplementation, and an extension of the test to fix #2853.

This patch ensures that:

GridSearchCV
RandomizedSearchCV
Pipeline
RFE and
RFECV

have hasattr(metaest, method) == True iff the sub-estimator does for the set of standard methods:

inverse_transform
transform
predict
predict_proba
predict_log_proba
decision_function
score

(with some exceptions where delegation doesn't apply).

To fix #2853, hasattr must be True before the metaestimator is fit, and if the delegating method is called before fit, an exception will be raised, as with other un-fit estimators.

Some alternatives are posed here and below:

(A) custom descriptor as in this PR (https://github.com/scikit-learn/scikit-learn/pull/2854/files#diff-589a953cf2ee991da252f7da86c1e5b9R176)
(B) property with access and nested function (#2854 (comment))
(C) property with if hasattr and nested function (#2854 (comment))
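
To make the target behaviour concrete, here is a minimal sketch of the check this patch is meant to satisfy. `ToyClassifier`, `NaiveMeta` and `delegation_mismatches` are hypothetical names invented for illustration and are not part of scikit-learn; `NaiveMeta` mimics a metaestimator that copies methods onto the instance at fit time, which is the kind of behaviour that produces the #2853 symptom.

```python
STANDARD_METHODS = (
    "inverse_transform", "transform", "predict", "predict_proba",
    "predict_log_proba", "decision_function", "score",
)


def delegation_mismatches(metaest, sub_estimator, methods=STANDARD_METHODS):
    """Return the methods for which hasattr disagrees between the two objects."""
    return [m for m in methods
            if hasattr(metaest, m) != hasattr(sub_estimator, m)]


class ToyClassifier:
    """Hypothetical sub-estimator that only supports fit and predict."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class NaiveMeta:
    """Hypothetical metaestimator that copies methods onto itself in fit."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.best_estimator_ = self.estimator.fit(X, y)
        if hasattr(self.best_estimator_, "predict"):
            self.predict = self.best_estimator_.predict
        return self


meta = NaiveMeta(ToyClassifier())
before_fit = delegation_mismatches(meta, meta.estimator)
meta.fit([[0]], [0])
after_fit = delegation_mismatches(meta, meta.estimator)
print(before_fit, after_fit)
```

In this toy setup the mismatch only appears before `fit`, which is why the patch requires `hasattr` to be correct on an unfitted metaestimator.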

coveralls · 2014-02-12T10:11:23Z

Coverage remained the same when pulling 81096ef on jnothman:meta-ducktyping-new into 998a57a on scikit-learn:master.

agramfort · 2014-02-13T16:14:49Z

isn't there a simple and stupid solution to this problem? meta-ducktyping and decorators are a huge code complexity overhead.

jnothman · 2014-02-13T20:40:43Z

Yes: don't use hasattr, and specify capabilities some more explicit way,
with a method to list capabilities, or an __instancecheck__.

jnothman · 2014-02-13T20:50:53Z

And it's possible to implement without a custom descriptor, just using @property as I did in #2019. I thought giving it a name, as I've done here, and allowing the methods to be implemented as methods (not getters), documented the code better.

A final alternative that avoids decorators is implementing __hasattribute__ but I think that's worst of all.

agramfort · 2014-02-14T17:15:53Z

I would not have expected that the GridSearch I initially wrote in 2010 would become such a beast (monster?). I am really -1 on this PR. There must be a simple solution. Why is using hasattr so bad for you?

GaelVaroquaux · 2014-02-14T17:19:48Z

I would not have expected that the GridSearch I initially wrote in 2010 would
become such a beast (monster?).

It's really too big. It's a problem. The code is hard to follow and bugs
keep popping up.

I think that there is a general problem of scope and organization in the
model selection and grid search code. It is growing organically and it is
going to collapse under its own weight.

jnothman · 2014-02-15T10:16:55Z

Why is using hasattr so bad for you?

hasattr only works with this sort of patch, sorry to say. Hasattr is all fine when the functional attributes of a class do not change depending on its instance. See #1805.

The code is hard to follow and bugs keep popping up.

Not bugs so much as edge cases.

I think that there is a general problem of scope and organization in the model selection and grid search code.

I think some recent refactoring has helped, but there is certainly room for improvement. But things like scorers, which are really there to patch over issues in API design, certainly help to give the appearance of a mess.

jnothman · 2014-02-15T10:21:51Z

I would not have expected that the GridSearch I initially wrote in 2010 would become such a beast (monster?).

Your original implementation used:

        if hasattr(best_estimator, 'predict'):
            self.predict = best_estimator.predict
        if hasattr(best_estimator, 'score'):
            self.score = best_estimator.score

Good luck pickling that object. Identify a better serialisation method and the code can be cleaner.

jnothman · 2014-02-15T23:57:01Z

This is a bug I've been considering how to fix for most of a year. If there "must be a simple solution", @agramfort, please help me find it, rather than rejecting for cosmetic/sentimental reasons.

agramfort · 2014-02-17T08:39:04Z

rather than having GridSearchCV add constraints on the estimators' implementations, which ultimately creates frameworkish code, I would patch GridSearchCV with properties. It might be more verbose / less elegant, but your solution, although pythonic, has the downside that newcomers who want to contribute need to be Python experts. scikit-learn needs more people able to write algorithms and boost training speed than Python experts.

so I would have done:

@ property
def predict(self, X):
      return self.best_estimator_.predict(X)

by reading this I know what it's doing immediately.

jnothman · 2014-02-17T10:42:35Z

Thanks. I wrote this descriptor because it made the property's purpose clearer by giving it a name. I'm happy to remove it, but your solution doesn't work.

I think what you meant to write is:

@property
def predict(self):
    return self.best_estimator_.predict

This is what we have in master (and which I wrote when I thought it was sufficient, but still unclear). But #2853 is broken still: hasattr(gscv, 'predict') must return True as long as gscv.estimator, not gscv.best_estimator_, has predict.

So a minimal fix with property is:

@property
def predict(self):
    self.estimator.predict  # determine hasattr response
    return lambda X: self.best_estimator_.predict(X)

Although explicit, I find this exceedingly verbose and hard to understand both its effect and its motivation at a glance. But it certainly works, and I'm happy to use this solution if there's some consensus that it's the better choice.
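
For readers unfamiliar with why the bare attribute access matters, here is a self-contained sketch of the property variant. `ToySearch`, `ToyRegressor` and `ToyTransformer` are hypothetical stand-ins, not scikit-learn classes. It relies on the fact that `hasattr` returns False when a property getter raises AttributeError.

```python
class ToyRegressor:
    """Hypothetical estimator that can predict."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ToyTransformer:
    """Hypothetical estimator without a predict method."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class ToySearch:
    """Hypothetical metaestimator using the property-based delegation."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.best_estimator_ = self.estimator.fit(X, y)
        return self

    @property
    def predict(self):
        # An AttributeError raised here makes hasattr(self, 'predict') False.
        self.estimator.predict
        return lambda X: self.best_estimator_.predict(X)


can_predict = hasattr(ToySearch(ToyRegressor()), "predict")
cannot_predict = hasattr(ToySearch(ToyTransformer()), "predict")
print(can_predict, cannot_predict)
```

In this sketch `hasattr` is already correct before `fit`, while actually calling `predict` on an unfitted `ToySearch` still fails, because `best_estimator_` does not exist yet.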

mblondel · 2014-02-18T02:33:01Z

self.estimator.predict # determine hasattr response

I'm not sure if this line is gonna work. Python may optimize it as a no-op.

jnothman · 2014-02-18T04:32:34Z

Is there anywhere you know of that suggests attribute lookup side-effects shouldn't be taken for granted?

I don't have a problem with the following equivalent:

@property
def predict(self):
    if hasattr(self.estimator, 'predict'):
        return lambda X: self.best_estimator_.predict(X)
    else:
        raise AttributeError('estimator cannot predict')

If that suits @agramfort, I'm happy to implement it, but it is somewhat repetitious.

jnothman · 2014-02-18T04:40:52Z

The main problem with this approach is it seems an awkward way to write it. A custom decorator makes it look a bit magic, but it seems to me a bit more straightforward to read.

larsmans · 2014-02-18T09:49:46Z

@mblondel It isn't a no-op since it can raise an exception, so Python cannot optimize it away without breaking its own semantics (this is why writing optimizing Python compilers is hard). I actually find that solution quite readable.
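
A small sketch of this point, using a hypothetical `Probe` class: an attribute lookup written as a bare expression statement is still evaluated in CPython, so any side effect or exception in the getter is observable.

```python
class Probe:
    """Hypothetical object whose property counts lookups and then fails."""

    def __init__(self):
        self.lookups = 0

    @property
    def predict(self):
        self.lookups += 1
        raise AttributeError("no predict")


def touch(obj):
    obj.predict  # expression statement; the result is discarded
    return "reached"


probe = Probe()
try:
    touch(probe)
    outcome = "no error"
except AttributeError:
    outcome = "AttributeError"
print(outcome, probe.lookups)
```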

jnothman · 2014-02-18T09:54:17Z

Which one is "that one"? I guess I need to number them now...

larsmans · 2014-02-18T09:56:26Z

This one.

@property
def predict(self):
    self.estimator.predict  # determine hasattr response
    return lambda X: self.best_estimator_.predict(X)

Maybe change the comment to "trigger AttributeError before call"

jnothman · 2014-02-18T09:59:21Z

I also should note that a solution like that's not entirely appropriate in Pipeline where lambda can't be used.

And I don't think "trigger AttributeError before call" is sufficient for people who don't realise triggering AttributeError will affect the hasattr response for a property... which is most Python devs, I would guess.

agramfort · 2014-02-18T10:04:27Z

awkward but explicit and simple :) magic often ends up bringing more problems than expected

jnothman · 2014-02-18T10:30:51Z

I think that depends on what you mean by explicit. I think we're talking
about a confusing property of Python attribute access, as a result of which
the behaviour is most explicit by naming a decorator. But that's my
opinion, which seems in the minority.

mblondel · 2014-08-13T04:23:27Z

It seems like @agramfort and @larsmans like the solution you suggested in #2854 (comment). Instead of abstracting things away, perhaps you should use this solution wherever needed (at the cost of some code duplication).

I also should note that a solution like that's not entirely appropriate in Pipeline where lambda can't be used.

Can't you use private methods in this case?

jnothman · 2014-08-13T06:12:15Z

Can't you use private methods in this case?

Yes, in one of my previous attempts to fix this bug in Pipeline I did exactly that, using a helper method, and received a critique along the lines of "flat is better than nested".

I think it has to be admitted that someone will dislike the solution to this because it can't be done super-neatly. Either we:

- give up support for ducktyping and indicate an estimator's traits some other way (bad in terms of compatibility/extensibility);
- use property as above (and as in my previous PR), the main cosmetic disadvantages being:
  - the reason for requiring a property (rather than what was used in the original implementation of GridSearchCV et al) remains implicit, or else needs to be documented in each application;
  - by returning a method rather than implementing the method directly, it looks like an unidiomatic method definition;
- use a custom descriptor that can be named to make its purpose clear, the main disadvantage of which is that the precise operation of that descriptor is locally unclear (but documented elsewhere).

Ultimately I think we need to fix this bug, and not worry too much about the cosmetics until someone can come up with a much better solution.

jnothman · 2014-11-13T10:34:14Z

(rebased)

larsmans · 2014-12-15T13:55:00Z

+1 for merge. I just got bit by this again, and it needs fixing.

amueller · 2014-12-15T15:02:17Z

sklearn/pipeline.py

 from .externals.six import iteritems

 __all__ = ['Pipeline', 'FeatureUnion']


-# One round of beers on me if someone finds out why the backslash
-# is needed in the Attributes section so as not to upset sphinx.
+iff_final_estimator_has_method = make_delegation_decorator('_final_estimator')


Do we need to give this name instead of just using the function?

amueller · 2014-12-15T15:14:57Z

I am also in favor of the general approach, but I am not 100% sure I understand why the implementation looks like it does.

amueller · 2014-12-15T16:29:39Z

sklearn/utils/metaestimators.py

+class _IffHasAttrDescriptor(object):
+    def __init__(self, fn, attr):
+        self.fn = fn
+        self.get_attr = attrgetter(attr)


what is the benefit of this way compared to storing attr and calling getattr(obj, attr)?

amueller · 2014-12-15T16:40:06Z

It seems to me you can simplify _iff_attr_has_method to

def _iff_attr_has_method(prefix, fn):
    attr = '%s.%s' % (prefix, fn.__name__)
    out = _IffHasAttrDescriptor(fn, attr)
    update_wrapper(out, fn)
    return out

Other than that 👍 for merge from me, too.
I find it a bit odd that we need to call update_wrapper twice, but I don't see a way around it.
Maybe say that _IffHasAttrDescriptor implements the descriptor protocol.

amueller · 2014-12-15T16:45:53Z

Actually, the update_wrapper in _iff_attr_has_method is not necessary, is it?

amueller · 2014-12-15T16:48:56Z

Doing

return lambda fn: _IffHasAttrDescriptor(fn, '%s.%s' % (prefix, fn.__name__))

in make_delegation_decorator and dropping _iff_attr_has_method seems more readable to me.
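
To show how that simplification could fit together, here is a rough sketch. The names `_IffHasAttrDescriptor` and `make_delegation_decorator` come from the diff under review, but the body of `__get__` below is an assumption written for illustration and is not the code in this PR or in #3982. `ToyMeta`, `HasPredict` and `NoPredict` are hypothetical.

```python
from functools import update_wrapper
from operator import attrgetter


class _IffHasAttrDescriptor(object):
    """Descriptor sketch: expose fn only if the dotted attribute exists."""

    def __init__(self, fn, attr):
        self.fn = fn
        self.get_attr = attrgetter(attr)
        update_wrapper(self, fn)  # copy __name__ and __doc__ from fn

    def __get__(self, obj, type=None):
        if obj is not None:
            # Raises AttributeError, so hasattr is False, when the
            # delegate lacks the method (assumed behaviour).
            self.get_attr(obj)
        # Bind the original function to the instance as usual.
        return self.fn.__get__(obj, type)


def make_delegation_decorator(prefix):
    return lambda fn: _IffHasAttrDescriptor(
        fn, '%s.%s' % (prefix, fn.__name__))


iff_estimator_has_method = make_delegation_decorator('estimator')


class HasPredict(object):
    def predict(self, X):
        return ["ok" for _ in X]


class NoPredict(object):
    pass


class ToyMeta(object):
    def __init__(self, estimator):
        self.estimator = estimator

    @iff_estimator_has_method
    def predict(self, X):
        return self.estimator.predict(X)


print(hasattr(ToyMeta(HasPredict()), 'predict'),
      hasattr(ToyMeta(NoPredict()), 'predict'))
```

With this shape the decorated method is still written as an ordinary method, and the descriptor only decides whether attribute access succeeds.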

amueller · 2014-12-18T18:29:32Z

@jnothman what do you think about that simplification?
I can do it if you like.

GaelVaroquaux · 2014-12-18T20:53:36Z

sklearn/utils/metaestimators.py

+        return out
+
+
+def _iff_attr_has_method(prefix, fn, name=None):


I have the impression that this function is used only once. Thus maybe not all the genericity is required (for instance, name is always None, so it can be removed). I think that by focusing on little things like this, the code can be made simpler.

GaelVaroquaux · 2014-12-18T21:38:56Z

Lots of great ideas here, and impressive feat of design.

However, I must say that I am worried about the code of the PR: it really is too much meta-programming code, that is hard to understand (I spotted in the discussion at least one person who said he was not following well).

I think that going along this direction is dangerous, and I am -1 on merging. This is the kind of code that we wrote a lot in mayavi2. The core devs (2 people, including me) were really enjoying it. We had a lot of features from such a meta-programming model. That got us a lot of users. But nobody converted to being a core dev. Such type of code is not good for including new people.

To give specific general advice, I would strive to: avoid decorators as much as possible (there will be some, in the final solution), avoid 'partial', and functools as much as possible.

More specifically, I have the feeling that some indirection can be avoided.

For instance, a line like 'iff_estimator_has_method = make_delegation_decorator('estimator')', which actually defines a decorator using a factory that lives in a different file, is very hard to follow. Multiple indirections are just something that the human brain is not good at.

_IffHasAttrDescriptor is a class, that is used only once, inside _iff_attr_has_method, that is itself used only once, inside make_delegation_decorator (@amueller made a suggestion that removes an indirection, actually).

Here is a high-level suggestion to make the design easier: instead of decorating existing methods, rename them: predict_proba would become _predict_proba. Then you can write a simpler property, because it has less work to do. Maybe it is enough to write it explicitly:

    @property
    def predict_proba(self):
        # raises AttributeError (so hasattr is False) if the final estimator lacks it
        self.final_estimator.predict_proba
        return self._predict_proba

    def _predict_proba(self, X):
        ...

I can see that the argument against this is to avoid repeating oneself: variants of these lines of code will be in many places in the code base. But, to replace these 4 lines of fairly simple code by 1 line (thus gaining only 3 lines), this PR adds a very non-trivial file with 133 lines. Thus, in terms of mere lines of code, the pattern should occur 44 times for the meta-programming to be worth it. This count is actually probably dishonest, because the code above needs at least one line of comment. However, I think that even if it happened more, I would still prefer the low-tech solution, because it is easier to understand and maintain.
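
For concreteness, the property plus private method pattern suggested above might look like this in a toy pipeline, where the delegated method has to run the earlier steps first. `ToyPipeline`, `Scale` and `Threshold` are hypothetical classes made up for this sketch and do not reflect the real Pipeline implementation.

```python
class Scale:
    """Hypothetical transformer that multiplies inputs by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, X):
        return [x * self.factor for x in X]


class Threshold:
    """Hypothetical final estimator that can predict."""

    def predict(self, X):
        return [int(x > 1) for x in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    @property
    def _final_estimator(self):
        return self.steps[-1]

    @property
    def predict(self):
        # AttributeError here hides predict from hasattr.
        self._final_estimator.predict
        return self._predict

    def _predict(self, X):
        # Apply every step but the last, then delegate to the final one.
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self._final_estimator.predict(X)


classifier_pipe = ToyPipeline([Scale(2), Threshold()])
transformer_pipe = ToyPipeline([Scale(2), Scale(3)])
print(classifier_pipe.predict([0.4, 0.6]))
print(hasattr(transformer_pipe, "predict"))
```

The cost is the one discussed in this thread: each delegated method needs its own property and private method pair.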

amueller · 2014-12-18T21:45:17Z

@GaelVaroquaux are you talking about the tests?
The implementation is currently 60 lines, which are mostly documentation. With my suggestion, it boils down to 10 lines of actual implementation.

GaelVaroquaux · 2014-12-18T21:47:34Z

Doh! You are right, I read the wrong file. Those tests are actually super useful, they need to be kept.

Maybe I'd like to see your suggestion coded. I don't have an exact feeling of how much it clears things up.

ogrisel · 2014-12-18T22:13:48Z

I am also undecided on the current solution. The code is compact, well tested but hard to follow.

Using @GaelVaroquaux's explicit (property + private method) pairs, on the other hand, while explicit, introduces many redundant lines. I am also curious to see how the simplifications suggested by @amueller would look.

Also I wonder if introducing a generic fixed name such as @iff_delegate_has_method or even just @if_delegate_has_method would not make the name-based introspection magic go away to make this wrapper code easier to follow while the name of the decorator would still be explicit enough.

~~I agree that in the case of the pipeline, the fact that delegate is the final estimator of the chain is not trivial, but this can be made explicit just by adding an inline comment.~~

Edit: sorry the above does not make sense.

amueller · 2014-12-18T22:14:20Z

See #3982.
~~I haven't understood why we need attr_getter yet :-/~~
ok never mind, it is simply for resolving the dot (.) in the nested attribute name.

jnothman · 2014-12-18T23:37:29Z

We don't need attrgetter. We should probably just store attr on the descriptor, and then do getattr(obj, attr). Its purpose is to raise an appropriate AttributeError. It could be written more explicitly (but perhaps not passing on as much information):

if not hasattr(obj, self.attr):
    raise AttributeError

amueller · 2014-12-19T18:09:41Z

As far as I know, that doesn't work.
attr is something like estimator.decision_function, and hasattr and getattr don't work with nested names. You would need to split on the dot and do two calls, as far as I can see.
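
A quick sketch of the difference, with hypothetical `Inner` and `Outer` classes: `hasattr` and `getattr` treat a dotted string as one literal attribute name, whereas `operator.attrgetter` resolves each dotted component in turn.

```python
from operator import attrgetter


class Inner:
    def decision_function(self, X):
        return X


class Outer:
    def __init__(self):
        self.estimator = Inner()


outer = Outer()
name = "estimator.decision_function"

# The dotted string is looked up as a single attribute name, so this is False.
flat_lookup = hasattr(outer, name)

# attrgetter walks the dotted path and returns the bound method.
nested = attrgetter(name)(outer)

# Equivalent manual version: split on the dot and chain getattr calls.
manual = outer
for part in name.split("."):
    manual = getattr(manual, part)

print(flat_lookup, nested([1, 2]), manual == nested)
```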

amueller · 2014-12-19T20:56:20Z

Curiously the decision_function of pipelines is documented without call signature, while the decision_function of GridSearchCV has decision_function(*args, **kwargs)
The link to the source points to https://github.com/scikit-learn/scikit-learn/blob/a1510a1/sklearn/utils/metaestimators.py#L37 for both, unfortunately. I'm not sure why update_wrapper didn't take care of this.

jnothman · 2014-12-20T10:15:53Z

Oh of course. The attrgetter was intended to handle nested names. Forgot.

And yes, this sphinx thing has the potential to get messy :(

larsmans · 2014-12-27T13:00:30Z

Superseded by #3982, which was just merged.

jnothman mentioned this pull request Jul 26, 2014

GridSearchCV doesn't have decision_function before fit #3484

Closed

jnothman mentioned this pull request Aug 12, 2014

[MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary #3552

Closed

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

jnothman mentioned this pull request Nov 13, 2014

BUG: cross_val_score ignores scoring when estimator is a GridSearchCV object #3848

Closed

jnothman added 3 commits November 13, 2014 21:33

TST/FIX ensure correct ducktyping for metaestimators

b3259ee

TST extend ducktype testing to handle scikit-learn#2853 case

be9edb7

FIX ducktyping for meta-estimators

262ab91

TST more rigorous testing of delegation

52f2812

jnothman force-pushed the meta-ducktyping-new branch from 81096ef to 52f2812 Compare November 13, 2014 10:34

larsmans changed the title ~~[MRG] Ensure delegated ducktyping in MetaEstimators~~ [MRG+1] Ensure delegated ducktyping in MetaEstimators Dec 15, 2014

amueller reviewed Dec 15, 2014
View reviewed changes

GaelVaroquaux reviewed Dec 18, 2014
View reviewed changes

amueller mentioned this pull request Dec 18, 2014

[MRG+2?] Metaestimator delegation #3982

Merged

larsmans closed this Dec 27, 2014
