
WIP GridSearch with advanced score functions. #1198


Closed

Conversation

@amueller
Member

amueller commented Oct 2, 2012

Follow-up on #1014.
This PR handles score functions in the regressor and classifier mixins with minimal intrusion into the code.
This enables estimators to override the score function and do whatever they see fit - though that shouldn't be necessary at the moment.

To complete this PR, I will probably also need to make some minor adjustments to the other CV classes.

I'm a bit tempted to add a test to the "common tests" to check that cross_val_score with AUC works.

@@ -447,11 +444,4 @@ def _fit(self, X, y):
        return self

    def score(self, X, y=None):
        if hasattr(self.best_estimator_, 'score'):
            return self.best_estimator_.score(X, y)
        if self.score_func is None:
Member Author

This code was untested and also unreachable afaik.
How should the grid-search be done without a score function anyway?

Member Author

hm forgot about the loss_function case. should fix that.

@amueller
Member Author

amueller commented Oct 2, 2012

Btw I think this solution is more elegant than #1014 and I am quite happy with it.
cc @ogrisel @GaelVaroquaux :)

@GaelVaroquaux
Member

cc @ogrisel @GaelVaroquaux :)

Give me till the weekend :}

def requires_thresholds(f):
    """Decorator to tag scores that need continuous inputs."""
    f.requires_thresholds = True
    return f
Member

We could even check the input in the wrapper and yield a meaningful error message if the second input array has a 1D shape.

Member Author

how does that work? Write a wrapper function for f that does the check and then applies f?

Member

yes
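
A minimal sketch of that wrapper idea (not the PR's actual code; the concrete check below, rejecting a 1-d second argument, just follows the suggestion above and is an assumption):

import functools
import numpy as np

def requires_thresholds(f):
    """Tag a metric as needing continuous inputs and check its second argument.

    Sketch only: the 1-d check follows the review comment above and is an
    assumption, not the PR's implementation.
    """
    @functools.wraps(f)
    def checked(y_true, y_score, **kwargs):
        y_score = np.asarray(y_score)
        if y_score.ndim == 1:
            raise ValueError(
                "%s needs continuous decision values (e.g. the output of "
                "decision_function), got a 1-d array instead." % f.__name__)
        return f(y_true, y_score, **kwargs)
    checked.requires_thresholds = True
    return checked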

@amueller
Member Author

Ok so two things:

• From #1210: should we remove the unsupervised case from gridsearchcv? It doesn't really make any sense.
• For score-functions to work on SVC, it either needs to reimplement score or it should return a sensibly-shaped decision_function. Which way should we go?

@GaelVaroquaux
Member

On Sun, Oct 14, 2012 at 01:40:52PM -0700, Andreas Mueller wrote:

• From #1210: should we remove the unsupervised case from gridsearchcv?
It doesn't really make any sense.

Really. It seems to me that as long as there is a well-defined 'score',
GridSearchCV makes sense in unsupervised settings. In the case of things
like a mixture model, the likelihood makes sense as a score.

• for score-functions to work on SVC, it either needs to reimplement
score or it should return a sensibly-shaped decision_function. Which
way should we go?

I'd say return a sensibly-shaped decision_function, as the problem will
arise in other situations.

G

@agramfort
Member

Really. It seems to me that as long as there is a well-defined 'score',
GridSearchCV makes sense in unsupervised settings. In the case of things
like a mixture model, the likelihood makes sense as a score.

Take FactorAnalysis: now that you have a score function, you can optimize
the number of components with GridSearchCV :)

@amueller
Member Author

@agramfort @GaelVaroquaux My reasoning was that you don't need to use a validation set for unsupervised tasks. You can just optimize parameters on the training set.
And GridSearchCV doesn't evaluate the score in the training set.

How I would do it is GridSearch on the training set, then optionally test once on the hold-out set to see how it generalizes.

Would you use another pattern? And do you have any examples where this was useful?

@GaelVaroquaux could you please also comment on the inheritance issue I raised?

@agramfort
Member

@agramfort @GaelVaroquaux My reasoning was that you don't need to use a validation set for unsupervised tasks. You can just optimize parameters on the training set.

I agree if the training set is also split in 2.

And GridSearchCV doesn't evaluate the score in the training set.

How I would do it is GridSearch on the training set, then optionally test once on the hold-out set to see how it generalizes.

I agree, but you still need GridSearch with unsupervised models.

@amueller
Member Author

Sorry, bad wording. I would grid search on the training set - not use GridSearchCV.

There is no cross-validation going on.

I am not saying we don't need unsupervised grid search. What I was thinking was that supporting unsupervised grid search in GridSearchCV is weird, as you don't need cross-validation. I'd rather have a separate estimator that does grid search without cross-validation and evaluates the score on the training set.

@amueller
Member Author

I am not so experienced here and unsupervised model selection is not that well explored - so if you think cross-validation is appropriate, ok. Then I'd like a reference and an example ;)

@agramfort
Member

I am lost. To me, GridSearchCV is doing exactly this when fit, although it
can afterwards be used to predict, which is not relevant for unsupervised models.

@GaelVaroquaux
Member

I am not so experienced here and unsupervised model selection is not that well
explored - so if you think cross-validation is appropriate, ok. Then I'd like a
reference and an example ;)

http://www.compstat2012.org/Proceedings_COMPSTAT2012.pdf

search for 'varoquaux'.

You also want an example in the scikit's examples, right? I have in mind,
of course, setting the sparsity of a sparse PCA. The problem is that it
tends to take a lot of time. Also, the SparsePCA object that we have in
the scikit does not have a score method, as it is not universally accepted.

I'll try to think of something...

@ogrisel
Member

ogrisel commented Oct 17, 2012

Even in an unsupervised setting you need to compute the score (e.g. negative log-likelihood) on a held-out test set to avoid overfitting (yes, "overfitting" can happen in an unsupervised setting). So let's keep the unsupervised setting and write a test with FA instead.

Also, we should improve the documentation of the KMeans.score method to make clear that it should not be used for grid searching of k, as it will always select the largest k irrespective of the data.

@ogrisel
Member

ogrisel commented Oct 17, 2012

But even for FA, the larger the number of components, the higher the neg log-likelihood will be, isn't it?

@amueller
Member Author

@agramfort Sorry, I must have been unclear. GridSearchCV splits the training data for each iteration. It trains on one part and scores on the other. I would rather fit and score on the same data. In particular, for most clustering algorithms, you cannot predict and therefore cannot score on anything but the training set - ergo GridSearchCV cannot be used.

If you think that this grid search is helpful, then let's just do a FA example as @ogrisel suggests.

We still need a different estimator for the clustering algorithms that don't have a predict.

@GaelVaroquaux
Member

But even for FA, the larger the number of components, the higher the neg
log-likelihood will be, isn't it?

Most often yes, I believe. That's why we need optional penalizations (e.g.
BIC, AIC). We could add a constructor argument for this. It would also be
useful for clustering, or other settings.

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 05:39:40AM -0700, Andreas Mueller wrote:

@agramfort Sorry, I must have been unclear. GridSearchCV splits the training
data for each iteration. Trains on one, scores on the other. I would rather fit
and score on the same.

No. You will never be able to control for overfitting, i.e. fitting
noise, when you do this.

We still need a different estimator for the clustering algorithms that don't
have a predict.

Agreed.

G

@amueller
Member Author

Coming back to the real content of the PR:
It requires that all estimators we want to grid-search have a score function and that it takes a score_func argument (which can be None).

Is this ok with everybody?

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 05:42:38AM -0700, Andreas Mueller wrote:

It requires that all estimators we want to grid-search have a score function
and that it takes a score_func argument (which can be None).

Is this ok with everybody?

No, I don't think so. I may want to apply my own score_func to an
estimator that has no score. I do this often.

I don't understand why you want to remove this possibility.

@amueller
Member Author

Because then different score-functions would be handled very differently.
The complicated score-functions like auc_score would be handled by the score of the estimator, whereas other score functions would be handled by GridSearchCV.
This seems like a weird fragmentation.

Can you give an example of using grid-search on an estimator without a score function?

@amueller
Member Author

Btw, we could put a score function into the BaseEstimator that takes a mandatory score_func argument.
Then you could use it on all estimators.

Btw, do you really use GridSearchCV on estimators that don't inherit from BaseEstimator?

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 05:48:21AM -0700, Andreas Mueller wrote:

Can you give an example of using grid-search on an estimator without a score
function?

Any clustering algorithm, if I have a specific logic to set the number of
clusters.

If you really don't like 'score_func', we can deprecate it and remove
it, with the message that people should subclass the estimator for this
use case. I remember that I wasn't too excited about it in the first
place, but that was a long time ago, and now it is somewhat part of the
design.

@amueller
Member Author

Sorry, I think there is really a misunderstanding. I like score_func and I want to keep it.
I just want the estimator to handle it, not grid-search.
Maybe have a look at the code.

about your example: sorry, that doesn't exist.
There is no clustering in sklearn that defines a predict (which is necessary for GridSearchCV) but doesn't define a score.

@amueller
Member Author

Btw, I am totally against subclassing for different score functions.
Having the estimator handle the score function was your idea. I just wanted to have the score function handled the same way for all estimators.

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 02:34:58PM +0200, Gael Varoquaux wrote:

I am not so experienced here and unsupervised model selection is not
that well explored - so if you think cross-validation is appropriate,
ok. Then I'd like a reference and an example ;)

http://www.compstat2012.org/Proceedings_COMPSTAT2012.pdf

search for 'varoquaux'.

Other example:
http://books.nips.cc/papers/files/nips23/NIPS2010_1054.pdf

I do this all the time.

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 05:50:36AM -0700, Andreas Mueller wrote:

Btw, we could put a score function into the BaseEstimator that takes a
mandatory score_func argument.
Then you could use it on all estimators.

I don't understand the logic. How would 'score_func' be used in the score
of the estimator? This would violate the API of 'score'.

Btw, do you really use GridSearchCV on estimators that don't inherit from
BaseEstimator?

I don't often. However, it is good design not to enforce inheritance.
Think about it: I want to have a library that provides objects usable
with scikit-learn, but without a scikit-learn dependency.

That's what duck typing is about. It's used all over Python: you can use
the following function on numpy arrays, even though it knows nothing about
numpy arrays:

def f(x):
    return (x**2 + 1) / 3.

Inheritance is what kills a large C++ codebase (VTK, I am looking at
you). It means that all your design decisions have huge impacts all over
the codebase.

In my experience, relying on inheritance for a codebase is bad design.
One of the impacts it has is that it leads to overly complex objects
with many methods, as there is no cost to adding methods to classes in
the inheritance diagram. It makes it really hard to fit in from the
outside, and introduces tight coupling in the codebase. While I like
object-oriented programming, I think that inheritance is the wrong answer
to a problem that is really a problem of interface. I think that
object-oriented design should be done by defining how an object behaves,
i.e. its 'surface', and not what it inherits from.

@amueller
Member Author

Btw, we could put a score function into the BaseEstimator that takes a
mandatory score_func argument.
Then you could use it on all estimators.

I don't understand the logic. How would 'score_func' be used in the score
of the estimator? This would violate the API of 'score'.

By default, score would call predict and then the score_func.
Why would that violate the API of score?
Have you had a look at this PR?

Btw, do you really use GridSearchCV on estimators that don't inherit from
BaseEstimator?

I don't often.

So you did reimplement get_params and set_params?

However, it is good design not to enforce inheritance.

I thought we already enforced inheritance.

Think about it: I want to have a library that provides objects usable
with scikit-learn, but without a scikit-learn dependency.

I think this is impossible already as pointed out above.
We can try to get rid of the get_params and set_params
but that is a different story altogether.

I just wanted to get auc_score to work in a way that is not totally hacky.
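
For context, here is a rough sketch of the mixin-level handling being discussed (hypothetical names and heavy simplification, not the PR's actual diff): score takes an optional score_func, defaults to accuracy, and switches to decision_function when the metric is tagged as needing thresholds.

import numpy as np

class ClassifierScoreMixin(object):
    """Sketch of a classifier mixin whose score() accepts a score_func.

    Illustrative only; names and details are assumptions, not this PR's code.
    """

    def score(self, X, y, score_func=None):
        if score_func is None:
            # default behaviour: mean accuracy on hard predictions
            return np.mean(self.predict(X) == y)
        if getattr(score_func, "requires_thresholds", False):
            # metrics such as auc_score need continuous decision values
            return score_func(y, self.decision_function(X))
        # ordinary metrics such as f1_score work on hard predictions
        return score_func(y, self.predict(X))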

@amueller
Member Author

So the solution to the problem is then:

try:
    if score_func is None:
        score = estimator.score(X[test], y[test])
    else:
        score = estimator.score(X[test], y[test], score_func=score_func)
except (TypeError, AttributeError):
    # estimator.score is missing or doesn't accept score_func
    if score_func.requires_thresholds:
        raise TypeError("score func passed that needs thresholds, "
                        "but the estimator can't accept it.")
    else:
        score = score_func(y[test], estimator.predict(X[test]))

The only problem here is that GridSearchCV needs to know about requires_thresholds, which I wanted to avoid
because of your comments on #1014.

So I don't really see how it is possible to leave GridSearchCV unaware of requires_thresholds without requiring a stricter interface for score.

@amueller
Member Author

The above "solution" has the disadvantage that it breaks when using auc_score on an estimator without a score method, while it works with f1_score on the same estimator, which I find very confusing.

@GaelVaroquaux
Member

On Wed, Oct 17, 2012 at 05:56:50AM -0700, Andreas Mueller wrote:

Sorry, I think there is really a misunderstanding. I like score_func and I want
to keep it.
I just want the estimator to handle it, not grid-search.
Maybe have a look at the code.

OK, sorry for not looking at the code and discussing the PR based on the
comments. I am struggling between an ever-filling mailbox, the need to
work on my day-job with my colleagues, and the scikit-learn.

I now looked at the code. I am afraid that I disagree with the PR: I
think that it is overfitting to specific needs and that a lot of the things
that I do with the scikit would no longer be possible. In addition, I
dislike the decorator logic, which is creating a mini-language that feels
very magic and will not scale to different use cases.

Below follows a very long email that tries to argue why I think that it
is a bad design, and suggests other options. However, the gist of my
point is that you are basically hard-coding a specific use case
(classification, with scores that require non-thresholded predictions), and not
resolving problems for other use cases (unsupervised learning,
multi-output...).

I think that by asking the score method of an estimator to take a
'score_func' argument, you are over-specifying what happens in the score
method. The reason behind this is that the score method of an estimator
has object-specific logic. As a consequence, I don't see a way to specify
the signature of 'score_func' that could be used inside an estimator's
score method and wouldn't be geared toward a specific use (e.g.
classification).

That said, I agree with you that we have a design problem with the way
that score_func is implemented. I personally think that we rushed a bit
when we specified this early on (you weren't around, and I'll take the
full blame). As a result, that 'score_func' argument is not as useful as
it could/should be.

This is a much bigger discussion, but I would personally much prefer it if
score_func's signature were 'estimator, X_test, y_test'. That would
open the door to many, many usage patterns. It has the drawback of
breaking backwards compatibility and requiring an adaptation layer to be
able to use functions like auc_score in grid search. This is maybe a
problem in itself, but I think that it is a core problem: these functions
are not rich enough and need an adaptation layer to be useful as black
boxes.

Quoting another of your comments, about using
'score_func.requires_thresholds' in GridSearchCV:

The only problem here is that GridSearchCV needs to know about
requires_thresholds, which I wanted to avoid because of your comments on
#1014.

I think that the core problem is that auc_score, like any score in
metrics, knows nothing of how the score should be obtained. By design,
they cannot, because they take only predictions as an input. I believe
that the right way to address this limitation relies on an imperative
mechanism that can interact with the estimator to produce the scores. For
instance, if I want to do a BIC-like approach and penalize my number of
errors by a measure of the complexity of the model, I need access to the
estimator. However, this seems very estimator-specific to me, and this is
why I would favor an approach based on encoding as much of this logic as
possible in the estimator.

Is there something you don't like about the idea of having an
estimator-level parameter that would switch the 'score' function between
zero_one, auc_score, or precision_score? It does add a
parameter to the estimator, but it would be both discoverable and very
general. For FA, clustering estimators, or the mixture models, using
the same pattern, we could switch between penalized likelihoods (BIC,
AIC) and unpenalized ones.

As a side note, to justify this last use case: for such models, it does
make sense to penalize, even in a cross-validation setting, as the
cross-validated likelihood tends to monotonically increase with the
complexity of the model. See for instance the Minka paper on PCA
dimensionality selection:
http://vismod.media.mit.edu/pub/tech-reports/TR-514.pdf figure 4, where
it does increase, but barely, or my paper on a similar problem:
http://hal.archives-ouvertes.fr/docs/00/58/88/98/PDF/paper.pdf figure 3,
where I show that a bit of mismodeling (here, not modeling the smoothness
of the data) induces a monotonically increasing cross-validated score.

about your example: sorry, that doesn't exist.
There is no clustering in sklearn that defines a predict (which is necessary
for GridSearchCV) but doesn't define a score.

Good. I am happy. Indeed, I'd like as many objects as possible to define score
functions. However, should we make this behavior mandatory? My policy is
always to assume as little as possible about the usage patterns of the
scikit. In my research group, we work a lot on how to set parameters, and
our conclusion is often that it depends on the end goal, and that there
is no one-size-fits-all solution.

Sorry for the long email. I hope that my arguments are somewhat
understandable.
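
As an illustration of the score_func(estimator, X_test, y_test) signature suggested above, here is a sketch of a small adapter that wraps an existing metric such as auc_score into that form (make_estimator_scorer and needs_thresholds are hypothetical names, not an agreed API):

def make_estimator_scorer(metric, needs_thresholds=False):
    """Wrap a metric with signature (y_true, y_pred) into a callable with
    signature (estimator, X_test, y_test). Sketch only; needs_thresholds
    mirrors the requires_thresholds tag discussed earlier in this thread."""
    def scorer(estimator, X_test, y_test):
        if needs_thresholds:
            # continuous decision values for metrics such as auc_score
            y_pred = estimator.decision_function(X_test)
        else:
            y_pred = estimator.predict(X_test)
        return metric(y_test, y_pred)
    return scorer

# Grid search could then evaluate every candidate uniformly:
#     score = scorer(estimator, X[test], y[test])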

@amueller
Member Author

Thanks for the response. Sorry, I was a bit frustrated after my second proposal was rejected.
This seems a bit harder to do right than I initially hoped.
I am not totally happy with the decorator either, but I couldn't come up with a much better way.

Do I understand the "estimator parameter" idea correctly:
the estimator gets a score_func (which would be a function, as there are quite a lot of classification metrics), remembers it, and will call it in "score"?
It seems to me the required change to this PR would mainly be to move the score_func from score to __init__.
We would still need a way to tell the estimator how to use the score function.
And it would still be implemented in the Mixin ;)

If you specify a score_func in the estimator, and one in GridSearch, it is not clear at all to the user which one will be used. But following your argument, we cannot drop the ability to use score functions in GridSearch, as you want to use score functions with estimators that don't know anything about score functions.
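
To make the constructor-level variant concrete, a rough sketch of what moving score_func to __init__ could look like (MajorityClassifier is a toy, hypothetical estimator, not part of sklearn):

import numpy as np

class MajorityClassifier(object):
    """Toy estimator illustrating a constructor-level score_func (sketch only)."""

    def __init__(self, score_func=None):
        self.score_func = score_func

    def fit(self, X, y):
        # remember the most frequent class
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        if self.score_func is None:
            return np.mean(self.predict(X) == y)
        return self.score_func(y, self.predict(X))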

@amueller
Member Author

Btw, I totally agree that we should aim at minimizing requirements on estimators and the API.
Also, we shouldn't make our mixins too bloated and should keep methods to a minimum.
I just didn't / don't see any workable alternatives.

About making some applications impossible: I will always take great care not to break anything within the scikit.
Not breaking anything you build with your own estimators on top of the scikit is much harder. I have no idea what assumptions you made - and I don't want to overfit to your homebrew code too much ;)
If we do want to support homebrew code, we have to specify an API. Otherwise I have nothing to work with.

@agramfort
Member

Here is a quick and dirty demo script if it can help:

agramfort@4ba3d3b

HTH

@amueller
Member Author

@agramfort Thanks :)
You guys convinced me that it is a good idea. Sorry for mixing up this issue with this PR.

@amueller
Member Author

@GaelVaroquaux Btw, this PR is the third iteration of trying to integrate the score function into the estimator. I tried making it a constructor argument, but after adding it to half of the estimators in sklearn I was convinced that it is not a good solution and dropped the stash.

I'd really like to include grid search with auc_score and probably also logistic_score (there is a PR).
One of the reasons is that everyone who used sklearn for Kaggle needed to hack this functionality into sklearn.
That doesn't seem right.

@agramfort
Member

@agramfort Thanks :)

no problem. If you feel like adding this test please do so.

maybe @jaquesgrobler can also take a look at it :)

@amueller
Member Author

amueller commented Feb 3, 2013

Closed via #1381.
