[WIP] Estimator.iter_fits for efficient parameter search #2000

Closed
wants to merge 5 commits

Conversation

jnothman
Member

This implements an approach to generalised cross validation (#1626) by allowing each estimator to control its fitting over a sequence of parameter settings. estimator.iter_fits returns an iterator (usually a generator) over parameters and models fit with those parameters.

This PR is intended to open discussion, rather than to finalize the implementation in the near term.

The default implementation allows an estimator to specify parameters which, when changed, do not require fit to be called again. Estimators that can warm start from the model attributes learnt with the previous parameters may exploit that fact; in particular, this approach makes it explicit that the same data is being used, which multiple calls to fit does not. A Pipeline resolves its ordering through a depth-first traversal of the parameters affecting each pipeline step (ordered by that step's estimator), and the use of generators means the output of transform from higher levels is retained on the stack; this adds a memory cost (it may be possible to reduce it optionally later) but cuts a lot of repeated work. The model-iterator approach also means regularization paths can be easily incorporated if rewritten as generators.
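To make the mechanism concrete, here is a minimal sketch of what a default iter_fits generator might look like. It is illustrative only, not the PR's code: the REFIT_FREE set stands in for whatever attribute an estimator would use to declare parameters that do not require refitting, and the parameter name 'threshold' is made up.

```python
from sklearn.base import clone

# Hypothetical attribute: parameters that can be changed on a fitted model
# without refitting (e.g. a prediction-time threshold). The PR describes
# estimators declaring such parameters; the name used here is made up.
REFIT_FREE = {'threshold'}

def iter_fits(estimator, candidate_params, X, y=None):
    """Yield (params, fitted model) for each candidate setting, refitting
    only when a parameter outside REFIT_FREE changes.  Assumes scalar
    parameter values (comparing ndarrays would be ambiguous, as the
    caveats below note)."""
    model, fitted_with = None, {}
    for params in candidate_params:
        needs_fit = model is None or any(
            fitted_with.get(k) != v
            for k, v in params.items() if k not in REFIT_FREE)
        if needs_fit:
            model = clone(estimator).set_params(**params)
            model.fit(X, y)
            fitted_with = dict(params)
        else:
            model.set_params(**{k: v for k, v in params.items()
                                if k in REFIT_FREE})
        # The same object may be yielded repeatedly; callers must copy
        # anything they need before advancing the iterator.
        yield params, model
```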

Grid search (cv=3, n_jobs=1) over a simple pipeline of (StandardScaler, SelectKBest, ElasticNet) ran in 67% of the baseline time (repeated calls to Pipeline.fit) using this implementation, with the following call counts:

| Method | calls @ master | calls @ iter_fits |
| --- | ---: | ---: |
| StandardScaler.fit | 241 | 7 |
| StandardScaler.transform | 481 | 247 |
| SelectKBest.fit | 241 | 7 |
| SelectKBest.transform | 481 | 265 |
| ElasticNet.fit | 241 | 241 |

(Note: many calls to transform are at predict time, which is not affected by this change.)

Caveats and comments:

  • iter_fits will often overwrite attributes of the same object in each iteration. It is therefore necessary to copy any required data off the models, or call any of their methods, while iterating, before the next setting overwrites them. This may make it a dangerous method for public use, and it may be worthwhile to require that each iteration yield a clone of the estimator, at some added expense and complication.
  • a common pattern for implementing iter_fits is to group the list of parameter settings by the values of some "higher-order" parameters (those requiring substantial work at each change), within which groups the lower-order parameters are iterated (see the sketch after this list). Thus:
    • the parameter settings need to be grouped by each estimator, making the dict form in which parameters are transmitted cumbersome
    • parameter values may be objects that are unorderable, unhashable and/or ambiguous under boolean comparison (ndarray is all of these), making the grouping operation and other generic parameter-equality checks non-trivial to code
    • in many cases it should instead be possible to order merely on the basis of parameter names and some simple ordering rules for each parameter, resulting in less repeated reordering work (work that scales with the number of parameter names, not the number of parameter-value combinations). Setting a pipeline step ([MRG + 2] ENH enable setting pipeline components as parameters #1769) presents a case where parameter names alone are not sufficient information for an estimator to return an ordering.
  • the user's intended candidate ordering, and the manner of its generation, are ignored. This makes the solution generic, but it is harder to exploit the properties of a grid, for example. The reordering also means parameter search may return a different best candidate with the same top score.
  • since the ordering is performed at fit time, each fold of a cv-search process may operate in parallel. To recombine the outputs, it is easiest if the reordering is deterministic, and therefore the same across all folds. Also because the reordering is coupled with fitting, parallelisation beyond one-process-per-fold may require arbitrary splits in the candidate sequence.
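
As an illustration of the grouping pattern described in the second caveat above, here is a sketch using itertools.groupby. The parameter names (select__k, enet__alpha) are hypothetical, and the sorting assumes the higher-order values are comparable, which, as noted, fails for values such as ndarrays.

```python
from itertools import groupby

def group_candidates(candidate_params, higher_order=('select__k',)):
    """Sort candidates so that settings sharing the same values for the
    expensive ("higher-order") parameters are adjacent, then yield one
    group at a time.  The expensive work (e.g. refitting a transformer)
    would be done once per group."""
    def key(params):
        return tuple(params[name] for name in higher_order)

    for group_key, group in groupby(sorted(candidate_params, key=key), key):
        yield group_key, list(group)

# Hypothetical grid: ElasticNet alphas (cheap, could follow a path) nested
# inside SelectKBest k values (each change alters the transformed data).
candidates = [{'select__k': k, 'enet__alpha': a}
              for a in (0.1, 1.0, 10.0) for k in (5, 10)]
for k_value, settings in group_candidates(candidates):
    print(k_value, settings)
```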

@GaelVaroquaux
Member

I would like to stress the need to avoid method proliferation. Every
problem can be solved by adding an object or a method. In the short term,
it seems like a viable solution. However, in the long term, all it does
is move the package further away from the reach of users and developers.
People who know the package and already use it are not hindered by the
gradual increase in sophistication, but the fraction of newcomers willing
to invest in it will decrease. Sitting inside, it is easy not to see the
steep curve.

More sophistication can appear to hide complexity from users, by making
things 'just work', but in fact all it does is make it harder to move
away from the prototypical usage pattern, making the package less
flexible for the layman. I have not always thought this way, and one
example is Mayavi. Mayavi is used a lot: we have managed to make simple
use cases very easy, despite a complex object-oriented framework.
However, we have no contributions from outside the core team, because it
is too hard to get into. I am now very worried about compromising a
project's development.

For all the reasons above, I would like to raise the question: can the
problem be solved without adding a new method? I am sure that at first
sight it does not seem so. However, it helps to break the problem down
into smaller problems and recurring patterns. I agree with you that the
idea of a regularization path is important and recurring. It is
under-used in the scikit-learn codebase. Another recurring pattern is not
wanting to fit the same thing twice, in particular with transformers in a
pipeline. Here, a pattern that works fairly well is using a joblib Memory
object. I'd love to see it expanded in scikit-learn, with many
transformers taking an optional 'mem' argument (currently only feature
agglomeration does that, but things like feature selection would very
much benefit from it).
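
As a sketch of this memoization pattern (the 'mem' argument on transformers is a proposal here, not an existing API, so the caching below simply wraps the expensive step externally with a joblib Memory object):

```python
from joblib import Memory
from sklearn.feature_selection import SelectKBest, f_classif

# Results are cached on disk; calling again with the same (X, y, k)
# loads the reduced data instead of recomputing it.
memory = Memory('/tmp/sklearn_cache', verbose=0)

@memory.cache
def select_k_best(X, y, k):
    """Fit SelectKBest and return the reduced data.  Sweeping a downstream
    parameter (e.g. ElasticNet's alpha) over the same (X, y, k) then avoids
    repeating the selection work."""
    return SelectKBest(f_classif, k=k).fit_transform(X, y)
```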

Also, it is important to weigh the gains against the costs. A reduction
of the running time to 67% of baseline is not that much compared to the
cost of adding a new method; I usually do not bother for such an
improvement. I expect that adding strong rules to elastic net would give
a 2 to 5 times speedup in your example, without requiring any addition to
the interfaces of the estimator objects.

Finally, the way I like to work is usually to push as hard as I can with
the framework I already have, making sure that there is no low-hanging
fruit, before I add sophistication. Cleaning up the path pattern to make
it more uniform in the codebase (for instance as started by PR #1947) and
making better use of memory would be the first things that I would
attempt.

I hate sending a long email like this: you have been incredibly active,
with high-quality contributions. I just notice a trend toward complexity,
while there is still a lot of work to be done on the simple things.

@jnothman
Member Author

Thanks for your comments, Gaël. As far as this PR goes, I mostly wanted to record some thoughts related to the implementation I hacked up, without suggesting it was the best solution.

You are worried about adding another method. I don't think this method needs to be public, nor does it frequently need a specialised implementation.

You are probably right that a lot of this can be done by clever memoization, with some additional benefits. But it does not handle warm starts that require particular ordering. In any case, it requires classes to annotate the parameters that should be used as keys, much as in this model. Still, I concur, it would be a much cleaner, simpler interface if Pipeline added a memoize option.
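
For what it's worth, a memoized pipeline could look roughly like the sketch below. Pipeline took no such argument when this was written, though later scikit-learn versions did add a memory parameter along these lines; the sketch uses that form purely to illustrate the idea.

```python
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import ElasticNet

# Transformer fits are cached, so a grid search over ElasticNet.alpha can
# reuse the scaled and selected data instead of recomputing it per candidate.
pipe = Pipeline(
    [('scale', StandardScaler()),
     ('select', SelectKBest(k=10)),
     ('enet', ElasticNet())],
    memory=Memory('/tmp/pipeline_cache', verbose=0),
)
```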

Finally, I appreciate -- and did so before your comment -- that I have tended dangerously towards complexity in a few instances, and I hope to learn to temper it. I have been interested in playing with some of the problems in the parameter search space, because they are broadly applicable to learning work; however, they don't fit neatly inside the Estimator API.

@jnothman
Member Author

PS: I don't think this would add sophistication for the end-user; none for the implementer of a new model, and marginally more (the addition of an attribute) if they wanted an estimator that didn't need to refit for some parameter changes.

@GaelVaroquaux
Member

I have been interested in playing with some of the problems in the
parameter search space, because they are broadly applicable to learning
work; however, they don't fit neatly inside the Estimator API.

I completely agree that parameter search is critically important. I just
think that we have to tread carefully when designing the scikit-learn
framework in this direction. What I would suggest is to prototype outside
of the scikit-learn codebase (but publicly), and to use the prototypes in
your work. I am sure that things will come out of it.

We are doing the same in my lab, but we have always found that designing
a specific estimator for a given problem was the best compromise. For
instance, I am really looking forward to a LogisticRegressionCV, and I
think that we may now know how to code it. More in a few days...

@GaelVaroquaux
Member

PS: I don't think this would add sophistication for the end-user; none for the
implementer of a new model, and marginally more (the addition of an attribute)
if they wanted an estimator that didn't need to refit for some parameter
changes.

But it would make the GridSearchCV code even more complex, and thus
harder for people to implement their own strategy.

@jnothman
Member Author

But it would make the GridSearchCV code even more complex, and thus harder for people to implement their own strategy.

Actually, the SearchCV code is hardly affected as long as no fussy parallelisation is needed. Alternative search strategies will end up treating a search as a sequence of smaller searches, in which case memoization is a far superior approach. To that extent, I think you're absolutely right that it's the approach that should be taken to this problem, and I'm tempted to close this PR as a false direction.

@jnothman
Member Author

(And I should point out that the main contribution I've made out of a real need in my work was my first, which still hasn't been merged... perhaps because it's too complex.)

@GaelVaroquaux
Member

and I'm tempted to close this PR as a false direction.

I don't think that it should be closed, but maybe tagged with a 'CLOSED'
tag, instead of a WIP tag, so that the code is still visible. It's always
useful to come back to a PR when brainstorming later down the line.

@jnothman
Member Author

But for that reason it is related to Andy's issue on the subject. It's always possible to come back to it, and to reopen. Thanks again for a great alternative idea.
