
[MRG+3] ENH Restructure grid_scores_ into a dict of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. #6697


Merged: 5 commits merged into scikit-learn:master on Jun 16, 2016

Conversation

@raghavrv (Member) commented Apr 22, 2016

Overrides/Fixes #6686, #1034, #1787, #1768

Also related #1742, #1020

Tangentially related #1850, #1837, #2759

Also adds a name to all the scorers

  • _get_scorer_name(checked_scoring_function). Reverted based on consensus.
  • Test _get_scorer_name. Reverted based on consensus.
  • Restructure grid_scores_ into dict of (masked) numpy arrays, which can be easily imported into pandas.
  • Old tests pass
  • Modify old tests + New tests
  • _get_candidate_params() --> search.candidate_params_ --> search.results_['params']
  • Compute weighted std (for iid data) + test (see the sketch after this list)
  • Try vectorizing in bulk using reshape
  • Examples with nice plots? (not for this PR)
  • Doc + Whatsnew
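
A minimal sketch of the weighted-std computation referenced in the checklist above, assuming per-split score arrays and per-split test-set sizes; all names and values here are illustrative, not the merged implementation:

import numpy as np

# Scores of each candidate on each CV split: shape (n_splits, n_candidates).
split_scores = np.array([[0.80, 0.70, 0.80, 0.90],
                         [0.82, 0.50, 0.70, 0.74]])
# Number of test samples in each split (used as weights when iid=True).
n_test_samples = np.array([50, 48])

weights = n_test_samples / n_test_samples.sum()
mean_scores = np.average(split_scores, axis=0, weights=weights)
std_scores = np.sqrt(np.average((split_scores - mean_scores) ** 2,
                                axis=0, weights=weights))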

Reviews welcome!

The sandbox notebook
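
For a quick feel of the intended usage, converting such a results dict into a pandas DataFrame could look like the following sketch (values follow the docstring example discussed in this thread; the exact key names in the merged version may differ):

import numpy as np
import pandas as pd

# Toy dict in the proposed layout: masked arrays for parameters,
# plain arrays for per-split scores and their means.
search_results = {
    'kernel': np.ma.masked_array(['poly', 'poly', 'rbf', 'rbf'],
                                  mask=[False, False, False, False]),
    'gamma': np.ma.masked_array([0.0, 0.0, 0.1, 0.2],
                                mask=[True, True, False, False]),
    'degree': np.ma.masked_array([2.0, 3.0, 0.0, 0.0],
                                 mask=[False, False, True, True]),
    'accuracy_score_split_0': np.array([0.8, 0.7, 0.8, 0.9]),
    'accuracy_score_mean': np.array([0.81, 0.60, 0.75, 0.82]),
}

# Each key becomes a column; each row is one candidate parameter setting.
df = pd.DataFrame(search_results)
print(df)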

@amueller @vene @jnothman @agramfort @GaelVaroquaux @mblondel @hlin117 @ogrisel

@raghavrv force-pushed the multiple_metric_grid_search branch from 413404e to 12d8261 on April 22, 2016 17:55
@raghavrv changed the title from [WIP] ENH Restructure grid_scores_ into more efficient data structure + Simultaneous Multiple Metric Grid / Random Search / cross-validation to [WIP] ENH Restructure grid_scores_ into more efficient data structure on Apr 25, 2016
@raghavrv force-pushed the multiple_metric_grid_search branch 5 times, most recently from cad5e6c to e153cbb on April 27, 2016 18:29
@raghavrv (Member Author) commented Apr 27, 2016

Sorry for the delay! I was trying to tackle multiple metric support along with this, and later realized the Scorer interface discussion needs to take place before that!

So I've just restructured grid_scores_ into search_results_, which is a dict of numpy masked arrays for the parameters and regular arrays for the scores, means and ranks.

* ``mean_validation_score``, the mean score over the cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold

search_results_ : dict of numpy (masked) ndarrays
Contributor:

By renaming grid_scores_ to search_results_, you're changing the API, right? You probably don't want to do that.

Member:

The model_selection module has not been publicly released, so there's an argument to be made that this is fair game.

Member:

Rather, this sort of change has been proposed and sought by the core devs for years; the fact that model_selection has not been released maybe makes this the best time to do it. However, I think we must treat this as a deprecation case, not just drop grid_scores_ altogether. That's because users will initially not rewrite all their old code but will change some imports, and they will be much less angry if they get a DeprecationWarning at the end of a long fit rather than an AttributeError.

@raghavrv (Member Author):

Hmm... We plan on adding multiple metric support subsequently. How would grid_scores_ look for that use case? Do we raise an error when grid_scores_ is accessed while multiple metrics are requested? Or do we return the first metric alone? Or all the metrics as a dict of lists (of _CVScoreTuples)?

Also, the old grid_search module is available for all the angry users who can tolerate a DeprecationWarning. My humble opinion is that trying to support old code here will constrain us, not now but in subsequent PRs.

Member:

I think @jnothman is right, especially because this is likely to be an issue at the end of a long fit. I guess grid_scores_ should raise an exception if (when) multiple metrics are used. They're not "angry users"; if anything, they will be users who embark on the process of updating their projects to keep up with scikit-learn changes. We want to make it smooth, such that they can stop in the middle of the rewriting effort if they have to, and things will still work, right?

Maybe in such a case we could use a shorter deprecation cycle, if you think that would help keep things cleaner.

@raghavrv (Member Author):

users who embark on the process of updating their projects to keep up with scikit-learn changes.

Okay, with +3 I'm adding grid_scores_ back as a property function. Also, I was wondering if we should store them like we did before, as computing them from search_results_ (especially the params) is not very clean.
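
A rough sketch of what such a backward-compatible property could look like, rebuilding the old (params, mean_score, split_scores) tuples from the new dict; the class, attribute and key names here are assumptions made for illustration, not the merged implementation:

import numpy as np
from warnings import warn

class SearchResultsDemo:
    """Toy stand-in for a *SearchCV estimator holding a results dict."""

    def __init__(self, results):
        self.results_ = results

    @property
    def grid_scores_(self):
        # Emit the deprecation warning lazily, only when the old API is used.
        warn("grid_scores_ is deprecated, use results_ instead.",
             DeprecationWarning)
        res = self.results_
        split_keys = sorted(k for k in res
                            if k.startswith('test_split') and k.endswith('_score'))
        # One (params, mean_score, per-split scores) entry per candidate,
        # mirroring the old list-of-tuples layout.
        return [(params, mean, np.array([res[k][i] for k in split_keys]))
                for i, (params, mean) in enumerate(
                    zip(res['params'], res['test_mean_score']))]

demo = SearchResultsDemo({
    'params': [{'C': 1}, {'C': 10}],
    'test_split0_score': np.array([0.80, 0.90]),
    'test_split1_score': np.array([0.70, 0.85]),
    'test_mean_score': np.array([0.75, 0.875]),
})
print(demo.grid_scores_)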

Member:

It only has to work for cases formerly supported. Either way of producing them is fine.

@MechCoder (Member):

Ping again when this is ready for review.

mask = [ True True False False]...),
'degree' : masked_array(data = [2.0 3.0 -- --],
mask = [False False True True]...),
'accuracy_score_split_0' : [0.8, 0.7, 0.8, 0.9],
Member:

It is annoying that we are pandering to pandas (ha!) to not allow a 2d array here :(

@raghavrv (Member Author) commented Apr 28, 2016:

Indeed! The other option is to store the scores (only the scores, not the parameters) in compact 2D numpy float arrays and have a sorted list of column headers (the dict keys of search_results_).

Contributor:

I think for some functions in networkx, they allow you to return pandas dataframes when possible by flipping a return_pandas parameter to True.

Member:

Well, I suppose the most idiomatic Pandas representation is using a MultiIndex. But I don't think it's necessarily our job to produce that.
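
For what it's worth, a user could build such a MultiIndex from the flat column names themselves; a small sketch using the split-column naming from the example in this thread:

import pandas as pd

# Flat per-split score columns as they would appear in the DataFrame.
df = pd.DataFrame({
    'accuracy_score_split_0': [0.80, 0.70, 0.80, 0.90],
    'accuracy_score_split_1': [0.82, 0.65, 0.70, 0.80],
})
# Split each name into a (metric, split) pair and use it as a two-level
# column index.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.rsplit('_split_', 1)) for name in df.columns],
    names=['metric', 'split'])
print(df)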

Member:

Not that I'm a proficient enough Pandas user to really say what is idiomatic or useful.

Member:

To be practical, what I am a little concerned about is: how easy is it to calculate the standard deviations of scores across k-fold CV? It's easy in the current storage; how easy is it in the proposed? How easy is it in the Pandas rep?

@hlin117 (Contributor) commented May 1, 2016:

@jnothman Assuming that the folds in question are named accuracy_score_split_0, ..., accuracy_score_split_(k-1), and the pandas dataframe with the scores is called search_results_, it's as easy as

fold_scores = search_results_.select(lambda name: name.startswith("accuracy_score_split_"), axis=1)

# Placing data into the dataframe
search_results_["std"] = fold_scores.std(axis=1)
search_results_["mean"] = fold_scores.mean(axis=1)

Member:

I guess that's not terrible...

On 1 May 2016 at 16:29, Henry Lin wrote, quoting the proposed docstring example in sklearn/model_selection/_search.py (#6697 (comment)):

    kernel | gamma | degree | accuracy_score_split_0 ... | accuracy_score_mean ...
    =============================================================================
    'poly' |   -   |   2    |            0.8             |          0.81
    'poly' |   -   |   3    |            0.7             |          0.60
    'rbf'  |  0.1  |   -    |            0.8             |          0.75
    'rbf'  |  0.2  |   -    |            0.9             |          0.82

    will be represented by a search_results_ dict of:

    {'kernel' : masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                             mask = [False False False False]...),
     'gamma'  : masked_array(data = [-- -- 0.1 0.2],
                             mask = [ True  True False False]...),
     'degree' : masked_array(data = [2.0 3.0 -- --],
                             mask = [False False  True  True]...),
     'accuracy_score_split_0' : [0.8, 0.7, 0.8, 0.9],

@raghavrv (Member Author) commented May 1, 2016:

Do you think get_candidates(...) addresses this problem?

@jnothman (Member):

I think, @rvraghav93, you would've benefited from writing this in a more test-driven way. Indeed, we would be able to critique your expected format before you write the implementation! Please get that test written up so I don't need to keep referring to an outdated example, when I know whether or not the test passes.

@jnothman (Member):

This is otherwise going in the right direction and I look forward to its acceptance!

I think fixing #1020 directly here (visualising and marginalising over parameters) is out of scope, though you're right that enabling Pandas access allows users to groupby and visualise.

I also think #1742 (adding training scores/times) should probably remain out of scope.

@raghavrv (Member Author):

Thanks heaps @jnothman, @hlin117 and @vene for the review!! I'll clean up a bit and add the tests soon.

@vene (Member) commented Apr 28, 2016

Sure, but I was just weighing in. I'll do my best to squeeze in an actual review.

@raghavrv (Member Author) commented May 1, 2016

OK, I've re-added grid_scores_ as a property function.

I've also added a get_candidates(<int>/<list of ints>/None (for all candidates)) method that returns a (list of) dict(s) of candidate parameters and scores, to help users perform row-wise operations.

Not sure if this is helpful or superfluous. I just went ahead and implemented it. We could remove it if you feel it is superfluous and confuses search_results_.

Refer to the 2nd cell of the sandbox notebook for sample usage. I'll clean up the docs and add tests soon.

@jnothman (Member) commented May 1, 2016

I don't think get_candidates is helpful. I see it provides:

  • indexing, but this is far from exciting enough to justify multiple access paradigms
  • parameters as a dict, but this can be done within search_results_ too
  • fold scores as an array, but this could be done by providing fold_search_results_ or similar with one entry per fold rather than one entry per parameter setting; it does not require an entirely different data structure

Separately, I suggest we don't delimit everything in column names with underscores. We have no constraint that these be valid Python identifiers. Go for broke with colons, for instance.

@jnothman (Member) commented May 1, 2016

Though I guess Pandas has special attribute accessors for Python identifier names... Maybe I'm wrong about the colons, but I dislike the long underscored names aesthetically and in terms of the potential for naming conflicts...

@jnothman (Member) commented May 2, 2016

Having said that #1742 is a separate concern, I remember that when I was designing an equivalent solution, I realised that our results need to distinguish between train scores and test scores. Is it overkill to already label columns "test_score" rather than "score"?

"""Generate the metric name given the scoring parameter"""
if callable(scoring):
if scoring.__name__ == "_passthrough_scorer":
return "estimator_default_scorer"
Member:

if "_score" is being appended, isn't "estimator" sufficient? i.e. "estimator_score"...?

@raghavrv (Member Author):

All done.

I think the CircleCI failure is unrelated. Could you confirm, @ogrisel?

If so, merge? @jnothman @MechCoder @agramfort

@jnothman (Member):

+1 for merge.

@raghavrv force-pushed the multiple_metric_grid_search branch from 18ac6a1 to a3b1eb9 on June 15, 2016 22:54
@raghavrv (Member Author):

Don't merge yet

@jnothman (Member):

I meant that as: I will not merge because there may yet be minor issues, but I'm generally satisfied with this.

@raghavrv force-pushed the multiple_metric_grid_search branch from a3b1eb9 to 18ac6a1 on June 15, 2016 23:01
@raghavrv (Member Author) commented Jun 15, 2016

Sorry for the short, uninformative comment. I had pushed a superfluous commit (unstashed from my multiple metric search work) by mistake and commented so to avoid an accidental merge.

I just used reflog to fix it. Now it's all good. Thanks heaps for the reviews!!

@raghavrv (Member Author):

Interesting... Force-pushing a previous version (restored via reflog) that was already tested here does not trigger a CI rebuild. Cool!

@MechCoder (Member):

merge?

@raghavrv (Member Author):

Yea... All good from my side!!

@jnothman merged commit afd5d18 into scikit-learn:master on Jun 16, 2016
@jnothman (Member):

Well, well. Congratulations and thank you, @raghavrv!

@MechCoder (Member):

Yay!! Congratulations and well done 👍 🍷 🍷

Or as they say over here, மகிழ்ச்சி ("joy")

@raghavrv deleted the multiple_metric_grid_search branch on June 16, 2016 07:39
@TomDLT (Member) commented Jun 16, 2016

🍻

@amueller (Member):

Awesome! Great job!

for params, mean_score, scores in clf.grid_scores_:
means = clf.results_['test_mean_score']
stds = clf.results_['test_std_score']
for i in range(len(clf.results_['params'])):
Member:

why not zip?

@raghavrv (Member Author) commented Jun 16, 2016:

I refactored this from the now-removed _get_candidate_scores. zip would indeed have been a better choice!
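
For reference, a sketch of the zip-based variant suggested above (key names follow the diff under review at the time; the attribute was later renamed, so treat this as illustrative rather than runnable against current releases):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = GridSearchCV(SVC(), {'C': [1, 10]}, cv=3).fit(X, y)

# Iterate over candidates with zip instead of indexing into each array.
for mean, std, params in zip(clf.results_['test_mean_score'],
                             clf.results_['test_std_score'],
                             clf.results_['params']):
    print("%0.3f (+/- %0.3f) for %r" % (mean, std * 2, params))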

Member:

Because the reviewers got lazy towards the end :)

@amueller (Member) commented Jun 16, 2016

Sorry for coming late to the party, but why was results_ introduced instead of using grid_scores_? We just moved to a different module anyhow. Why should we have any deprecations in a new module?

@raghavrv (Member Author) commented Jun 16, 2016

Thanks everyone! :)

Why should we have any deprecations in a new module?

#6697 (comment)

The name results_ was chosen instead of grid_scores_ because:

  • it applies generically to GridSearchCV as well as RandomizedSearchCV and any other non-grid-based *SearchCV that may be added in the future;
  • results_ will store more than scores (very soon times/n_test_samples etc. too).

@amueller (Member) commented Jun 16, 2016

Thanks. Then maybe remove grid_scores_ instead of having a deprecated one?

OK, saw #6697 (comment) by @jnothman.
I guess it's not that much of a pain, so let's keep it.

@raghavrv changed the title from [MRG+3] ENH Restructure grid_scores_ into a hashmap of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. to [MRG+3] ENH Restructure grid_scores_ into a dict of 1D (numpy) (masked) arrays that can be imported into pandas as a DataFrame. on Aug 18, 2016
@amueller (Member) commented Sep 6, 2016

the docstring doesn't render nicely :-/

@jnothman (Member) commented Sep 6, 2016

@amueller (Member) commented Sep 8, 2016

True, it renders OK, though the rank is rendered as a link, and it throws a bunch of warnings when building.
