-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] EXA Adding cv indices example #11475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] EXA Adding cv indices example #11475
Conversation
One extra thought: Instead of putting all of these axes in a single figure, we could make a different figure for each one so it creates a different PNG for each cross-validation object. Then we could embed those images in the cross-validation section here: http://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold (one image per object). |
This example is rendered here: https://27249-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_cv_indices.html |
Thanks! Firstly we need to call `labels` `groups`. Secondly, should we also
consider stratification in these plots (assuming a binary problem 80:20
positive ratio)?
|
I'm happy to add in some more CVs if it'd be useful (I agree a stratified example is a good idea) (travis error is Flake8 I think) |
Great example, I'll try to mark it as 0.20. |
Hey all - I just pushed another version that slightly modifies the old one - it adds 3 different types of data groupings (no grouping, even groups, and imbalanced groups) and modifies the viz just slightly. Let me know what you think! |
I agree with separate plotting to be able to include it inside the docstring. |
@choldgraf can you post an updated image? I think this is a great example btw but as @jnothman said I think it's very important to distinguish labels and groups (and show both). |
I would think that the tab10 colors would be great instead of the colormap. |
@amueller good point about the groups vs. the labels...that should be an easy update. @glemaitre agree re: different colormap re: putting a helper viz function into a module, is that something you'd like to see in this PR, or something you'd like incorporated into this PR once it's been merged in another PR? (I'm happy to give it a go in this PR, just asking for clarification) |
Right now in another PR :) |
Not sure re in a module. I guess we can always do that later. It's a bit controversial and I don't want it to block this PR. |
The output of the documentation is available there |
I don't understand the point of the first plot that shows there being no groups. |
Shall we remove the case that there are no groups and only have 2 cases: one with even groups one with uneven groups? Alternatively, we could only use the one with uneven groups. |
No groups should be equivalent to every sample in its own group. So you should not portray it as a solid colour, but rather a rainbow. Or just don't portray it at all. |
the more I think about it, the more I wonder if we should just have the "imbalanced groups" dataset in there for simplicity. It is a bit redundant with the "balanced groups" case, and users could assume that any CVs that don't take group affiliation into account would behave similarly for un-grouped data. |
|
||
cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5), | ||
KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5), | ||
TimeSeriesSplit(n_splits=5)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the order of cvs
like this:
- cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5),
- KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5),
- TimeSeriesSplit(n_splits=5)]
+ cvs = [KFold(n_splits=5), GroupKFold(n_splits=5), ShuffleSplit(n_splits=5),
+ StratifiedKFold(n_splits=5), GroupShuffleSplit(n_splits=5),
+ TimeSeriesSplit(n_splits=5)]
So that it apears:
- KFold
- GroupKFold
- ShuffleSplit
- StratifiedKFold
- GroupShuffleSplit
- TimeSeriesSplit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @choldgraf I think that the example should be done only with unbalanced data. It reduces a lot the example and the concepts in "balanced groups" or "single group" are already present in imbalanced.
I would reorder the cvs
see
maybe we can also tweak a bit percentiles
to get a nicer groupshufflesplit or tweak the random seed, or better yet, comment why the output is like it is.
OK latest push brings it down to a single dataset with both labels and groups visualized. It also makes each plot as a separate figure (rather than all in one figure) so they could be visualized separately elsewhere if desired. |
The example is failing with minimum dependencies:
tab10 is new in matplotlib 2? |
Overall, this looks really nice. plt.cm.tab10 is not available in the old versions of matplotlib that we support. Can you use something like paired. |
We should probably add StratifiedShuffleSplit? I also wonder whether it would look better without the groups repeated, and perhaps with fewer classes to stratify. |
latest push implements the latest set of comments! |
So I think n_splits=3 might be a better choice for better visualization, though I'm fine with n_splits=4. |
Why would there be samples that we don't see? That's weird, right? |
We cannot see sample 0, see the example below
So choose a specific random_state might be a solution. |
ah I bet some of the scatter marks are slightly overlapping others, making them hidden. It's tricky to find the right combination of size/linewidth to make this work. If folks have suggestions I'd love to hear 'em! re: random state, I am setting a seed here: https://github.com/scikit-learn/scikit-learn/pull/11475/files#diff-08f164de208553ac1d74c4b37667fd1aR20 do you mean re-setting this for each CV object? |
Oops, sorry. Then maybe choose another random state to keep sample 0 in training set. Or I'm also fine if you have other ways to solve the problem that first split in the StratifiedShuffleSplit has no test samples in class 0. |
Can we change "labels" -> "class" and "groups" -> "group"? |
I also think it would be great to see these in model_selection.html |
ok, latest push fixes @GaelVaroquaux 's comments and uses a better RNG so the stratified splits show at least 1 datapoint from each class. I will get to the docs embedding next, but this will take a second since I need to get my sklearn docs build up-and-running. |
…kit-learn into cv_indices_example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
FYI @choldgraf You can rely on Circle CI (with [doc build] if needed) to build the doc :)
I'm fine with it (except that I might prefer to put it at the end of the section) |
I'm happy to move it to the end if that's what folks prefer. |
I don't think you need to say "with both group and multiple classes". I
think it would be better to say "Note that KFold is not affected by classes
or groups."
…On 23 July 2018 at 11:21, Chris Holdgraf ***@***.***> wrote:
I'm happy to move it to the end if that's what folks prefer.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11475 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6yAjATqeG6p2qodIW562Nfae4ujZks5uJSURgaJpZM4VKSXc>
.
|
@jnothman that sounds good - you're OK with its location with the page though? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that location looks great
I like it, though I'm not sure about the color choice. It looks a bit low contrast. How about just using C0, C1? |
@choldgraf: any chance that you can finish this PR? To me, to one important thing that remains to be done is adding the images to all relevant places in cross_validation.rst. Changing colors might be nice too, but they are not necessary for merge. |
@GaelVaroquaux yep, my plan is to get to it this week. I was in NYC for a conference all of last week so I didn't have time for much "checking off to-dos" activity, only "create more to-dos for myself" activity :-P |
latest push adds in a figure for each of the CV objects that we run. Let's see how it looks! |
Looks really great: I wanted to merge, but unfortunately, the legend is cut on the outside. Any chance that you could do something about this? Thanks! |
+1
|
ok, I think I fixed it, but if this doesn't work I can try to downgrade to an older matplotlib version, maybe that's the problem. yay matplotlib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @choldgraf
wooo, thanks! |
Hey all - in honor of scipy 2018 I figured I'd make good on my promise and add an example per the comments in #11362 :-)
Reference Issues/PRs
Closes #11362
What does this implement/fix?
Adds an example that demonstrates the train/test split behavior of several cross-validation iterators. (see #11362 for some discussion)
Would love feedback on:
Example of what the final plot looks like: