[MRG+1] EXA Adding cv indices example #11475

choldgraf · 2018-07-10T23:53:13Z

Hey all - in honor of scipy 2018 I figured I'd make good on my promise and add an example per the comments in #11362 :-)

Reference Issues/PRs

Closes #11362

What does this implement/fix?

Adds an example that demonstrates the train/test split behavior of several cross-validation iterators. (see #11362 for some discussion)
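
For readers new to these iterators, here is a minimal sketch of what "train/test split behavior" refers to: each object's `split` method yields one pair of index arrays per split. The toy data and the choice of three iterators are assumptions for illustration, not the code added in this PR.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit, TimeSeriesSplit

# Toy data (an assumption for illustration): 10 samples, 1 feature.
X = np.arange(10).reshape(-1, 1)

cvs = [KFold(n_splits=5),
       ShuffleSplit(n_splits=5, random_state=0),
       TimeSeriesSplit(n_splits=5)]

for cv in cvs:
    print(type(cv).__name__)
    for i, (train, test) in enumerate(cv.split(X)):
        # Each split is a pair of integer index arrays into X.
        print("  split %d: train=%s test=%s" % (i, train, test))

# Collect the KFold splits so they can be inspected afterwards.
kfold_splits = list(KFold(n_splits=5).split(X))
```

The example in the PR draws these index arrays as colored rows instead of printing them.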

Would love feedback on:

How the information is visually presented. I riffed off of @amueller 's examples here!
Any cross-validation objects I should add or subtract?

Example of what the final plot looks like:

choldgraf · 2018-07-11T00:19:25Z

One extra thought: Instead of putting all of these axes in a single figure, we could make a different figure for each one so it creates a different PNG for each cross-validation object. Then we could embed those images in the cross-validation section here: http://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold (one image per object).

wenhaoz-fengcai · 2018-07-11T00:27:40Z

This example is rendered here: https://27249-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_cv_indices.html

jnothman · 2018-07-11T03:01:01Z

Thanks! Firstly we need to call `labels` `groups`. Secondly, should we also consider stratification in these plots (assuming a binary problem 80:20 positive ratio)?
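
A rough sketch of the stratification point, assuming the binary 80:20 problem mentioned above with samples sorted by class (the data layout and the helper name `positive_ratios` are hypothetical): plain `KFold` produces test folds whose class balance depends on sample order, while `StratifiedKFold` keeps the 80:20 ratio in every test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed toy problem: 80 positive and 20 negative samples, sorted by class.
y = np.array([1] * 80 + [0] * 20)
X = np.zeros((len(y), 1))


def positive_ratios(cv):
    """Fraction of positive samples in each test fold (hypothetical helper)."""
    return [float(y[test].mean()) for _, test in cv.split(X, y)]


kfold_ratios = positive_ratios(KFold(n_splits=5))
stratified_ratios = positive_ratios(StratifiedKFold(n_splits=5))
print("KFold:          ", kfold_ratios)
print("StratifiedKFold:", stratified_ratios)
```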

choldgraf · 2018-07-11T04:41:48Z

I'm happy to add in some more CVs if it'd be useful (I agree a stratified example is a good idea)

(travis error is Flake8 I think)

qinhanmin2014 · 2018-07-14T12:47:08Z

Great example, I'll try to mark it as 0.20.
+1 to add stratified strategies.
Also, would it be better to reduce the number of classes from 5 to, e.g., 3?

choldgraf · 2018-07-14T15:28:41Z

Hey all - I just pushed another version that slightly modifies the old one - it adds 3 different types of data groupings (no grouping, even groups, and imbalanced groups) and modifies the viz just slightly. Let me know what you think!

glemaitre · 2018-07-14T15:29:01Z

> One extra thought: Instead of putting all of these axes in a single figure, we could make a different figure for each one so it creates a different PNG for each cross-validation object. Then we could embed those images in the cross-validation section here: http://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold (one image per object).

I agree with separate plotting to be able to include it inside the docstring.
We had exactly this thought with @ogrisel while going through the material of @amueller for the SciPy tutorial.

amueller · 2018-07-14T15:30:20Z

@choldgraf can you post an updated image?

I think this is a great example btw but as @jnothman said I think it's very important to distinguish labels and groups (and show both).
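
To make the labels-versus-groups distinction concrete, a small sketch under assumed toy data (the array layouts are hypothetical): `y` holds the class labels and drives stratification, while `groups` records which samples belong together and must not be split across train and test.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

y = np.tile([0, 1, 2], 8)             # class labels: what the model predicts
groups = np.repeat(np.arange(6), 4)   # group ids: e.g. which subject a sample came from
X = np.zeros((len(y), 1))

# GroupKFold keeps every group entirely in train or entirely in test,
# so no group id should appear on both sides of a split.
group_overlap = [np.intersect1d(groups[train], groups[test]).size
                 for train, test in GroupKFold(n_splits=3).split(X, y, groups)]

# StratifiedKFold ignores groups and balances the classes in each test fold.
class_counts = [np.bincount(y[test], minlength=3).tolist()
                for _, test in StratifiedKFold(n_splits=4).split(X, y)]

print(group_overlap)
print(class_counts)
```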

glemaitre · 2018-07-14T15:33:29Z

I would think that the tab10 colors would be great instead of the colormap.

amueller · 2018-07-14T15:33:54Z

given the usefulness in tutorials, I wonder if we should have the plotting routine in utils or utils.plot maybe? or plot? I wanted to have the 2d decision function in the plot module and it has similar use-cases. See #5070 and maybe #9173?

choldgraf · 2018-07-14T15:46:52Z

@amueller good point about the groups vs. the labels...that should be an easy update.

@glemaitre agree re: different colormap

re: putting a helper viz function into a module, is that something you'd like to see in this PR, or something you'd like incorporated into this PR once it's been merged in another PR?

(I'm happy to give it a go in this PR, just asking for clarification)

glemaitre · 2018-07-14T15:51:53Z

> re: putting a helper viz function into a module, is that something you'd like to see in this PR, or something you'd like incorporated into this PR once it's been merged in another PR?

Right now in another PR :)

amueller · 2018-07-14T15:51:58Z

Not sure re in a module. I guess we can always do that later. It's a bit controversial and I don't want it to block this PR.

glemaitre · 2018-07-14T15:53:59Z

The output of the documentation is available there

amueller · 2018-07-14T15:58:21Z

I don't understand the point of the first plot that shows there being no groups.
Also, I think that stratification is a much more common thing than groups and we should definitely include it.

choldgraf · 2018-07-15T17:30:48Z

Shall we remove the case where there are no groups and only have 2 cases: one with even groups and one with uneven groups? Alternatively, we could only use the one with uneven groups.

jnothman · 2018-07-16T12:51:06Z

No groups should be equivalent to every sample in its own group. So you should not portray it as a solid colour, but rather a rainbow. Or just don't portray it at all.
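
A sketch of that equivalence, assuming 12 toy samples: giving every sample its own group id makes `GroupKFold` partition the data the way an ungrouped k-fold would (each sample tested exactly once, equal fold sizes), though not necessarily into the same folds as `KFold`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_samples = 12
X = np.zeros((n_samples, 1))
# "No groups" expressed as one group per sample.
groups = np.arange(n_samples)

test_counts = np.zeros(n_samples, dtype=int)
fold_sizes = []
for train, test in GroupKFold(n_splits=4).split(X, groups=groups):
    test_counts[test] += 1       # how often each sample lands in a test fold
    fold_sizes.append(len(test))

print(fold_sizes)
print(test_counts)
```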

choldgraf · 2018-07-16T16:48:30Z

the more I think about it, the more I wonder if we should just have the "imbalanced groups" dataset in there for simplicity. It is a bit redundant with the "balanced groups" case, and users could assume that any CVs that don't take group affiliation into account would behave similarly for un-grouped data.

massich · 2018-07-16T19:59:17Z

examples/model_selection/plot_cv_indices.py

+
+cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5),
+       KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5),
+       TimeSeriesSplit(n_splits=5)]


I would change the order of cvs like this:

- cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5),
-        KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5),
-        TimeSeriesSplit(n_splits=5)]
+ cvs = [KFold(n_splits=5), GroupKFold(n_splits=5), ShuffleSplit(n_splits=5),
+        StratifiedKFold(n_splits=5), GroupShuffleSplit(n_splits=5),
+        TimeSeriesSplit(n_splits=5)]

So that it appears:

KFold

GroupKFold

ShuffleSplit

StratifiedKFold

GroupShuffleSplit

TimeSeriesSplit

massich

I agree with @choldgraf: I think the example should be done only with unbalanced data. That shortens the example a lot, and the concepts in "balanced groups" or "single group" are already present in the imbalanced case.

I would reorder the cvs see

Maybe we can also tweak the percentiles a bit to get a nicer GroupShuffleSplit plot, or tweak the random seed, or, better yet, comment on why the output looks the way it does.

choldgraf · 2018-07-16T21:38:06Z

OK latest push brings it down to a single dataset with both labels and groups visualized. It also makes each plot as a separate figure (rather than all in one figure) so they could be visualized separately elsewhere if desired.
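
A rough sketch of the "one figure per CV object" layout described above. The helper name `plot_cv_indices`, the toy data, and the colormap choices are assumptions for illustration and may differ from the code in the PR.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

rng = np.random.RandomState(0)
n_samples = 60
X = rng.randn(n_samples, 2)
y = np.repeat([0, 1, 2], [6, 18, 36])   # imbalanced classes (assumed sizes)
groups = np.repeat(np.arange(10), 6)    # 10 contiguous groups of 6 samples


def plot_cv_indices(cv, X, y, groups, ax):
    """Draw one row per split (0 = train, 1 = test), then class and group rows."""
    n_splits = 0
    for ii, (train, test) in enumerate(cv.split(X, y, groups)):
        membership = np.zeros(len(X))
        membership[test] = 1
        ax.scatter(range(len(X)), [ii + .5] * len(X), c=membership,
                   marker='_', lw=10, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)
        n_splits = ii + 1
    # Two extra rows show the class and the group of every sample.
    ax.scatter(range(len(X)), [n_splits + .5] * len(X), c=y,
               marker='_', lw=10, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [n_splits + 1.5] * len(X), c=groups,
               marker='_', lw=10, cmap=plt.cm.Paired)
    ax.set(yticks=np.arange(n_splits + 2) + .5,
           yticklabels=list(range(n_splits)) + ['class', 'group'],
           xlabel='Sample index', ylabel='CV iteration',
           ylim=[n_splits + 2.2, -.2], title=type(cv).__name__)
    return ax


# One figure per CV object, so each can be saved and embedded separately.
figures = []
for cv in [KFold(n_splits=4), GroupKFold(n_splits=4), StratifiedKFold(n_splits=4)]:
    fig, ax = plt.subplots(figsize=(6, 3))
    plot_cv_indices(cv, X, y, groups, ax)
    figures.append(fig)
```

Because every CV object gets its own `Figure`, each one can be written to its own image file and referenced individually from the documentation.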

jnothman · 2018-07-17T06:43:52Z

The example is failing with minimum dependencies:


  File "/home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/sphinx_gallery/gen_gallery.py", line 313, in sumarize_failing_examples
    "\n" + "-" * 79)
ValueError: Here is a summary of the problems encountered when running the examples

Unexpected failing examples:
/home/circleci/project/examples/model_selection/plot_cv_indices.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/model_selection/plot_cv_indices.py", line 21, in <module>
    cmap_data = plt.cm.tab10
AttributeError: 'module' object has no attribute 'tab10'

tab10 is new in matplotlib 2?

GaelVaroquaux · 2018-07-17T12:46:06Z

Overall, this looks really nice.

plt.cm.tab10 is not available in the old versions of matplotlib that we support. Can you use something like `Paired`?
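
One hypothetical way to tolerate both old and new matplotlib versions is to look the colormap up defensively. This is only a sketch of the idea; the thread does not show this approach being used in the PR.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Use tab10 when this matplotlib provides it, otherwise fall back to Paired.
cmap_data = getattr(plt.cm, "tab10", plt.cm.Paired)
print(cmap_data.name)
```

Picking a single colormap that exists in every supported version keeps the rendered example identical across environments, which is arguably simpler than a fallback.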

jnothman · 2018-07-17T21:02:31Z

We should probably add StratifiedShuffleSplit?

I also wonder whether it would look better without the groups repeated, and perhaps with fewer classes to stratify.

choldgraf · 2018-07-18T05:29:46Z

latest push implements the latest set of comments!

qinhanmin2014 · 2018-07-21T04:01:01Z

There're samples in class 0, the problem is that we can't see them in the plot.

So I think n_splits=3 might be a better choice for better visualization, though I'm fine with n_splits=4.
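
Whether a class is really present in each test set can be checked by counting instead of reading the plot. A small sketch, assuming three classes with hypothetical sizes 10, 30 and 60:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.repeat([0, 1, 2], [10, 30, 60])  # assumed class sizes
X = np.zeros((len(y), 1))

cv = StratifiedShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
# One row per split, one column per class: number of test samples.
counts = np.array([np.bincount(y[test], minlength=3)
                   for _, test in cv.split(X, y)])
print(counts)
```

If every entry is positive, a class that looks empty in the figure points to a rendering problem rather than to the split itself.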

amueller · 2018-07-21T04:45:43Z

Why would there be samples that we don't see? That's weird, right?

qinhanmin2014 · 2018-07-21T08:23:23Z

@amueller

> Why would there be samples that we don't see? That's weird, right?

We cannot see sample 0, see the example below

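import matplotlib.pyplot as plt

# Case 1: only sample 0 has a different value; this is the marker
# that cannot be seen in the resulting plot.
indices = [1] * 100
indices[0] = 0
plt.figure()
plt.scatter(range(len(indices)), [0] * len(indices),
            c=indices, marker='_', lw=10, vmin=-.2, vmax=1.2)
plt.xlim(0, 100)
plt.show()

# Case 2: only sample 1 has a different value, for comparison.
indices = [1] * 100
indices[1] = 0
plt.figure()
plt.scatter(range(len(indices)), [0] * len(indices),
            c=indices, marker='_', lw=10, vmin=-.2, vmax=1.2)
plt.xlim(0, 100)
plt.show()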

So choosing a specific random_state might be a solution.

choldgraf · 2018-07-21T16:01:20Z

ah I bet some of the scatter marks are slightly overlapping others, making them hidden. It's tricky to find the right combination of size/linewidth to make this work. If folks have suggestions I'd love to hear 'em!

re: random state, I am setting a seed here: https://github.com/scikit-learn/scikit-learn/pull/11475/files#diff-08f164de208553ac1d74c4b37667fd1aR20

do you mean re-setting this for each CV object?

qinhanmin2014 · 2018-07-22T01:41:29Z

> re: random state, I am setting a seed here: https://github.com/scikit-learn/scikit-learn/pull/11475/files#diff-08f164de208553ac1d74c4b37667fd1aR20

Oops, sorry. Then maybe choose another random state to keep sample 0 in the training set. I'm also fine if you have another way to solve the problem that the first split in the StratifiedShuffleSplit has no test samples in class 0.

jnothman · 2018-07-22T05:26:15Z

Can we change "labels" -> "class" and "groups" -> "group"?

jnothman · 2018-07-22T05:26:41Z

I also think it would be great to see these in model_selection.html

choldgraf · 2018-07-22T13:52:52Z

ok, latest push fixes @GaelVaroquaux 's comments and uses a better RNG so the stratified splits show at least 1 datapoint from each class.

I will get to the docs embedding next, but this will take a second since I need to get my sklearn docs build up-and-running.

qinhanmin2014

LGTM
FYI @choldgraf You can rely on Circle CI (with [doc build] if needed) to build the doc :)

choldgraf · 2018-07-22T17:46:02Z

the latest push plays around with how to insert this into the docs. Here's an example:

what do people think about this? I can copy/paste the text and choose the proper images for the other sections, but I wanna make sure folks are on board before I do a bunch of copypasting. Any suggestions?

qinhanmin2014 · 2018-07-23T01:01:48Z

I'm fine with it (except that I might prefer to put it at the end of the section)

choldgraf · 2018-07-23T01:21:19Z

I'm happy to move it to the end if that's what folks prefer.

jnothman · 2018-07-23T08:09:30Z

I don't think you need to say "with both group and multiple classes". I think it would be better to say "Note that KFold is not affected by classes or groups."
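
The suggested note can be checked directly. In this sketch (toy data and variable names are assumptions), `KFold` returns identical splits whether or not `y` and `groups` are passed:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(20, 2)
y = rng.randint(0, 3, size=20)       # arbitrary class labels
groups = rng.randint(0, 5, size=20)  # arbitrary group ids

cv = KFold(n_splits=4)
plain = list(cv.split(X))
with_y_groups = list(cv.split(X, y, groups))

# Compare train and test indices of corresponding splits.
same = all(np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
           for (a_train, a_test), (b_train, b_test) in zip(plain, with_y_groups))
print(same)
```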

choldgraf · 2018-07-23T17:28:40Z

@jnothman that sounds good - you're OK with its location within the page though?

jnothman

Yes, that location looks great

jona-sassenhagen · 2018-07-29T11:17:38Z

I like it, though I'm not sure about the color choice. It looks a bit low contrast. How about just using C0, C1?

GaelVaroquaux · 2018-07-30T07:13:49Z

@choldgraf: any chance that you can finish this PR? To me, the one important thing that remains to be done is adding the images to all relevant places in cross_validation.rst. Changing colors might be nice too, but it is not necessary for merge.

choldgraf · 2018-07-30T16:13:45Z

@GaelVaroquaux yep, my plan is to get to it this week. I was in NYC for a conference all of last week so I didn't have time for much "checking off to-dos" activity, only "create more to-dos for myself" activity :-P

choldgraf · 2018-07-30T21:46:23Z

latest push adds in a figure for each of the CV objects that we run. Let's see how it looks!

GaelVaroquaux · 2018-07-30T22:27:38Z

Looks really great:
https://30383-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

I wanted to merge, but unfortunately, the legend is cut on the outside. Any chance that you could do something about this? Thanks!

jnothman · 2018-07-30T22:40:20Z

+1

choldgraf · 2018-07-31T00:04:43Z

ok, I think I fixed it, but if this doesn't work I can try to downgrade to an older matplotlib version, maybe that's the problem. yay matplotlib
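
The thread does not say how the legend was made to fit. One common approach, sketched here under assumed labels, colors and placement, is to put the legend beside the axes and then shrink the axes so the legend stays inside the figure canvas:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(range(10))

# Legend placed to the right of the axes, outside the plotting area.
legend = ax.legend([Patch(color="red"), Patch(color="blue")],
                   ["Testing set", "Training set"], loc=(1.02, .8))

# Shrink the axes to 70% of the figure width so the legend is not cut off.
fig.subplots_adjust(right=.7)
```

Passing `bbox_inches="tight"` to `savefig` is another option when the image is saved by hand, but it only helps where the saving call is under the author's control.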

qinhanmin2014

LGTM, thanks @choldgraf

choldgraf · 2018-07-31T16:04:13Z

wooo, thanks!

adding cv indices example

9b3d660

qinhanmin2014 added this to the 0.20 milestone Jul 14, 2018

updating cv indices example for multiple data groups

577975a

massich reviewed Jul 16, 2018

massich suggested changes Jul 16, 2018

cv indices example down to one dataset

f6a3b4a

new colormap and fewer classes in example

5a1bd3b

better RNG and fixing labeling

c903d9f

Merge branch 'cv_indices_example' of https://github.com/choldgraf/sci…

a4b5848

…kit-learn into cv_indices_example

qinhanmin2014 approved these changes Jul 22, 2018

test adding cv indices images to docs

686f466

jnothman reviewed Jul 24, 2018

adding CV viz images to the docs

793ee4d

get the legend to fit

900566b

qinhanmin2014 approved these changes Jul 31, 2018

qinhanmin2014 merged commit 1641f31 into scikit-learn:master Jul 31, 2018
