Skip to content

[MRG+1] EXA Adding cv indices example #11475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 31, 2018

Conversation

choldgraf
Copy link
Contributor

Hey all - in honor of scipy 2018 I figured I'd make good on my promise and add an example per the comments in #11362 :-)

Reference Issues/PRs

Closes #11362

What does this implement/fix?

Adds an example that demonstrates the train/test split behavior of several cross-validation iterators. (see #11362 for some discussion)

Would love feedback on:

  • How the information is visually presented. I riffed off of @amueller 's examples here!
  • Any cross-validation objects I should add or subtract?

Example of what the final plot looks like:

indices

@choldgraf
Copy link
Contributor Author

One extra thought: Instead of putting all of these axes in a single figure, we could make a different figure for each one so it creates a different PNG for each cross-validation object. Then we could embed those images in the cross-validation section here: http://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold (one image per object).

@wenhaoz-fengcai
Copy link

@jnothman
Copy link
Member

jnothman commented Jul 11, 2018 via email

@choldgraf
Copy link
Contributor Author

I'm happy to add in some more CVs if it'd be useful (I agree a stratified example is a good idea)

(travis error is Flake8 I think)

@qinhanmin2014
Copy link
Member

Great example, I'll try to mark it as 0.20.
+1 to add stratified strategies.
Also, will it be better to reduce number of classes from 5 to e.g., 3?

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 14, 2018
@choldgraf
Copy link
Contributor Author

Hey all - I just pushed another version that slightly modifies the old one - it adds 3 different types of data groupings (no grouping, even groups, and imbalanced groups) and modifies the viz just slightly. Let me know what you think!

@glemaitre
Copy link
Member

One extra thought: Instead of putting all of these axes in a single figure, we could make a different figure for each one so it creates a different PNG for each cross-validation object. Then we could embed those images in the cross-validation section here: http://scikit-learn.org/stable/modules/cross_validation.html#repeated-k-fold (one image per object).

I agree with separate plotting to be able to include it inside the docstring.
We add exactly this thoughts with @ogrisel while going through the material of @amueller for the SciPy tutorial.

@amueller
Copy link
Member

@choldgraf can you post an updated image?

I think this is a great example btw but as @jnothman said I think it's very important to distinguish labels and groups (and show both).

@glemaitre
Copy link
Member

I would think that the tab10 colors would be great instead of the colormap.

@amueller
Copy link
Member

given the usefulness in tutorials, I wonder if we should have the plotting routine in utils or utils.plot maybe? or plot? I wanted to have the 2d decision function in the plot module and it has similar use-cases. See #5070 and maybe #9173?

@choldgraf
Copy link
Contributor Author

choldgraf commented Jul 14, 2018

@amueller good point about the groups vs. the labels...that should be an easy update.

@glemaitre agree re: different colormap

re: putting a helper viz function into a module, is that something you'd like to see in this PR, or something you'd like incorporated into this PR once it's been merged in another PR?

(I'm happy to give it a go in this PR, just asking for clarification)

@glemaitre
Copy link
Member

re: putting a helper viz function into a module, is that something you'd like to see in this PR, or something you'd like incorporated into this PR once it's been merged in another PR?

Right now in another PR :)

@amueller
Copy link
Member

Not sure re in a module. I guess we can always do that later. It's a bit controversial and I don't want it to block this PR.

@glemaitre
Copy link
Member

The output of the documentation is available there

@amueller
Copy link
Member

I don't understand the point of the first plot that shows there being no groups.
Also, I think that stratification is a much more common thing than groups and we should definitely include it.

@choldgraf
Copy link
Contributor Author

Shall we remove the case that there are no groups and only have 2 cases: one with even groups one with uneven groups? Alternatively, we could only use the one with uneven groups.

@jnothman
Copy link
Member

No groups should be equivalent to every sample in its own group. So you should not portray it as a solid colour, but rather a rainbow. Or just don't portray it at all.

@choldgraf
Copy link
Contributor Author

the more I think about it, the more I wonder if we should just have the "imbalanced groups" dataset in there for simplicity. It is a bit redundant with the "balanced groups" case, and users could assume that any CVs that don't take group affiliation into account would behave similarly for un-grouped data.


cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5),
KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5),
TimeSeriesSplit(n_splits=5)]
Copy link
Contributor

@massich massich Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the order of cvs like this:

- cvs = [ShuffleSplit(n_splits=5), GroupShuffleSplit(n_splits=5),
-        KFold(n_splits=5), GroupKFold(n_splits=5), StratifiedKFold(n_splits=5),
-        TimeSeriesSplit(n_splits=5)]
+ cvs = [KFold(n_splits=5), GroupKFold(n_splits=5), ShuffleSplit(n_splits=5),
+        StratifiedKFold(n_splits=5), GroupShuffleSplit(n_splits=5),
+        TimeSeriesSplit(n_splits=5)]

So that it apears:

  1. KFold
  2. GroupKFold
  3. ShuffleSplit
  4. StratifiedKFold
  5. GroupShuffleSplit
  6. TimeSeriesSplit

Copy link
Contributor

@massich massich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @choldgraf I think that the example should be done only with unbalanced data. It reduces a lot the example and the concepts in "balanced groups" or "single group" are already present in imbalanced.

I would reorder the cvs see

maybe we can also tweak a bit percentiles to get a nicer groupshufflesplit or tweak the random seed, or better yet, comment why the output is like it is.

@choldgraf
Copy link
Contributor Author

OK latest push brings it down to a single dataset with both labels and groups visualized. It also makes each plot as a separate figure (rather than all in one figure) so they could be visualized separately elsewhere if desired.

@jnothman
Copy link
Member

The example is failing with minimum dependencies:


  File "/home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/sphinx_gallery/gen_gallery.py", line 313, in sumarize_failing_examples
    "\n" + "-" * 79)
ValueError: Here is a summary of the problems encountered when running the examples

Unexpected failing examples:
/home/circleci/project/examples/model_selection/plot_cv_indices.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/model_selection/plot_cv_indices.py", line 21, in <module>
    cmap_data = plt.cm.tab10
AttributeError: 'module' object has no attribute 'tab10'

tab10 is new in matplotlib 2?

@GaelVaroquaux
Copy link
Member

Overall, this looks really nice.

plt.cm.tab10 is not available in the old versions of matplotlib that we support. Can you use something like paired.

@jnothman
Copy link
Member

We should probably add StratifiedShuffleSplit?

I also wonder whether it would look better without the groups repeated, and perhaps with fewer classes to stratify.

@choldgraf
Copy link
Contributor Author

latest push implements the latest set of comments!

@qinhanmin2014
Copy link
Member

There're samples in class 0, the problem is that we can't see them in the plot.

So I think n_splits=3 might be a better choice for better visualization, though I'm fine with n_splits=4.

@amueller
Copy link
Member

Why would there be samples that we don't see? That's weird, right?

@qinhanmin2014
Copy link
Member

@amueller

Why would there be samples that we don't see? That's weird, right?

We cannot see sample 0, see the example below

indices = [1] * 100
indices[0] = 0
plt.figure()
plt.scatter(range(len(indices)), [0] * len(indices),
            c=indices, marker='_', lw=10, vmin=-.2, vmax=1.2)
plt.xlim(0, 100)
plt.show()

1

indices = [1] * 100
indices[1] = 0
plt.figure()
plt.scatter(range(len(indices)), [0] * len(indices),
            c=indices, marker='_', lw=10, vmin=-.2, vmax=1.2)
plt.xlim(0, 100)
plt.show()

2

So choose a specific random_state might be a solution.

@choldgraf
Copy link
Contributor Author

ah I bet some of the scatter marks are slightly overlapping others, making them hidden. It's tricky to find the right combination of size/linewidth to make this work. If folks have suggestions I'd love to hear 'em!

re: random state, I am setting a seed here: https://github.com/scikit-learn/scikit-learn/pull/11475/files#diff-08f164de208553ac1d74c4b37667fd1aR20

do you mean re-setting this for each CV object?

@qinhanmin2014
Copy link
Member

re: random state, I am setting a seed here: https://github.com/scikit-learn/scikit-learn/pull/11475/files#diff-08f164de208553ac1d74c4b37667fd1aR20

Oops, sorry. Then maybe choose another random state to keep sample 0 in training set. Or I'm also fine if you have other ways to solve the problem that first split in the StratifiedShuffleSplit has no test samples in class 0.

@jnothman
Copy link
Member

Can we change "labels" -> "class" and "groups" -> "group"?

@jnothman
Copy link
Member

I also think it would be great to see these in model_selection.html

@choldgraf
Copy link
Contributor Author

ok, latest push fixes @GaelVaroquaux 's comments and uses a better RNG so the stratified splits show at least 1 datapoint from each class.

I will get to the docs embedding next, but this will take a second since I need to get my sklearn docs build up-and-running.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
FYI @choldgraf You can rely on Circle CI (with [doc build] if needed) to build the doc :)

@choldgraf
Copy link
Contributor Author

the latest push plays around with how to insert this into the docs. Here's an example:

image

what do people think about this? I can copy/paste the text and choose the proper images for the other sections, but I wanna make sure folks are on board before I do a bunch of copypasting. Any suggestions?

@qinhanmin2014
Copy link
Member

I'm fine with it (except that I might prefer to put it at the end of the section)

@choldgraf
Copy link
Contributor Author

I'm happy to move it to the end if that's what folks prefer.

@jnothman
Copy link
Member

jnothman commented Jul 23, 2018 via email

@choldgraf
Copy link
Contributor Author

@jnothman that sounds good - you're OK with its location with the page though?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that location looks great

@jona-sassenhagen
Copy link
Contributor

I like it, though I'm not sure about the color choice. It looks a bit low contrast. How about just using C0, C1?

@GaelVaroquaux
Copy link
Member

@choldgraf: any chance that you can finish this PR? To me, to one important thing that remains to be done is adding the images to all relevant places in cross_validation.rst. Changing colors might be nice too, but they are not necessary for merge.

@choldgraf
Copy link
Contributor Author

@GaelVaroquaux yep, my plan is to get to it this week. I was in NYC for a conference all of last week so I didn't have time for much "checking off to-dos" activity, only "create more to-dos for myself" activity :-P

@choldgraf
Copy link
Contributor Author

latest push adds in a figure for each of the CV objects that we run. Let's see how it looks!

@GaelVaroquaux
Copy link
Member

Looks really great:
https://30383-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

I wanted to merge, but unfortunately, the legend is cut on the outside. Any chance that you could do something about this? Thanks!

@jnothman
Copy link
Member

jnothman commented Jul 30, 2018 via email

@choldgraf
Copy link
Contributor Author

ok, I think I fixed it, but if this doesn't work I can try to downgrade to an older matplotlib version, maybe that's the problem. yay matplotlib

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @choldgraf

@qinhanmin2014 qinhanmin2014 merged commit 1641f31 into scikit-learn:master Jul 31, 2018
@choldgraf
Copy link
Contributor Author

wooo, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC proposal: visualizing cross-validation iterators
9 participants