-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Data independent CV and model_selection module #3340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Data independent CV and model_selection module #3340
Conversation
Thanks, @pignacio. I can't look right now, but would like to note that we designate PRs with [WIP] in their title as "Work in progress (comments welcome)", as opposed to [MRG]("Ready to review for merge") and [MRG+1]("One reviewer has approved for merge")... So I'm popping [WIP] in front of this PR's title. |
I don't think I get the need for
If |
Still not looking in detail, but I think we want to remove data-dependent parameters (e.g. |
|
||
from sklearn.model_selection.search import GridSearchCV, RandomizedSearchCV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use relative imports
I note that there will also be a raft of changes to the documentation |
You are right.
Right now I'm using defaults so that the same class can be used both ways: for train, test in KFold(n, n_folds=3):
print train, test
for train, test in KFold(n_folds=3).split(target):
print train, test Once the backwards compatibility is not needed anymore, we should be able to remove those parameters. Also,
Yes, of course, also tests. Once we get the nasty details of this right, I'll get into that
Sorry about that, I never intended to tag this like ready to merge or anything. |
No problem, just letting you know why it changed. |
So, presumably, we should be raising a I am worried this will lay too many usability traps, and I think @GaelVaroquaux was hoping that in renaming the path to this module, we could keep two separate copies of the functionality while deprecating, and not worry about backwards compatibility in the new one. Similarly, the new version could ignore any handling of the deprecated |
@@ -388,7 +389,7 @@ def graph_lasso_path(X, alphas, cov_init=None, X_test=None, mode='cd', | |||
|
|||
|
|||
class GraphLassoCV(GraphLasso): | |||
"""Sparse inverse covariance w/ cross-validated choice of the l1 penalty | |||
"""Sparse inverse covariance w/ cross-from .partition import _check_cvvalidated choice of the l1 penalty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you have a spurious line here.
Yes, this is exactly what I had in mind. It would indeed probably be a bit simpler, although it would temporarily lead to duplicated code. That said, this approach is also very interesting. I need to think it over a bit more. I must say that in terms of names, I am not terribly happy with importing the cross_validation objects from model_selection.partition, as they are core objects and it makes them hard to discover. We should import them directly from model_selection (and thus put them in the init of the module). Thanks a lot @pignacio for tackling this issue. It's a very important one, and you are doing it in a very elegant way. |
from .externals.six import with_metaclass | ||
from .externals.six.moves import zip | ||
from .metrics.scorer import check_scoring | ||
from .model_selection.partition import LeaveOneOut, LeavePOut, KFold, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split this into separate lines, rather than continuation? i.e.
from .model_selection.partition import LeaveOneOut, LeavePOut, KFold
from .model_selection.partition import StratifiedKFold, LeaveOneLabelOut
etc.
I made a few small nitpick comments but this looks really great! I also agree on the "have two versions, deprecate old" approach. |
+1, I think this is especially possible thanks to our descriptive function and object names. We can keep a nested structure for code organization, but expose a flat interface: Currently, even for the simplest of realistic examples, you'd need to import from 4-5 different submodules. This would be a step in the right direction. |
This PR is getting hard to review, because the renaming, coupled with the significant change, makes it hard to read the diff. It's a pity, because there is a lot of interest in this enhancement. I thought for a bit what the best way to proceed would be, and I suggest this:
What do you think? |
Yes, I think others have been voting for that deprecation strategy in any Ping @pignacio, are you still with us? On 24 July 2014 19:25, Vlad Niculae notifications@github.com wrote:
|
Yep, right here. Been a little busy these last days. If we all agree, we can split this PR like vene suggests. I'll work on 1) for now so we can close that.
+1 on comments of any kind. The more eyes looking this the better :) |
Well, reading it again, I'm not actually sure that there's much point in @vene's two-phase approach. Copying the old versions only means that we're giving users of the development version a false deprecation warning, or rather telling them to use something that doesn't have the right interface yet. The deprecations are module-level: we can warn when the user imports the module, so it's actually just a two-line header in each module, not a great change. Rather, I think:
Even if we'd like this move to be released in one version, putting it all in one PR is just too overwhelming. And any/most changes outside of the cross_validation module should be backwards-compatible, as far as I can tell. |
@pignacio Would you be able to find time to continue the PR? Or if its okay by you could I continue with your work by cherry-picking your commits? |
@ragv, just to note this is probably not as simple as it looks... On 16 February 2015 at 19:28, ragv notifications@github.com wrote:
|
Hi all, since this pull request was mentioned in the mailing list, and @GaelVaroquaux showed interest on seeing this closed, I'd like to share my thoughts respecting this changes. A brief recap first:This pull request intended to fix two issues:
What's done so farThis pull request handles the migration from The implementation is not bulletproof, but looks promising so far. But, as @vene points out, having these two changes implemented in the same pull request makes looking at diffs and reviewing the code difficult. What do you think about splitting this changeset into three parts, as in:
I think these would make the changes more simple to implement and review. @ragv has shown interest into finishing this, and I can help too. Splitting the changeset should help us make progress. Thanks for reading up to here, and feel free to comment your thoughts :) |
def __init__(self, n, n_folds, indices, shuffle, random_state): | ||
super(_BaseKFold, self).__init__(n, indices) | ||
|
||
if abs(n_folds - int(n_folds)) >= np.finfo('f').eps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amueller Wouldn't a simple if type(n_folds) is not int:
suffice here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type(n_folds)
is not good, as it could be a numpy int. A isinstance(n_folds, numeric.Integer)
would work, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Thanks :)
when it's merged |
yay! On 6 November 2015 at 07:25, Manoj Kumar notifications@github.com wrote:
|
:D |
Hi guys,
I'm working on #1848 and #2904, and would like to get some feedback on my changes.
Here's a writeup about the issues/changes: http://pignacio.github.io/jwoc-blog/stories/stuff-about-scikit-learn18482904.html
Any thoughts / suggestions / improvements / rants on the changes are greatly appreciated