[MRG+1] Use more natural class_weight="auto" heuristic #4347
Conversation
logreg0 = LogisticRegression(class_weight="auto").fit(X_0, y_0)
logreg = LogisticRegression(class_weight="auto").fit(X_, y_)
assert_array_almost_equal(logreg1.coef_, logreg0.coef_)
assert_array_almost_equal(logreg.coef_, logreg0.coef_)
This is not true in master.
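For context, here is a minimal sketch of the invariance this kind of test gets at: duplicating samples of one class (re-sampling with replacement) should leave the fitted coefficients essentially unchanged under the new heuristic. The toy data, C value, and tolerance below are my own assumptions, not the PR's actual test:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical overlapping blobs, 100 samples per class.
X_0 = np.vstack([rng.randn(100, 2) - 1, rng.randn(100, 2) + 1])
y_0 = np.array([0] * 100 + [1] * 100)

# Introduce imbalance: each class-1 sample now appears three times.
X_ = np.vstack([X_0, np.repeat(X_0[y_0 == 1], 2, axis=0)])
y_ = np.concatenate([y_0, np.repeat(y_0[y_0 == 1], 2)])

# The "balanced" weights rescale the weighted loss by a constant only, so
# with regularization made negligible (large C) the optimum stays put.
logreg0 = LogisticRegression(class_weight="balanced", C=1e4,
                             max_iter=1000).fit(X_0, y_0)
logreg = LogisticRegression(class_weight="balanced", C=1e4,
                            max_iter=1000).fit(X_, y_)
print(np.allclose(logreg0.coef_, logreg.coef_, atol=1e-2))  # True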
Hey @amueller & @ogrisel, as I mentioned on Gitter, the change (while I don't dispute that it is much more intuitive) could break reproducibility of results between scikit-learn versions. I ran a couple of quick toy-dataset tests and there are some differences in results between the implementations. I think a comment about the effect in the what's new would be a good idea.
[Logit: old-version vs. new-version outputs appeared here. Forests: old-version vs. new-version outputs appeared here.]
@trevorstephens do you think the whatsnew entry that I added is not sufficient? This will definitely change results, but as I remarked elsewhere, if the user grid-searches C they can still end up with an equivalent model.
Sorry to have been silent (was at a conference). Yes, if the user grid-searches C in any of the linear models, they can theoretically get the same optimized parameters as before, depending on how many C values they search over. Specifically, if w0, ..., wM are the old class weights and w'0, ..., w'M are the new class weights, such that w'm = lambda * wm where lambda is the constant described above, then the optimized parameters of the model with class weights w0, ..., wM and an inverse regularization parameter C will be identical to the optimized parameters of the model with class weights w'0, ..., w'M and an inverse regularization parameter C / lambda. So, without grid-searching over C (or setting C to whatever value was used previously, divided by lambda), this change to class weights will produce different results.

So I think there are two options: leave "auto" as it is and just be much clearer about what it does (even acknowledging that maybe it's not the most obvious thing to do, but that it differs only in the "effective" number of data points), or change it. If we go the latter route, we also need to give a much clearer explanation in the documentation of what "auto" does and how it relates to the value of C, but we should also make very clear somewhere (and I don't know where) exactly how the change will affect people's results with a fixed C value and how they should set C to recreate their old results. (I think this is important, not least because changes to default parameters can be very confusing and easy to overlook/miss. I ran into this last week with the change to the default min_df value in CountVectorizer() -- ugh.)

Anyway, all that said, I'd like to hold off on changing anything for another 24 hours or so. Two reasons: 1) I have no idea how this change will affect random forests or other (nonlinear) classifiers. Someone should investigate this before changing anything, and should really do so by writing down the math, in addition to plugging in values and seeing what happens. 2) I'd like to find out what other weighting schemes (if any) other people (e.g., Hal, John, etc.) can think of, just to check that there aren't other, even more natural, options that we're overlooking.
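To make the constant concrete, here is a quick sketch. The toy labels are chosen to match the test that appears further down; the old formula is the pre-change "auto" computation (inverse class frequencies normalized by their mean), as far as I can reconstruct it:

import numpy as np

y = np.array([1, 1, 1, -1, -1])
classes, counts = np.unique(y, return_counts=True)  # classes [-1, 1], counts [2, 3]
n_samples, n_classes = len(y), len(classes)

# Old "auto": inverse class frequencies, normalized by their mean.
recip_freq = 1.0 / counts
w_old = recip_freq / recip_freq.mean()                   # [1.2, 0.8]

# New heuristic: n_samples / (n_classes * count) per class.
w_new = n_samples / (n_classes * counts.astype(float))   # [1.25, 0.8333...]

# The ratio is the same constant lambda for every class (here 25/24),
# which is why C / lambda with the new weights reproduces old results.
print(w_new / w_old)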
Thanks for your feedback. We do document all changes in the whatsnew file. We could also make the new heuristic available under a new name and deprecate "auto".
@amueller @hannawallach ... I guess I just meant that if people had hard-coded a 'good' set of parameters, through grid search or heuristics, their models might change a bit. Some comment about this in the what's new is what I was suggesting. As for forests and trees, I'll take a look at how the change plays out there.
I'm slightly surprised that the tree output changes. Have you checked whether the tree itself changes, or only the probability estimates?
@trevorstephens what was the data you compared the trees on?
Never mind, I reproduced the tree things here: #4366
I'm not super happy about behavior changes, as they can surprise users, and we can't really assume users read whatsnew.rst.
Was in the first code block of the post, just a rather small unbalanced dataset.
Personally, I'd prefer option number 2 (deprecate "auto" and introduce "balance" or some similar name), just so that it's very clear to everyone that there's been a change. I'm open to being convinced otherwise though :-)
If we go with number 2, does anyone have opinions on whether we should keep the old tests for "auto" around until it is removed?
I'd prefer the 2nd, deprecate, option as well. From a Kaggler POV, I want to be able to reproduce results wherever possible for verification, or just for blending runs. Plus 'balance' or something similar seems more informative than 'auto', IMO. As requested, I looked into the two forests, and there's definitely some strange stuff going on. The first tree in the ensemble already diverges, albeit only slightly: not necessarily in Gini, sample counts, possibly not even the predicted outcome, but you can see a different variable split at the far-left node. Other trees I've looked at further down the ensemble diverge even more than this one, with differing Gini values down the tree. Using:
[Master vs. new-version comparison code and tree visualizations appeared here.]
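The pre-change behavior is gone from current releases, so the exact master-vs-branch comparison can't be rerun, but a similar eyeballing of first trees works for the two modes this PR ends up with. A rough sketch with a hypothetical dataset (export_text needs scikit-learn >= 0.21):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Hypothetical small, unbalanced problem (roughly 9:1 classes).
X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)

for mode in ("balanced", "balanced_subsample"):
    forest = RandomForestClassifier(n_estimators=10, class_weight=mode,
                                    random_state=0).fit(X, y)
    print(mode)
    # Print the top of the first tree to see where the splits differ.
    print(export_text(forest.estimators_[0], max_depth=2))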
Beyond Kaggle, which is just an elaborate game, people won't trust scikit-learn if the behavior changes between versions. I think we should strive to go through a deprecation cycle for this too.
@GaelVaroquaux a deprecation cycle meaning a name change and later removal of the old name (just to confirm)?
Ok, will head on over there. Sorry to pollute the waters.
This has the deprecations and additional parameter now. Any reviews?
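For anyone following along, the deprecation path discussed above typically looks like the following sketch. This is a hypothetical helper, not the PR's actual code:

import warnings

def resolve_class_weight_mode(class_weight):
    # Hypothetical: accept the old string for now, warn, and keep routing
    # it to the old computation, so existing code keeps its old behavior
    # until "auto" is removed at the end of the deprecation cycle.
    if class_weight == "auto":
        warnings.warn('class_weight="auto" is deprecated and will be '
                      'removed; use class_weight="balanced" for the new '
                      'heuristic.', DeprecationWarning)
    return class_weight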
Well, I don't see a different way, so maybe it will be "balanced_subsample".
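A small sketch of what distinguishes the two modes, assuming (per the eventual docs) that "balanced_subsample" recomputes "balanced"-style weights on each tree's bootstrap sample instead of once on the full training set:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.RandomState(0)
y = np.array([0] * 90 + [1] * 10)   # hypothetical unbalanced labels
classes = np.unique(y)

# "balanced": one set of weights from the full training set.
print(compute_class_weight("balanced", classes=classes, y=y))

# "balanced_subsample": weights like these are recomputed per tree from
# its bootstrap draw, so they fluctuate with the sampled class counts.
boot = rng.randint(0, len(y), size=len(y))
print(compute_class_weight("balanced", classes=classes, y=y[boot]))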
Ok, should be good now.
-    1: 1. / 3 / mean_weight,
-    -1: 1. / 2 / mean_weight,
+    1: 5. / (2 * 3),
+    -1: 5. / (2 * 2)
     }
To make this test easier to read I would do:

import numpy as np

y = np.array([1, 1, 1, -1, -1])
n_samples = float(len(y))  # float so the ratios below don't truncate on Python 2
n_classes = len(np.unique(y))
class_weight = {
    1: n_samples / (np.sum(y == 1) * n_classes),
    -1: n_samples / (np.sum(y == -1) * n_classes),
}
Besides minor comments, LGTM.
@@ -56,6 +56,9 @@ Enhancements
   :class:`linear_model.LogisticRegression`, by avoiding loss computation.
   By `Mathieu Blondel`_ and `Tom Dupre la Tour`_.
+
+- Improved heuristic for ``class_weight="auto"`` for classifiers supporting
+  ``class_weight`` by Hanna Wallach and `Andreas Müller`_
There should be a bit more detail here: at least explain that class_weight="auto" has been deprecated in favor of class_weight="balanced", which provides invariance to class imbalance introduced by re-sampling with replacement for linear models. Maybe also mention class_weight='balanced_subsample' for forest models.
Aside from the small comments by @ogrisel and myself, 👍 for merge. Thank you
Have you seen that the tests are failing?
Fixed.
OK. Merging!
thanks :) 🍻
Fix for #4324.
No test yet. Travis passes, so we don't have very strict tests for this ^^ (apart from the manual reimplementation in the test).