[WIP] Sample weight consistency #5515
Conversation
…ck whether test should apply to all of the estimators
sklearn/tests/test_common.py (outdated)
set_random_state(estimator_aug, random_state=random_state)
estimator_aug.fit(X_aug_train, y_aug_train)
except ValueError:
I meant: instead of the try/except block, it would be safer to have a list of estimators that have sample_weight in the fit params but might not support it (which in this case is just LogisticRegression). What do you think?
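A minimal sketch of that exclusion-list approach, assuming a hypothetical list name and helper function (the actual wiring in the common tests may differ):

import inspect

# Hypothetical list of estimators that expose sample_weight in fit()
# but are excluded from the equivalence test, per the discussion above.
EXCLUDE_SAMPLE_WEIGHT = ["LogisticRegression"]

def has_testable_sample_weight(name, Estimator):
    # Skip explicitly excluded estimators ...
    if name in EXCLUDE_SAMPLE_WEIGHT:
        return False
    # ... and keep only those whose fit() signature accepts sample_weight.
    return "sample_weight" in inspect.signature(Estimator.fit).parameters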
See if you can make SGDClassifier and Perceptron work, maybe? Might be a convergence issue, or we ask for too high a precision on the coef.
…for SGD estimators to change number of iterations or precision.
I couldn't make SGDClassifier and Perceptron work, so I exclude them as well. Tests are now passing. Please take a look at the exclusion lists and see if you are OK with that.
Problems with Travis: different errors on different machines, I guess.
Do I exclude them too? test_get_params_invariance has a typo.
sklearn/tests/test_common.py (outdated)
n_samples, n_features = 20, 5
rng = check_random_state(random_state)

sample_weight = (rng.permutation(n_samples) < (n_samples / 2.)) * 1
Can you make this casting more explicit, maybe with astype(np.int)?
Is there any way to cover the case of float (positive) weights?
I guess one could work with rationals and then use the common denominator.
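A tiny illustration of that common-denominator idea (the numbers are made up); whether a global rescaling of all weights leaves a given estimator unchanged is a separate question:

from fractions import Fraction
import numpy as np

weights = [Fraction(1, 2), Fraction(3, 4), Fraction(1, 1)]
common = np.lcm.reduce([w.denominator for w in weights])  # common denominator: 4
counts = [int(w * common) for w in weights]               # [2, 3, 4]
# Repeating sample i counts[i] times corresponds to weighting it by weights[i],
# up to multiplying every weight by the constant `common`.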
set_random_state(estimator_sw, random_state=random_state)
estimator_sw.fit(X[train], y[train], sample_weight=sample_weight[train])
X_aug_train, y_aug_train = aug((X[train], y[train]),
                               sample_weight[train])
Do you think a test for aug would make sense here? Something like
assert_equal(X_aug_train.shape[0], np.sum(sample_weight[train]))
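For reference, a minimal sketch of what such an aug helper could look like for non-negative integer weights (the actual helper in this PR may differ); with this definition the suggested assertion holds, since each row is repeated weight-many times:

import numpy as np

def aug(data, sample_weight):
    # Repeat each sample as many times as its integer weight, so that fitting on
    # the augmented data should match fitting on the original data with sample_weight.
    X, y = data
    counts = np.asarray(sample_weight, dtype=int)
    return np.repeat(X, counts, axis=0), np.repeat(y, counts, axis=0)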
@eickenberg Right, but that would fall back on an implementation with integer numbers again. I agree that the tests with integers will uncover the biggest issues. However, it's true that most if not all of the algorithms can take positive real numbers as weights.
n_samples, n_features = 20, 5
rng = check_random_state(random_state)

sample_weight = (rng.permutation(n_samples) < (n_samples / 2.)).astype(np.float)
I was not suggesting to do that. I think int makes more sense in this case.
My argument is that your tests will not cover the situation in which learning algorithms are weighted with float sample weights, as there is no corresponding interpretation of "weight = number of sample copies". I simply do not think there is a way to cover this situation, though.
Well, you could have a dataset with some duplicate samples, use a sample weight of 0.5 on it, and compare against the data without duplicates? Or use 1.5 and compare it with data that has triplicates?
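A minimal sketch of that check, using Ridge only as an illustrative weight-aware estimator:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
y = rng.randn(10)

# Duplicate every sample and give each copy a weight of 0.5 ...
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])
w_half = np.full(2 * len(X), 0.5)
est_dup = Ridge(alpha=1.0).fit(X_dup, y_dup, sample_weight=w_half)

# ... which should match a plain fit on the data without duplicates.
est_plain = Ridge(alpha=1.0).fit(X, y)
np.testing.assert_allclose(est_dup.coef_, est_plain.coef_)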
OK, I didn't understand it then. Would it make sense to multiply the weights by the number of decimals used and then do the same (augment the data)? Is this what you mean? It doesn't change much, though.
I think solving weighting with rescaling (the multiplication by the weights) is only appropriate for linear models.
What do you suggest then?
Your idea is the way to go with this test, I would not do anything else :)
@ogrisel do you know why Travis doesn't pass with Python 3 but it does with Python 2?
We were discussing in #5526 (comment) the addition of another test. The code should look like this one, looping over the appropriate regressors and testing both dense and sparse inputs.
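A rough sketch of the shape such a test could take; the choice of regressors, the tight Ridge tolerance, and the rtol are illustrative assumptions, not what #5526 settled on:

import numpy as np
from scipy import sparse
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
y = X @ rng.randn(4) + 0.1 * rng.randn(30)
sample_weight = rng.randint(1, 4, size=30).astype(float)

# Loop over a couple of weight-aware regressors and check that the dense and
# sparse code paths agree on the weighted solution.
for reg in (LinearRegression(), Ridge(alpha=1.0, tol=1e-10)):
    dense = clone(reg).fit(X, y, sample_weight=sample_weight)
    sp = clone(reg).fit(sparse.csr_matrix(X), y, sample_weight=sample_weight)
    np.testing.assert_allclose(dense.coef_, sp.coef_, rtol=1e-5)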
How does sample weight work in MultinomialNB? Is there any documentation or an equation? Can someone help me, please? I'd appreciate it!
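Not part of this PR, but as far as I know sample_weight in MultinomialNB simply scales each sample's contribution to the per-class (smoothed) feature counts, so an integer weight acts like duplicating that sample. A small sketch of that equivalence with made-up data:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 0], [0, 1], [1, 1]])
y = np.array([0, 1, 1])

# Weighting the first sample by 2 ...
clf_w = MultinomialNB().fit(X, y, sample_weight=[2, 1, 1])
# ... should match fitting with that sample included twice.
clf_dup = MultinomialNB().fit(np.vstack([X, X[:1]]), np.r_[y, 0])

np.testing.assert_allclose(clf_w.feature_log_prob_, clf_dup.feature_log_prob_)
np.testing.assert_allclose(clf_w.class_log_prior_, clf_dup.class_log_prior_)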
I think the issue is the same I saw in #7618. There should be an AttributeError, not a ValueError. That's a bug in the SVC.
Feel free to fix this here or do a separate PR.
Fixes #5367.
Sorry, I'm a bit out of the loop with this one. What's the status? Can you rebase?
Hmm... I'd forgotten this was here. Should this be closed given #11558, @sergulaydore? I've not yet checked if it's completely redundant.
Supersedes #5461
Added a 0/1 sample weight test. Failing estimators are
For the previous test,
The differences between the two are:
How do I proceed? Do I create an exclusion list or an inclusion list (e.g. linear models) for the previous test?
ping @eickenberg, @amueller, @glouppe, @arjoly, @agramfort, @GaelVaroquaux
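For context, a minimal sketch of the 0/1 equivalence this test checks, with Ridge standing in only as an illustration (the actual test in this PR loops over many estimators):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import check_random_state

rng = check_random_state(0)
n_samples, n_features = 20, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

# 0/1 weights: half of the samples are switched off.
sample_weight = (rng.permutation(n_samples) < (n_samples / 2.)).astype(int)

est_sw = Ridge().fit(X, y, sample_weight=sample_weight)
est_sub = Ridge().fit(X[sample_weight == 1], y[sample_weight == 1])

# A zero weight should behave exactly like dropping the sample.
np.testing.assert_allclose(est_sw.coef_, est_sub.coef_)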