
[WIP] RidgeGCV with sample weights is broken #4490


Closed

Conversation

@eickenberg (Contributor) commented Apr 2, 2015

What the title says.

The way it is right now, the sample weights weight the eigenspaces of the Gram matrix, which doesn't seem sensible.
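For reference: applying sample weights to a squared-loss data term is equivalent to rescaling the rows of X and y by sqrt(sample_weight), so the eigendecomposition RidgeGCV relies on should be computed on the rescaled data rather than having the weights applied to the eigenspaces afterwards. A minimal sketch of that equivalence (illustrative sizes and values, not the RidgeGCV internals):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = rng.randn(20)
w = rng.rand(20) + 0.5  # positive sample weights

# weighted ridge: minimize sum_i w_i * (y_i - x_i @ beta)**2 + alpha * ||beta||**2
weighted = Ridge(alpha=1.0, fit_intercept=False).fit(X, y, sample_weight=w)

# equivalent unweighted ridge on sqrt(w)-rescaled data
sw = np.sqrt(w)
rescaled = Ridge(alpha=1.0, fit_intercept=False).fit(X * sw[:, np.newaxis], y * sw)

np.testing.assert_allclose(weighted.coef_, rescaled.coef_, rtol=1e-6)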



@landscape-bot

Code Health
Code quality remained the same when pulling 2496476 on eickenberg:ridge_gcv_sample_weights into 6e54079 on scikit-learn:master.

@amueller (Member) commented Apr 2, 2015

Should we add a common test that sample weights correspond to duplicated data points?

@eickenberg (Contributor, Author)

Good point! How global can this be made? (I am not sure what exceptions there are. Right now it looks to me that estimators finding an exact optimum generally qualify. SGDRegressor with its class weights does not: although repeating each sample class_weight times consecutively should yield the same result at batch_size=1, I think this type of estimator should be excluded. We should take this discussion elsewhere; I'll try to submit a PR of what I imagine it could look like.)
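A minimal sketch of what such a duplication-equivalence check could look like, with Ridge standing in for the estimator under test (the actual common test would have to loop over estimators and handle the exclusions discussed above):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = rng.randn(30)
w = rng.randint(1, 4, size=30)  # integer weights so they map to row repetitions

est_weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=w)

# the same data with each row repeated w_i times
X_dup = np.repeat(X, w, axis=0)
y_dup = np.repeat(y, w)
est_dup = Ridge(alpha=1.0).fit(X_dup, y_dup)

np.testing.assert_allclose(est_weighted.coef_, est_dup.coef_, rtol=1e-6, atol=1e-10)
np.testing.assert_allclose(est_weighted.intercept_, est_dup.intercept_, rtol=1e-6, atol=1e-10)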

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling 89299dc on eickenberg:ridge_gcv_sample_weights into 6e54079 on scikit-learn:master.

def test_ridge_gcv_with_sample_weights():
    n_samples, n_features, n_targets = 20, 5, 7
    X, Y, W, _, _ = make_noisy_forward_data(n_samples, n_features, n_targets)
Vlad Niculae (Member):

Couldn't this use sklearn.datasets.make_regression?

@eickenberg (Contributor, Author)

Indeed, sorry. I copied this in from an ages-old PR where I didn't know about that ;)

Will update.
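For reference, a sketch of what the make_regression replacement could look like (parameter values here are illustrative, not the ones the test eventually uses):

from sklearn.datasets import make_regression

# generate multi-target regression data; coef=True also returns the true weights
n_samples, n_features, n_targets = 20, 5, 7
X, Y, W = make_regression(n_samples=n_samples, n_features=n_features,
                          n_informative=n_features, n_targets=n_targets,
                          noise=1.0, coef=True, random_state=42)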


@coveralls

Coverage Status

Coverage increased (+0.04%) to 95.16% when pulling 5e6c2ea on eickenberg:ridge_gcv_sample_weights into 6e54079 on scikit-learn:master.

@landscape-bot

Code Health
Repository health decreased by 0.02% when pulling 5e6c2ea on eickenberg:ridge_gcv_sample_weights into 6e54079 on scikit-learn:master.

@jnothman (Member)

> Should we add a common test that sample weights correspond to duplicated data points?

We have such invariance testing of sample weights in metrics (also scale invariance of weights, and that sample_weight=None <=> sample_weight=np.ones(...); it's possible, though, that there are algorithms, such as clusterers, where this is not true). I'd like to see more invariance testing of sample weights and class weights, including the test from #4838.
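For illustration, a minimal sketch of those two invariances for a single metric (the real common tests cover many metrics and are parameterized differently):

import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=50)
y_pred = rng.randint(0, 2, size=50)
w = rng.rand(50) + 0.5

# sample_weight=None is equivalent to unit weights
assert np.isclose(accuracy_score(y_true, y_pred),
                  accuracy_score(y_true, y_pred, sample_weight=np.ones(50)))

# rescaling all weights by a constant does not change the score
assert np.isclose(accuracy_score(y_true, y_pred, sample_weight=w),
                  accuracy_score(y_true, y_pred, sample_weight=10 * w))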

@amueller (Member)

I'm no expert here but this seems like something that would be good to have. Maybe not for the RC but for the release?

@eickenberg (Contributor, Author)

was thinking of sprinting on adding some more of these. is that too late?


@amueller (Member)

Some more tests? I guess only the RC will be before the sprint, so we can still cherry-pick bugfixes from the sprint into the release.

@eickenberg force-pushed the ridge_gcv_sample_weights branch from 5e6c2ea to d3a44e7 on October 19, 2015 08:49
@eickenberg (Contributor, Author)

Have rebased this. Will attempt a common test for a certain number of classifiers/regressors concerning the effect of sample_weight in a different PR.

@@ -715,3 +716,57 @@ def test_ridge_fit_intercept_sparse():
assert_warns(UserWarning, sparse.fit, X_csr, y)
assert_almost_equal(dense.intercept_, sparse.intercept_)
assert_array_almost_equal(dense.coef_, sparse.coef_)


def make_noisy_forward_data(
Review comment (Member):

The comment should be at the top and start with #.

@amueller (Member)

@agramfort do you want to review this?

@ogrisel (Member) commented Oct 21, 2015

The AppVeyor failure looks real:

======================================================================
FAIL: tests if the sag regressor performs well
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Python35-x64\lib\site-packages\sklearn\utils\testing.py", line 317, in wrapper
    return fn(*args, **kwargs)
  File "C:\Python35-x64\lib\site-packages\sklearn\linear_model\tests\test_sag.py", line 442, in test_sag_regressor
    assert_greater(score2, 0.99)
AssertionError: 0.98925947044880735 not greater than 0.99
----------------------------------------------------------------------

n_targets=10,
train_frac=.8,
noise_levels=None,
random_state=42):
Review comment (Member):

Please use a more classical formatting for the args; it's confusing this way.

For instance:

def make_noisy_forward_data(n_samples=100, n_features=200, n_targets=10,
                            train_frac=.8, noise_levels=None, random_state=42):
     ...

@eickenberg (Contributor, Author)

The rebase must have gone wrong; I had removed this in favor of make_regression.

@ogrisel changed the title from "[BUG] RidgeGCV with sample weights is broken" to "[WIP] RidgeGCV with sample weights is broken" on Oct 22, 2015
@lesteve modified the milestones: 0.17, 0.18 on Jul 27, 2016
@lesteve modified the milestones: 0.18, 0.17 on Jul 27, 2016
@amueller modified the milestones: 0.18, 0.19 on Sep 22, 2016
@jnothman (Member)

@eickenberg are you intending to complete this?

@ogrisel (Member) commented May 10, 2019

Fixed by #13350.

@ogrisel closed this on May 10, 2019