[MRG+1] Enabling sample weights for 3 of the 4 logistic regression solvers #5204

Closed
wants to merge 3 commits

Conversation

vstolbunov
Contributor

I observed that logistic regression could not be trained with sample weights. Digging deeper, I found that two of the three solvers (lbfgs and newton-cg) can handle sample weights, while the default solver (liblinear) cannot, and as a result sample weights were hidden from the user entirely. To remedy this, I have:

  • added a ValueError when the liblinear solver is used with sample weights
  • allowed for sample_weight to be a parameter in both LogisticRegression.fit() and LogisticRegressionCV.fit()
  • documented the parameter everywhere it can be passed
  • updated the documentation for the class_weight parameter to note that it will be multiplied by sample_weight if sample_weight is provided in .fit()
  • handled sample weights appropriately throughout the fitting by passing them to logistic_regression_path and _log_reg_scoring_path
  • added a new test function in test_logistic.py with four different tests to make sure:
    • that a ValueError is thrown when liblinear is used with sample weights
    • that passing sample weights as np.ones is the same as not passing them (default None)
    • passing sample weights to both lbfgs and newton-cg solvers yields the same results
    • passing class weights to scale classes is the same as passing sample weights that scale the training examples of those classes
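The class-weight/sample-weight equivalence checked by the last test can be sketched like this (a minimal illustration with made-up data, not the exact test code from the PR):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Doubling the weight of class 1 via class_weight...
clf_cw = LogisticRegression(solver='lbfgs', class_weight={0: 1, 1: 2})
clf_cw.fit(X, y)

# ...should be equivalent to doubling the sample weight of every
# class-1 training example.
sw = np.ones(y.shape[0])
sw[y == 1] = 2
clf_sw = LogisticRegression(solver='lbfgs')
clf_sw.fit(X, y, sample_weight=sw)

# Both fits minimize the same weighted loss, so the coefficients match.
np.testing.assert_array_almost_equal(clf_cw.coef_, clf_sw.coef_, decimal=4)
```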

# If sample weights are not provided, set them to 1 for all examples
if sample_weight is None:
    sample_weight = np.ones(X.shape[0])

Member


you might want to do

sample_weight = np.array(sample_weight)

here, to support lists, etc.

Also you can add a check_consistent_length to check that they are the same size.

Contributor Author


Done.

@vstolbunov
Contributor Author

Currently fails in test_probability() of ensemble/tests/test_bagging.py as a BaggingClassifier is created with base_estimator=LogisticRegression and then fit. During .fit(), _parallel_build_estimators() in ensemble/bagging.py (line 104) uses the default solver (liblinear) to fit with sample weights. This raises the ValueError I implemented to catch this scenario.

Possible solutions:

  • somehow make the BaggingClassifier in this test use a different LogisticRegression solver
  • switch the default solver for logistic regression to lbfgs or newton-cg, since both support sample weights
  • implement some kind of exception in _parallel_build_estimators

UPDATE:

  • forced BaggingClassifier in test_bagging.py to use the lbfgs solver when base_estimator=LogisticRegression (see 9d3becf)

@MechCoder
Member

forced BaggingClassifier in test_bagging.py to use the lbfgs solver.

This update will break existing code. The quick workaround I can think of is to change this line

estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)

to

if sample_weight is None:
    estimator.fit(X[:, features], y)
else:
    estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)

@MechCoder
Member

Thanks for the PR. It looks good except for a few minor comments.

By the way, you might be interested in this #5207

@vstolbunov
Contributor Author

Currently experiencing test failure in test_sample_weight_missing() of ensemble/tests/test_weight_boosting.py as a ValueError is not being raised when LogisticRegression is the regressor.

Possible solution:

if sample_weight is None:
    estimator.fit(X[:, features], y)
else:
    estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)
Contributor Author


This change appears to result in failures of test_bootstrap_samples and test_oob_score_regression in test_bagging.py. (cc: @MechCoder)

EDIT: see https://travis-ci.org/scikit-learn/scikit-learn/builds/78498412

Member


hmm. I'll have a detailed look tomorrow.

Contributor Author


Temporary: I will re-implement the changes in 9d3becf and also comment out the tests with LogisticRegression in test_sample_weight_missing of test_weight_boosting.py - to see if there are any other issues beyond these two.

EDIT: All checks passed.

@MechCoder
Member

can you post the git reflog output? Thanks

@vstolbunov
Contributor Author

Sure, I dumped into a misc folder in my personal repos:
https://github.com/vstolbunov/misc/blob/master/reflog.txt

Not sure what happened, help appreciated :)

@MechCoder
Member

I think you did a merge and a rebase. You can try

git reset --hard HEAD@{1} # Restores your branch to what happened before the merge.
git push -f origin lr-sampleweights

(Also, can you squash your commits into 1 or 2? It makes the history cleaner, unless the number of lines changed is huge.)

@vstolbunov
Contributor Author

Great, thanks! Back to 3 files.

@MechCoder
Member

+1 for merge apart from the comment regarding "liblinear"

@MechCoder MechCoder changed the title Enabling sample weights for 2 of the 3 logistic regression solvers [MRG+1] Enabling sample weights for 2 of the 3 logistic regression solvers Sep 9, 2015
@amueller
Member

amueller commented Sep 9, 2015

Ok maybe lets leave the if in for the moment then.

@vstolbunov
Contributor Author

Also squashed down to two commits now.

* Updated _check_solver_option to include sample_weight check
* Updated all calls to _check_solver_option()
* Updated documentation of class_weight throughout logistic.py
* Added sample_weight parameter to logistic_regression_path.
* Added handling of sample weights to logistic_regression_path.
* Added sample_weight parameter to _log_reg_scoring_path.
* Added handling of sample weights to _log_reg_scoring_path.
* Added sample_weight parameter to fit() in the LogisticRegression class.
* Added handling of sample weights in LogisticRegression.fit()
* Added sample_weight parameter to fit() in the LogisticRegressionCV class.
* Added handling of sample weights in LogisticRegressionCV.fit()
* Added test_logistic_regressioncv_sample_weights, which:
  * tests that a ValueError is raised if liblinear is used with
    sample weights
  * tests that passing sample weights as np.ones(y.shape[0]) is
    the same as not passing them (default None)
  * tests that using both lbfgs and newton-cg solvers with
    sample weights yields the same results
  * tests that passing class weights to scale one class is the
    same as passing sample weights for the training data of just
    that class
* Fixed bug with *= in logistic_regression_path.
* Fixed bug in test_logistic_regressioncv_sample_weights where
  no data was created prior to fitting.
* Changes to accepted sample_weight type.
* Fixed bug in naming of sample_weight when passed from
  _log_reg_scoring_path to logistic_regression_path.
* Fixed issue of sample_weight=None being converted to np.array()
  and then not being recognized as None.
* Added tests for LogisticRegression
* Attempting to fix same issue as 9d3becf by instead implementing
  if statement in bagging.py.
* Added TODO to eliminate check for liblinear w/ sample weights
  in bagging.py
@vstolbunov
Contributor Author

Looks like the sag solver was added to LogisticRegression so I had to rebase and to make it easier decided to squash down to one commit for this PR.

Looks like there is still some sag work going on, so it would be good if this PR was wrapped up soon :)

cc for 2nd review: @glouppe @ndawe

@amueller
Member

LGTM. Whatsnew?

@vstolbunov
Contributor Author

Nothing has changed since 2 days ago. Just had conflicts in logistic.py with recently merged PRs so I had to rebase today. The Travis check is done and AppVeyor should be done soon.

@MechCoder
Member

I think Andy meant whatsnew.rst as in sklearn/doc/whatsnew.rst rather than the Oxford dictionary definition. :)

@vstolbunov
Contributor Author

Oh yeah I'm working on it now, I thought he was actually wondering why there was a new commit haha.

@amueller
Member

sorry for being terse from time to time ;) that is what I meant.

@vstolbunov
Contributor Author

Looks like @TomDLT added tests with sag into test_logistic.py, so bear with me while I include sag in test_logistic_regression_sample_weights and test_logistic_regressioncv_sample_weights.

@vstolbunov
Contributor Author

Hey @TomDLT does the sag solver support sample weights?

EDIT: my sandbox tests show that it does, so I'm not sure what's going on with the current error we are experiencing here.

File "sklearn/utils/seq_dataset.pyx", line 158, in sklearn.utils.seq_dataset.ArrayDataset.__cinit__ (sklearn/utils/seq_dataset.c:2514)
    def __cinit__(self, np.ndarray[double, ndim=2, mode='c'] X,
ValueError: Buffer dtype mismatch, expected 'double' but got 'long'

EDIT 2: looks like it expects the sample weights to be floats and doesn't like them being ints (which is the case in the test)

EDIT 3: fixed the error, but now not sure why the coefficients don't match up (failed test) when sag is used

EDIT 4: all done, just had to decrease the tolerance so sag could converge
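The fix from EDIT 2 amounts to casting the integer weights to floats before fitting. A minimal sketch (made-up data; assumes a current scikit-learn where sag accepts sample weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# y is an integer array, so y + 1 is integer-typed; the sag code
# path expected double-precision weights, hence the explicit cast.
sample_weight = (y + 1).astype(np.float64)

clf = LogisticRegression(solver='sag', tol=1e-3, max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```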

@vstolbunov vstolbunov force-pushed the lr-sampleweights branch 3 times, most recently from 7c7179d to 910ca83 Compare September 12, 2015 19:52
@vstolbunov
Contributor Author

All done. Also updated whatsnew. Changing PR name as well since now 3/4 solvers can use the sample_weight feature.

@vstolbunov vstolbunov changed the title [MRG+1] Enabling sample weights for 2 of the 3 logistic regression solvers [MRG+1] Enabling sample weights for 3 of the 4 logistic regression solvers Sep 12, 2015
tol = 1e-7
clf_sw_lbfgs = LogisticRegression(solver='lbfgs', fit_intercept=False,
                                  tol=tol)
clf_sw_lbfgs.fit(X, y, sample_weight=y+1)
Member


y+1 -> y + 1

@@ -570,6 +570,53 @@ def test_logistic_regressioncv_class_weights():
assert_array_almost_equal(clf_lib.coef_, clf_sag.coef_, decimal=4)


def test_logistic_regression_sample_weights():
    X, y = make_classification(n_samples=20, n_features=5, n_informative=3,
                               n_classes=3, random_state=0)
Member


I am not sure that n_classes=3 adds anything interesting in this test.
n_classes=2 would reduce the run-time, which is still quite long (0.76 to 0.29 sec).

Contributor Author


Sure, changed. That actually eliminates the need for two for loops.

@TomDLT
Member

TomDLT commented Sep 14, 2015

This looks good to me.
Thanks @vstolbunov !

@vstolbunov
Contributor Author

Let's get this merged?

@MechCoder
Member

just did, thanks.

@MechCoder MechCoder closed this Sep 15, 2015
@amueller
Member

I really recommend using flake8 instead of pep8, as it also checks for unused or undefined variables, etc.
