[MRG+1] Enabling sample weights for 3 of the 4 logistic regression solvers #5204

Closed
wants to merge 3 commits

Conversation

vstolbunov
Contributor

I observed that logistic regression could not be trained with sample weights. Digging deeper, I found that two of the three solvers (lbfgs and newton-cg) can handle sample weights, while the default solver (liblinear) cannot, and as a result sample weights were hidden from the user entirely. To remedy this, I have:

  • added a ValueError when the liblinear solver is used with sample weights
  • allowed for sample_weight to be a parameter in both LogisticRegression.fit() and LogisticRegressionCV.fit()
  • documented the parameter everywhere it can be passed
  • updated the documentation for the class_weight parameter to note that it will be multiplied by sample_weight if sample_weight is provided in .fit()
  • handled sample weights appropriately throughout the fitting by passing them to logistic_regression_path and _log_reg_scoring_path
  • added a new test function in test_logistic.py with four different tests to make sure:
    • that a ValueError is thrown when liblinear is used with sample weights
    • that passing sample weights as np.ones is the same as not passing them (default None)
    • passing sample weights to both lbfgs and newton-cg solvers yields the same results
    • passing class weights to scale classes is the same as passing sample weights that scale the training examples of those classes
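The class-weight/sample-weight equivalence checked by the last test can be sketched like this (a minimal illustration with made-up data, not the exact test code from the PR):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Doubling the weight of class 1 via class_weight...
clf_cw = LogisticRegression(solver='lbfgs', class_weight={0: 1, 1: 2})
clf_cw.fit(X, y)

# ...should be equivalent to doubling the sample weight of every
# class-1 training example.
sw = np.ones(y.shape[0])
sw[y == 1] = 2
clf_sw = LogisticRegression(solver='lbfgs')
clf_sw.fit(X, y, sample_weight=sw)

# Both fits minimize the same weighted loss, so the coefficients match.
np.testing.assert_array_almost_equal(clf_cw.coef_, clf_sw.coef_, decimal=4)
```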

# If sample weights are not provided, set them to 1 for all examples
if sample_weight is None:
    sample_weight = np.ones(X.shape[0])

Member


you might want to do

sample_weight = np.array(sample_weight)

here, to support lists, etc.

Also you can add a check_consistent_length to check that they are the same size.

Contributor Author


Done.

@vstolbunov
Contributor Author

Currently fails in test_probability() of ensemble/tests/test_bagging.py as a BaggingClassifier is created with base_estimator=LogisticRegression and then fit. During .fit(), _parallel_build_estimators() in ensemble/bagging.py (line 104) uses the default solver (liblinear) to fit with sample weights. This raises the ValueError I implemented to catch this scenario.

Possible solutions:

  • somehow make the BaggingClassifier in this test use a different LogisticRegression solver
  • switch the default solver for logistic regression to lbfgs or newton-cg, since both support sample weights
  • implement some kind of exception in _parallel_build_estimators

UPDATE:

  • forced BaggingClassifier in test_bagging.py to use the lbfgs solver when base_estimator=LogisticRegression (see 9d3becf)

@MechCoder
Member

forced BaggingClassifier in test_bagging.py to use the lbfgs solver.

This update will break existing code. The quick workaround I can think of is to change this line

estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)

to

if sample_weight is None:
    estimator.fit(X[:, features], y)
else:
    estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)

@MechCoder
Member

Thanks for the PR. It looks good except for a few minor comments.

By the way, you might be interested in this #5207

@vstolbunov
Contributor Author

Currently experiencing test failure in test_sample_weight_missing() of ensemble/tests/test_weight_boosting.py as a ValueError is not being raised when LogisticRegression is the regressor.

Possible solution:

if sample_weight is None:
    estimator.fit(X[:, features], y)
else:
    estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)
Contributor Author


This change appears to result in failures of test_bootstrap_samples and test_oob_score_regression in test_bagging.py. (cc: @MechCoder)

EDIT: see https://travis-ci.org/scikit-learn/scikit-learn/builds/78498412

Member


hmm. I'll have a detailed look tomorrow.

Contributor Author


Temporary: I will re-implement the changes in 9d3becf and also comment out the tests with LogisticRegression in test_sample_weight_missing of test_weight_boosting.py - to see if there are any other issues beyond these two.

EDIT: All checks passed.

@MechCoder
Member

can you post the git reflog output? Thanks

@vstolbunov
Contributor Author

Sure, I dumped into a misc folder in my personal repos:
https://github.com/vstolbunov/misc/blob/master/reflog.txt

Not sure what happened, help appreciated :)

@MechCoder
Member

I think you did a merge and a rebase. You can try

git reset --hard HEAD@{1} # Restores your branch to what happened before the merge.
git push -f origin lr-sampleweights

(Also, can you squash your commits into 1 or 2? It makes the history cleaner, unless the number of lines changed is huge.)

@vstolbunov
Contributor Author

Great, thanks! Back to 3 files.

@MechCoder
Member

+1 for merge apart from the comment regarding "liblinear"

@MechCoder MechCoder changed the title Enabling sample weights for 2 of the 3 logistic regression solvers [MRG+1] Enabling sample weights for 2 of the 3 logistic regression solvers Sep 9, 2015
@amueller
Member

amueller commented Sep 9, 2015

Ok maybe lets leave the if in for the moment then.

@vstolbunov
Contributor Author

Also squashed down to two commits now.

* Updated _check_solver_option to include sample_weight check
* Updated all calls to _check_solver_option()
* Updated documentation of class_weight throughout logistic.py
* Added sample_weight parameter to logistic_regression_path.
* Added handling of sample weights to logistic_regression_path.
* Added sample_weight parameter to _log_reg_scoring_path.
* Added handling of sample weights to _log_reg_scoring_path.
* Added sample_weight parameter to fit() in the LogisticRegression class.
* Added handling of sample weights in LogisticRegression.fit()
* Added sample_weight parameter to fit() in the LogisticRegressionCV class.
* Added handling of sample weights in LogisticRegressionCV.fit()
* Added test_logistic_regressioncv_sample_weights, which:
  * tests that a ValueError is raised if liblinear is used with
    sample weights
  * tests that passing sample weights as np.ones(y.shape[0]) is
    the same as not passing them (default None)
  * tests that using both lbfgs and newton-cg solvers with
    sample weights yields the same results
  * tests that passing class weights to scale one class is the
    same as passing sample weights for the training data of just
    that class
* Fixed bug with *= in logistic_regression_path.
* Fixed bug in test_logistic_regressioncv_sample_weights where
  no data was created prior to fitting.
* Changes to accepted sample_weight type.
* Fixed bug in naming of sample_weight when passed from
  _log_reg_scoring_path to logistic_regression_path.
* Fixed issue of sample_weight=None being converted to np.array()
  and then not being recognized as None.
* Added tests for LogisticRegression
* Attempting to fix same issue as 9d3becf by instead implementing
  if statement in bagging.py.
* Added TODO to eliminate check for liblinear w/ sample weights
  in bagging.py
@vstolbunov
Contributor Author

Looks like the sag solver was added to LogisticRegression so I had to rebase and to make it easier decided to squash down to one commit for this PR.

Looks like there is still some sag work going on, so it would be good if this PR was wrapped up soon :)

cc for 2nd review: @glouppe @ndawe

@amueller
Member

LGTM. Whatsnew?

@vstolbunov
Contributor Author

Nothing has changed since 2 days ago. Just had conflicts in logistic.py with recently merged PRs so I had to rebase today. The Travis check is done and AppVeyor should be done soon.

@MechCoder
Member

I think Andy meant whatsnew.rst as in sklearn/doc/whatsnew.rst rather than the Oxford dictionary definition. :)

@vstolbunov
Contributor Author

Oh yeah I'm working on it now, I thought he was actually wondering why there was a new commit haha.

@amueller
Member

sorry for being terse from time to time ;) that is what I meant.

@vstolbunov
Contributor Author

Looks like @TomDLT added tests with sag into test_logistic.py, so bear with me while I include sag in test_logistic_regression_sample_weights and test_logistic_regressioncv_sample_weights.

@vstolbunov
Contributor Author

Hey @TomDLT does the sag solver support sample weights?

EDIT: my sandbox tests show that it does, so I'm not sure what's going on with the current error we are experiencing here.

File "sklearn/utils/seq_dataset.pyx", line 158, in sklearn.utils.seq_dataset.ArrayDataset.__cinit__ (sklearn/utils/seq_dataset.c:2514)
    def __cinit__(self, np.ndarray[double, ndim=2, mode='c'] X,
ValueError: Buffer dtype mismatch, expected 'double' but got 'long'

EDIT 2: looks like it expects the sample weights to be floats and doesn't like them being ints (which is the case in the test)

EDIT 3: fixed the error, but now not sure why the coefficients don't match up (failed test) when sag is used

EDIT 4: all done, just had to decrease the tolerance so sag could converge
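The fix from EDIT 2 amounts to casting the integer weights to floats before fitting. A minimal sketch (made-up data; assumes a current scikit-learn where sag accepts sample weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# y is an integer array, so y + 1 is integer-typed; the sag code
# path expected double-precision weights, hence the explicit cast.
sample_weight = (y + 1).astype(np.float64)

clf = LogisticRegression(solver='sag', tol=1e-3, max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```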

@vstolbunov vstolbunov force-pushed the lr-sampleweights branch 3 times, most recently from 7c7179d to 910ca83 Compare September 12, 2015 19:52
@vstolbunov
Contributor Author

All done. Also updated whatsnew. Changing PR name as well since now 3/4 solvers can use the sample_weight feature.

@vstolbunov vstolbunov changed the title [MRG+1] Enabling sample weights for 2 of the 3 logistic regression solvers [MRG+1] Enabling sample weights for 3 of the 4 logistic regression solvers Sep 12, 2015
tol = 1e-7
clf_sw_lbfgs = LogisticRegression(solver='lbfgs', fit_intercept=False,
                                  tol=tol)
clf_sw_lbfgs.fit(X, y, sample_weight=y+1)
Member


y+1 -> y + 1

@@ -570,6 +570,53 @@ def test_logistic_regressioncv_class_weights():
assert_array_almost_equal(clf_lib.coef_, clf_sag.coef_, decimal=4)


def test_logistic_regression_sample_weights():
    X, y = make_classification(n_samples=20, n_features=5, n_informative=3,
                               n_classes=3, random_state=0)
Member


I am not sure that n_classes=3 adds anything interesting in this test.
n_classes=2 would reduce the run-time, which is still quite long (0.76 to 0.29 sec).

Contributor Author


Sure, changed. That actually eliminates the need for two for loops.

@TomDLT
Member

TomDLT commented Sep 14, 2015

This looks good to me.
Thanks @vstolbunov !

@vstolbunov
Contributor Author

Let's get this merged?

@MechCoder
Member

just did, thanks.

@MechCoder MechCoder closed this Sep 15, 2015
@amueller
Member

I really recommend using flake8 instead of pep8, as it also checks for unused or undefined variables, etc.
