Liblinear Sample Weights #2784

Closed
HapeMask wants to merge 6 commits into scikit-learn:master from HapeMask:liblinear_weights

Conversation

HapeMask

As discussed in #409, I have patched the included liblinear sources with the update to scale C for each sample individually.

I mostly just copied the libsvm python wrapper to expose the new parameter. The existing tests didn't fully cover the LinearSVC+sample_weights case, so I modified them.

I'm admittedly not familiar with the liblinear internals, but the patching was straightforward and preserved the existing sklearn modifications. The previous SVM tests still pass as well.
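For reference, a minimal usage sketch of what this enables, assuming the patched fit() takes a sample_weight array so that each sample's effective cost becomes C * sample_weight[i] (the data and weights below are just illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 1.0, 10.0])  # the last sample counts 10x as much

clf = LinearSVC(C=1.0)
clf.fit(X, y, sample_weight=w)       # per-sample cost: C * w[i]
print(clf.coef_, clf.intercept_)
```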

@HapeMask
Author

I see that one of the sklearn.ensemble tests is failing with this PR. I didn't change any files in that package, and the same tests pass on the master branch. Sorry about that, I'll re-open the PR when I figure out what I broke.

@HapeMask HapeMask closed this Jan 21, 2014
@jnothman
Member

So are you not working from the liblinear fork that supports sample weights? Maybe you should compare your implementation to it.

@HapeMask
Author

I'm using the version with weights as provided here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances

I'm looking through the changes to linear.cpp now to see what's going on with it; my guess is that I undid one of the sklearn patches by accident.

@jnothman
Member

Right. You said LibSVM in your original post, but I think you meant liblinear.

Thanks for working on this!

@jnothman
Member

Btw, once you get this working on LinearSVC, you should patch it into (and test) sklearn.linear_model.LogisticRegression.

@HapeMask HapeMask reopened this Jan 22, 2014
@HapeMask
Author

I found the bug that was making tests fail: it turns out that LogisticRegression was already using the same liblinear path as LinearSVC, since it inherited from BaseLibLinear at some point.

The BaggingClassifier tests with LogisticRegression exposed a case where, if you have a class with all instances' sample_weight=0, the internal LibLinear model and the sklearn class will disagree on how many classes the estimator was trained on. This is because the patched LibLinear/LibSVM both remove any training samples with 0 weight.

I believe I've fixed the issue, and I added new test cases to cover both the new features and the old bug (it affected SVC as well, though no test cases covered it). I reopened the PR since the tests pass on my machine in a virtualenv with numpy 1.7.1, scipy 0.13.2, and MKL BLAS.
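A hedged sketch of that failure mode (estimator, labels, and shapes here are illustrative, not the actual test): if every sample of one class has zero weight, the patched liblinear drops those samples and trains on fewer classes than sklearn's classes_ reports, so predict_proba can come back with the wrong trailing dimension. The invariant the fix has to preserve is:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
w = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])  # class 2 carries no weight

clf = LogisticRegression().fit(X, y, sample_weight=w)
proba = clf.predict_proba(X)

# sklearn still reports three classes; the probability matrix must agree.
assert proba.shape[1] == len(clf.classes_) == 3
```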

@coveralls

Coverage Status

Coverage remained the same when pulling 7258ecf on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 79124ec on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling bc233c6 on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.


Modified 2014:

- Patched to add support for instance weights, Gabriel Schwartz.
Member

It might be worth citing the liblinear fork

@jnothman
Member

Thanks for this! It looks good to me, but I'd like to hear from someone who's worked with liblinear code before: @amueller, @fabianp?

I'd also be interested to see a quick benchmark of the performance change with uniform weights.

I think it might be nice if there were a page in the narrative docs that lists estimators (and metrics?) that support sample weight, but that should probably be a separate PR.

Please add an entry to doc/whats_new.rst.

@jnothman
Member

And perhaps examples/svm/plot_weighted_samples.py should use LinearSVC rather than SVC now.

And my quick benchmarks of the PR didn't show a discernible difference.

@HapeMask
Author

HapeMask commented Feb 7, 2014

And perhaps examples/svm/plot_weighted_samples.py should use LinearSVC rather than SVC now.

This is certainly subjective, but I think the plot with SVC's default RBF kernel looks more interesting. Sample weights can't change the shape of the LinearSVC boundary (since it's linear), so they just wind up shifting the colors a bit.
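A small sketch of that point (dataset and weights below are made up): with a linear kernel the weighted decision boundary is still a single hyperplane, so weights can only shift or tilt it, whereas the RBF boundary can bend around heavily weighted points.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = np.ones(len(y))
w[:5] *= 15  # emphasise a few points

linear = SVC(kernel="linear").fit(X, y, sample_weight=w)
rbf = SVC(kernel="rbf").fit(X, y, sample_weight=w)

# The linear model is still fully described by one hyperplane...
print(linear.coef_, linear.intercept_)
# ...while the RBF boundary is shaped by its (weighted) support vectors.
print(rbf.n_support_)
```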

@coveralls

Coverage Status

Coverage remained the same when pulling 366d4d5 on HapeMask:liblinear_weights into e20aebe on scikit-learn:master.

@jnothman
Member

Here are three runs of bench_covtype with and without this PR:

Classifier   train-time test-time error-rate
--------------------------------------------
at master
liblinear     6.4890s    0.0297s     0.2305
liblinear     6.1431s    0.0290s     0.2305
liblinear     7.9698s    0.0288s     0.2305
with this pr
liblinear     6.9556s    0.0315s     0.2305
liblinear     7.2022s    0.0318s     0.2305
liblinear     6.2474s    0.0351s     0.2305

PS: the test failure is a joblib heisenbug.

@HapeMask
Author

HapeMask commented Apr 4, 2014

I was looking through this so I can keep it up to date with master, and I noticed that while the tests do pass, a new warning is generated that isn't generated on master:

WARNING: class label 1 specified in weight is not found

This comes from sklearn/ensemble/tests/test_bagging.py:164 when testing LogisticRegression (which uses liblinear) with missing classes.

I'm not properly familiar with the C internals of liblinear, so I'd second the idea that it would be good for someone else to take a look if they have time.

@MechCoder
Member

@jnothman Can you please tell me what is left to do in this PR?

@jnothman
Member

I can't recall there being anything substantial to do. As I said above, I was hoping for another review, if not mere confirmation that this is a desired feature.

@MechCoder
Member

@jnothman yes, it would be useful. More than anything else, it would help to remove this error raise: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L355, i.e. allow class_weight of type dict for multiclass problems.
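For context, a hedged sketch of what lifting that restriction would allow (dataset and weights here are arbitrary; at the time, logistic.py raised for a dict class_weight on more than two classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A per-class weight dict on a 3-class problem, which can be implemented by
# scaling each sample's weight by the weight of its class.
clf = LogisticRegression(class_weight={0: 1.0, 1: 2.0, 2: 0.5}, max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```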

@MechCoder
Member

@jnothman If you want me to have a look, I can, but I doubt I would be able to add much beyond your reviews. :)

HapeMask added 2 commits July 29, 2014 11:15
Versions of LibSVM and LibLinear that support sample/instance weights
will remove all samples with 0 weight before training. If all samples of
some classes have 0 weight, the internal LibXXX model will think it has that many fewer classes than the number provided to sklearn.

This caused issues with, for example, predict_proba returning
probabilities with the wrong shape in the last dimension.
Tests now cover cases involving the wrong sample_weight shape and classes whose samples all have 0 weight.
@HapeMask
Author

Since it's been a while, I rebased against master. It looks like something broke between my last update and now; I'm going to take a look at it.

@MechCoder
Member

I think it must have something to do with the changes introduced by the new LogisticRegressionCV.

@HapeMask
Author

It looks like LogisticRegression and LogisticRegressionCV expect the classes_ attribute to exist before fit() is called.

(It didn't exist before fit previously either, but you could set an enc object whose classes attribute would be returned when you requested classes_ from BaseLibLinear.)

HapeMask added 3 commits July 29, 2014 15:10
Patches LibLinear sources to liblinear-1.94 with support for instance
weights.
Scaled sample weights on a linearly-separable dataset cannot cause an
SVM with a linear kernel to misclassify the down-weighted points.

The modified test case changes the linearly-separable dataset to contain
a non-separable outlier. This outlier should be misclassified unless
sample weights are in fact taken into account.
BaseLibLinear and derived classes no longer have a "classes_" attribute
until fit() has been called.
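A hedged sketch of the test idea from the non-separable-outlier commit above (the data, weights, and expected labels are illustrative, not the actual test): an outlier that an unweighted linear SVM misclassifies should be recovered once it is given a large sample weight.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two separable clusters plus one outlier that sits inside the class-0
# cluster but is labelled 1, so the dataset is no longer separable.
X = np.array([[-2.0, 1.0], [-2.0, -1.0], [-1.0, 1.0], [-1.0, -1.0],  # class 0
              [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [2.0, -1.0],      # class 1
              [-1.2, 0.0]])                                          # outlier
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

unweighted = LinearSVC(C=1.0).fit(X, y)

w = np.ones(len(y))
w[-1] = 100.0  # up-weight the outlier heavily
weighted = LinearSVC(C=1.0).fit(X, y, sample_weight=w)

print(unweighted.predict([[-1.2, 0.0]]))  # expected [0]: outlier misclassified
print(weighted.predict([[-1.2, 0.0]]))    # expected [1]: weights were honoured
```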
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 3818dfc on HapeMask:liblinear_weights into cbcae04 on scikit-learn:master.

@HapeMask
Author

I also just saw that a new CI build was started every time I updated my local branch. Sorry about that... I'll make sure to only push to my branch at important intervals in the future.

@calmez

calmez commented Oct 17, 2014

@HapeMask @jnothman What is the current status on this?

@HapeMask
Author

I'd need to rebase my branch to fix the merge conflicts, but perhaps it would be better to close this PR for the time being until there is more interest in LibLinear+sample weights? I won't have a chance to make the required updates until a few weeks from now either way.

@calmez

calmez commented Oct 20, 2014

I started working on this, but it's true that this merge is absolutely not straightforward. I don't want to open a new PR right now since I still have 90 tests failing (probably just 4-5 root causes) that I need to fix first. You can find my WIP at https://github.com/liquidm/scikit-learn/tree/liblinear_weights.
I was wondering why there is no longer a base class for liblinear in svm/base.py on current master, as there is for libsvm. Was there any particular reason to move the behavior to functions?

@larsmans
Member

LogisticRegression was decoupled from Liblinear because it now has multiple solvers. It doesn't really make sense to derive it from a Liblinear base class anymore.

@jnothman
Member

The model is now that LinearSVC and LogisticRegression call a shared fit function.

@seanv507

seanv507 commented Dec 4, 2014

LogisticRegression and sample_weights

I am trying to work on this and have run into a problem; see HapeMask above:

The BaggingClassifier tests w/LogisticRegression exposed a case where if you have a class with all instances' sample_weight=0, the internal LibLinear model and the sklearn class will disagree on how many classes the estimator was trained on. This is because the patched LibLinear/LibSVM both remove any training samples with 0 weight.

To summarise: liblinear/libsvm and the wrapping scikit-learn class end up with different numbers of target classes, causing e.g. probability calculations to fail.
We have two potential solutions, either
A) make the scikit-learn objects consistent with liblinear (by removing such empty classes), or
B) somehow expand the output of liblinear/libsvm to add these dummy classes back in.

HapeMask's solution for SVC was A) (HapeMask@7f497ef), namely removing any classes with sample weight 0 from LinearSVC.

This doesn't work so well for LogisticRegression (and LogisticRegressionCV). Firstly, the liblinear fit is now a function (so it doesn't have access to class members); secondly, there are multiple solvers. So what should one aim to do?
a) just fix liblinear logistic regression by removing any such dummy classes in the LogisticRegression class,
b) remove classes with sample_weight=0 (in the sklearn class) for ALL solvers (not just liblinear), or
c) change the liblinear raw output to re-add the empty class [haven't investigated this]?

I have been pursuing option a) (sketched below), but wanted to make sure this would be the accepted solution.
Basically, I am keen to have liblinear + sample weights in official scikit-learn.
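For what it's worth, a hedged sketch of the preprocessing step option A/a) implies (the helper name and where it would be called from are hypothetical): drop samples, and hence classes, whose total weight is zero before handing the data to liblinear, and keep the surviving class labels around so predictions and probabilities can be mapped back.

```python
import numpy as np

def drop_zero_weight_classes(X, y, sample_weight):
    """Drop samples belonging to classes whose total weight is zero.

    Returns the filtered X, y, sample_weight and the surviving class
    labels, so the wrapper and liblinear agree on the number of classes.
    """
    classes = np.unique(y)
    kept = np.array([c for c in classes if sample_weight[y == c].sum() > 0])
    mask = np.isin(y, kept)
    return X[mask], y[mask], sample_weight[mask], kept
```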

@charlesmartin14

Can you tell me the status of this? Has there been any progress on updating the existing LinearSVC API to add sample weights?

@sean-violante-lendico

Hi Charles, so I am no longer actively working on this (changed company) - but I would be happy to complete it. As I mentioned in my previous message, HapeMask's solution works for LinearSVC. The problem is that, AFAIK, the way the liblinear code has been turned into a function (from a class) breaks the 'design' of sklearn (where do all the supplementary functions for converting labels to classes etc. go?). So if I got some feedback from the principal developers on the approved solution, I would be happy to finish.

Essentially the code works in a standard setting, but the bagging test throws up a special case [when the number of classes specified via the scikit-learn class is different from the number of 'effective' classes in liblinear (classes whose instances have nonzero sample weight)].

@MechCoder
Member

Do you mind if I pick this up?

cc @GaelVaroquaux

@sean-violante-lendico

@MechCoder, fine by me

@jnothman
Member

Thanks @HapeMask for your effort. C support for sample weights was provided in #5274, and while the public interface is still being patched, I think this can be closed. Sorry it didn't end up getting merged directly.

@jnothman jnothman closed this Jun 23, 2016