Liblinear Sample Weights #2784

Closed
HapeMask wants to merge 6 commits into scikit-learn:master from HapeMask:liblinear_weights

Conversation

HapeMask

As discussed in #409, I have patched the included liblinear sources with the update to scale C for each sample individually.

I mostly just copied the libsvm python wrapper to expose the new parameter. The existing tests didn't fully cover the LinearSVC+sample_weights case, so I modified them.

I'm admittedly not familiar with the liblinear internals, but the patching was straightforward and preserved the existing sklearn modifications. The previous SVM tests still pass as well.
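For reference, a minimal usage sketch of what this enables, assuming the patched fit() takes a sample_weight array so that each sample's effective cost becomes C * sample_weight[i] (the data and weights below are just illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 1.0, 10.0])  # the last sample counts 10x as much

clf = LinearSVC(C=1.0)
clf.fit(X, y, sample_weight=w)       # per-sample cost: C * w[i]
print(clf.coef_, clf.intercept_)
```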

@HapeMask
Author

I see that one of the sklearn.ensemble tests is failing with this PR. I didn't change any files in that package, and the same tests pass on the master branch. Sorry about that, I'll re-open the PR when I figure out what I broke.

@HapeMask HapeMask closed this Jan 21, 2014
@jnothman
Member

So are you not working from the liblinear fork that supports sample weights? Maybe you should compare your implementation to it.

@HapeMask
Author

I'm using the version with weights as provided here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances

I'm looking through the changes to linear.cpp now to see what's going on with it; my guess is that I undid one of the sklearn patches by accident.

@jnothman
Member

Right. You said LibSVM in your original post, but I think you meant liblinear.

Thanks for working on this!

@jnothman
Member

Btw, once you get this working on LinearSVC, you should patch it into (and test) sklearn.linear_model.LogisticRegression.

@HapeMask HapeMask reopened this Jan 22, 2014
@HapeMask
Author

I found the bug that was making tests fail: it turns out that LogisticRegression was already using the same liblinear path as LinearSVC, since it inherited from BaseLibLinear at some point.

The BaggingClassifier tests with LogisticRegression exposed a case where, if you have a class with all instances' sample_weight=0, the internal LibLinear model and the sklearn class will disagree on how many classes the estimator was trained on. This is because the patched LibLinear/LibSVM both remove any training samples with 0 weight.

I believe I've fixed the issue, and I added new test cases to cover both the new features and the old bug (it affected SVC as well, though no test cases covered it). I reopened the PR since the tests pass on my machine in a virtualenv with numpy 1.7.1, scipy 0.13.2, and MKL BLAS.
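A hedged sketch of that failure mode (estimator, labels, and shapes here are illustrative, not the actual test): if every sample of one class has zero weight, the patched liblinear drops those samples and trains on fewer classes than sklearn's classes_ reports, so predict_proba can come back with the wrong trailing dimension. The invariant the fix has to preserve is:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
w = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])  # class 2 carries no weight

clf = LogisticRegression().fit(X, y, sample_weight=w)
proba = clf.predict_proba(X)

# sklearn still reports three classes; the probability matrix must agree.
assert proba.shape[1] == len(clf.classes_) == 3
```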

@coveralls

Coverage Status

Coverage remained the same when pulling 7258ecf on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 79124ec on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling bc233c6 on HapeMask:liblinear_weights into f8e7cf1 on scikit-learn:master.


Modified 2014:

- Patched to add support for instance weights, Gabriel Schwartz.
Member

It might be worth citing the liblinear fork

@jnothman
Member

Thanks for this! It looks good to me, but I'd like to hear from someone who's worked with liblinear code before: @amueller, @fabianp?

I'd also be interested to see a quick benchmark of the performance change with uniform weights.

I think it might be nice if there were a page in the narrative docs that lists estimators (and metrics?) that support sample weight, but that should probably be a separate PR.

Please add an entry to doc/whats_new.rst.

@jnothman
Member

And perhaps examples/svm/plot_weighted_samples.py should use LinearSVC rather than SVC now.

And my quick benchmarks of the PR didn't show a discernible difference.

@HapeMask
Author

HapeMask commented Feb 7, 2014

And perhaps examples/svm/plot_weighted_samples.py should use LinearSVC rather than SVC now.

This is certainly subjective, but I think the plot with SVC's default RBF kernel looks more interesting. Sample weights can't change the shape of the LinearSVC boundary (since it's linear), so they just wind up shifting the colors a bit.
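A small sketch of that point (dataset and weights below are made up): with a linear kernel the weighted decision boundary is still a single hyperplane, so weights can only shift or tilt it, whereas the RBF boundary can bend around heavily weighted points.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = np.ones(len(y))
w[:5] *= 15  # emphasise a few points

linear = SVC(kernel="linear").fit(X, y, sample_weight=w)
rbf = SVC(kernel="rbf").fit(X, y, sample_weight=w)

# The linear model is still fully described by one hyperplane...
print(linear.coef_, linear.intercept_)
# ...while the RBF boundary is shaped by its (weighted) support vectors.
print(rbf.n_support_)
```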

@coveralls

Coverage Status

Coverage remained the same when pulling 366d4d5 on HapeMask:liblinear_weights into e20aebe on scikit-learn:master.

@jnothman
Member

Here are three runs of bench_covtype with and without this PR:

Classifier   train-time test-time error-rate
--------------------------------------------
at master
liblinear     6.4890s    0.0297s     0.2305
liblinear     6.1431s    0.0290s     0.2305
liblinear     7.9698s    0.0288s     0.2305
with this pr
liblinear     6.9556s    0.0315s     0.2305
liblinear     7.2022s    0.0318s     0.2305
liblinear     6.2474s    0.0351s     0.2305

PS: the test failure is a joblib heisenbug.

@HapeMask
Author

HapeMask commented Apr 4, 2014

I was looking through this so I can keep it up to date with master, and I noticed that while the tests do pass, a new warning is generated that isn't generated on master:

WARNING: class label 1 specified in weight is not found

This comes from sklearn/ensemble/tests/test_bagging.py:164 when testing LogisticRegression (which uses liblinear) with missing classes.

I'm not properly familiar with the C internals of liblinear, so I'd second the idea that it would be good for someone else to take a look if they have time.

@MechCoder
Member

@jnothman Can you please tell me what is left to do in this PR?

@jnothman
Member

I can't recall there being anything substantial to do. As I said above, I was hoping for another review, if not mere confirmation that this is a desired feature.

@MechCoder
Member

@jnothman yes, it would be useful. More than anything else, it would help to remove this error raise: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L355, i.e. allow class_weight of type dict for multiclass problems.
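For context, a hedged sketch of what lifting that restriction would allow (dataset and weights here are arbitrary; at the time, logistic.py raised for a dict class_weight on more than two classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A per-class weight dict on a 3-class problem, which can be implemented by
# scaling each sample's weight by the weight of its class.
clf = LogisticRegression(class_weight={0: 1.0, 1: 2.0, 2: 0.5}, max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```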

@MechCoder
Member

@jnothman If you want me to have a look, I can, but I doubt I would be able to add much beyond your reviews. :)

HapeMask added 2 commits July 29, 2014 11:15
Versions of LibSVM and LibLinear that support sample/instance weights
will remove all samples with 0 weight before training. If all samples of
some classes have 0 weight, the internal LibXXX model will think it has that many fewer classes than the number provided to sklearn.

This caused issues with, for example, predict_proba returning
probabilities with the wrong shape in the last dimension.
Tests now cover cases involving the wrong sample_weight shape and classes whose samples all have 0 weight.
@HapeMask
Author

Since it's been a while, I rebased against master. It looks like something broke between my last update and now; I'm going to take a look at it.

@MechCoder
Member

I think it must have something to do with the changes introduced by the new LogisticRegressionCV.

@HapeMask
Author

It looks like LogisticRegression and LogisticRegressionCV expect the classes_ attribute to exist before fit() is called.

(It didn't exist before fit previously either, but you could set an enc object whose classes attribute would be returned when you requested classes_ from BaseLibLinear.)

HapeMask added 3 commits July 29, 2014 15:10
Patches LibLinear sources to liblinear-1.94 with support for instance
weights.
Scaled sample weights on a linearly-separable dataset cannot cause an
SVM with a linear kernel to misclassify the down-weighted points.

The modified test case changes the linearly-separable dataset to contain
a non-separable outlier. This outlier should be misclassified unless
sample weights are in fact taken into account.
BaseLibLinear and derived classes no longer have a "classes_" attribute
until fit() has been called.
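A hedged sketch of the test idea from the non-separable-outlier commit above (the data, weights, and expected labels are illustrative, not the actual test): an outlier that an unweighted linear SVM misclassifies should be recovered once it is given a large sample weight.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two separable clusters plus one outlier that sits inside the class-0
# cluster but is labelled 1, so the dataset is no longer separable.
X = np.array([[-2.0, 1.0], [-2.0, -1.0], [-1.0, 1.0], [-1.0, -1.0],  # class 0
              [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [2.0, -1.0],      # class 1
              [-1.2, 0.0]])                                          # outlier
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

unweighted = LinearSVC(C=1.0).fit(X, y)

w = np.ones(len(y))
w[-1] = 100.0  # up-weight the outlier heavily
weighted = LinearSVC(C=1.0).fit(X, y, sample_weight=w)

print(unweighted.predict([[-1.2, 0.0]]))  # expected [0]: outlier misclassified
print(weighted.predict([[-1.2, 0.0]]))    # expected [1]: weights were honoured
```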
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 3818dfc on HapeMask:liblinear_weights into cbcae04 on scikit-learn:master.

@HapeMask
Author

I also just saw that a new CI build was started every time I updated my local branch. Sorry about that... I'll make sure to only push to my branch at important intervals in the future.

@calmez

calmez commented Oct 17, 2014

@HapeMask @jnothman What is the current status on this?

@HapeMask
Author

I'd need to rebase my branch to fix the merge conflicts, but perhaps it would be better to close this PR for the time being until there is more interest in LibLinear+sample weights? I won't have a chance to make the required updates until a few weeks from now either way.

@calmez

calmez commented Oct 20, 2014

I started working on this, but it's true that this merge is absolutely not straightforward. I don't want to open a new PR right now since I still have 90 tests failing (probably just 4-5 root causes) that I need to fix first. You can find my WIP at https://github.com/liquidm/scikit-learn/tree/liblinear_weights.
I was wondering why there is no longer a base class for liblinear in svm/base.py on current master, as there is for libsvm. Was there any particular reason to move the behavior to functions?

@larsmans
Member

LogisticRegression was decoupled from Liblinear because it now has multiple solvers. It doesn't really make sense to derive it from a Liblinear base class anymore.

@jnothman
Member

The model is now that LinearSVC and LogisticRegression call a shared fit function.

@seanv507

seanv507 commented Dec 4, 2014

LogisticRegression and sample_weights

I am trying to work on this and have run into a problem; see HapeMask above:

The BaggingClassifier tests w/LogisticRegression exposed a case where if you have a class with all instances' sample_weight=0, the internal LibLinear model and the sklearn class will disagree on how many classes the estimator was trained on. This is because the patched LibLinear/LibSVM both remove any training samples with 0 weight.

To summarise: liblinear/libsvm and the wrapping scikit-learn class end up with different numbers of target classes, causing e.g. probability calculations to fail.
We have two potential solutions, either
A) make the scikit-learn objects consistent with liblinear (by removing such empty classes), or
B) somehow expand the output of liblinear/libsvm to add these dummy classes back in.

HapeMask's solution for SVC was A) (HapeMask@7f497ef), namely removing any classes with sample weight 0 from LinearSVC.

This doesn't work so well for LogisticRegression (and LogisticRegressionCV). Firstly, the liblinear fit is now a function (so it doesn't have access to class members); secondly, there are multiple solvers. So what should one aim to do?
a) just fix liblinear logistic regression by removing any such dummy classes in the LogisticRegression class,
b) remove classes with sample_weight=0 (in the sklearn class) for ALL solvers (not just liblinear), or
c) change the liblinear raw output to re-add the empty class [haven't investigated this]?

I have been pursuing option a) (sketched below), but wanted to make sure this would be the accepted solution.
Basically, I am keen to have liblinear + sample weights in official scikit-learn.
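For what it's worth, a hedged sketch of the preprocessing step option A/a) implies (the helper name and where it would be called from are hypothetical): drop samples, and hence classes, whose total weight is zero before handing the data to liblinear, and keep the surviving class labels around so predictions and probabilities can be mapped back.

```python
import numpy as np

def drop_zero_weight_classes(X, y, sample_weight):
    """Drop samples belonging to classes whose total weight is zero.

    Returns the filtered X, y, sample_weight and the surviving class
    labels, so the wrapper and liblinear agree on the number of classes.
    """
    classes = np.unique(y)
    kept = np.array([c for c in classes if sample_weight[y == c].sum() > 0])
    mask = np.isin(y, kept)
    return X[mask], y[mask], sample_weight[mask], kept
```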

@charlesmartin14

Can you tell me the status of this? Has there been any progress on updating the existing LinearSVC API to add sample weights?

@sean-violante-lendico

Hi Charles, so I am no longer actively working on this (changed company) - but I would be happy to complete it. As I mentioned in my previous message, HapeMask's solution works for LinearSVC. The problem is that, AFAIK, the way the liblinear code has been turned into a function (from a class) breaks the 'design' of sklearn (where do all the supplementary functions for converting labels to classes etc. go?). So if I got some feedback from the principal developers on the approved solution, I would be happy to finish.

Essentially the code works in a standard setting, but the bagging test throws up a special case [when the number of classes specified via the scikit-learn class is different from the number of 'effective' classes in liblinear (classes whose instances have nonzero sample weight)].

@MechCoder
Member

Do you mind if I pick this up?

cc @GaelVaroquaux

@sean-violante-lendico

@MechCoder, fine by me

@jnothman
Member

Thanks @HapeMask for your effort. C support for sample weights was provided in #5274, and while the public interface is still being patched, I think this can be closed. Sorry it didn't end up getting merged directly.

@jnothman jnothman closed this Jun 23, 2016