-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Liblinear Sample Weights #2784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Liblinear Sample Weights #2784
Conversation
I see that one of the sklearn.ensemble tests is failing with this PR. I didn't change any files in that package but they pass on the master branch. Sorry about that, I'll re-open the PR when I figure out what I broke. |
So are you not working from the liblinear fork that supports sample On 22 January 2014 07:37, Gabe Schwartz notifications@github.com wrote:
|
I'm using the version with weights as provided here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances I'm looking through the changes to linear.cpp now to see what's going on with it, my guess is that I undid one of the sklearn patches by accident. |
Right. You said LibSVM in your original post, but I think you meant Thanks for working on this! On 22 January 2014 08:04, Gabe Schwartz notifications@github.com wrote:
|
Btw, once you get this working on LinearSVC, you should patch it into (and On 22 January 2014 08:11, Joel Nothman jnothman@student.usyd.edu.au wrote:
|
I found the bug that was making tests fail: it turns out that LogisticRegression already was using LinearSVC because it inherited from BaseLibLinear at some point. The BaggingClassifier tests w/LogisticRegression exposed a case where if you have a class with all instances' sample_weight=0, the internal LibLinear model and the sklearn class will disagree on how many classes the estimator was trained on. This is because the patched LibLinear/LibSVM both remove any training samples with 0 weight. I believe I've fixed the issue, and I added new test cases to cover both the new features and the old bug (it affected SVC as well, though no test cases covered it). I reopened the PR since the tests pass on my machine in a virtualenv w/numpy-1.7.1, scipy-0.13.2, MKL BLAS. |
|
||
Modified 2014: | ||
|
||
- Patched to add support for instance weights, Gabriel Schwartz. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth citing the liblinear fork
Thanks for this! It looks good to me, but I'd like to hear from someone who's worked with liblinear code before: @amueller, @fabianp? I'd also be interested to see a quick benchmark of the performance change with uniform weights. I think it might be nice if there were a page in the narrative docs that lists estimators (and metrics?) that support sample weight, but that should probably be a separate PR. Please add an entry to |
And perhaps And my quick benchmarks of the PR didn't show discernable difference. |
This is certainly subjective, but I think the plot with SVC's default RBF kernel looks more interesting. Sample weights can't change the shape of the LinearSVC boundary (since it's linear), so they just wind up shifting the colors a bit. |
Here are three runs of
PS: the test failure is a joblib heisenbug. |
I was looking through this so I can keep it up-to-date w/master and I noticed that while the tests do pass, a new warning is generated that isn't generated in master:
This comes from sklearn/ensemble/tests/test_bagging.py:164 when testing LogisticRegression (which uses liblinear) with missing classes. I'm not properly familiar with the C internals of liblinear, so I'd second the idea that it would be good for someone else to take a look if they have time. |
@jnothman Can you please tell what is left to do in this PR? |
I can't recall there being anything substantial to do. As I said above, I was hoping for another review, if not mere confirmation that this is a desired feature. |
@jnothman yes it would be useful, more than anything else, it would help to remove this Error Raise, https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L355 ie. allow class_weight of type dict for multiclass problems. |
@jnothman If you want me to have look, I can but I doubt I would able to do much more over your reviews. :) |
Versions of LibSVM and LibLinear that support sample/instance weights will remove all samples with 0 weight before training. If all samples of some classes have 0 weight, the internal LibXXX model will think it has that many less classes than the number provided to sklearn. This caused issues with, for example, predict_proba returning probabilities with the wrong shape in the last dimension.
Tests now cover cases involving the wrong sample_weight shape and classes whose samples have all 0 weight.
Since it's been awhile, I rebased against master. It looks like something broke between my last update and now, I'm going to take a look at it. |
I think it must have something to do with the changes due to the new Logistic Regression CV. |
LogisticRegression + LogisticRegressionCV looks like they expect the classes_ attribute to exist before fit() is called. (It didn't before either, but you could set an enc object which would have its classes attribute returned when you requested it from BaseLibLinear.) |
Patches LibLinear sources to liblinear-1.94 with support for instance weights.
Scaled sample weights on a linearly-separable dataset cannot cause an SVM with a linear kernel to misclassify the down-weighted points. The modified test case changes the linearly-separable dataset to contain a non-separable outlier. This outlier should be misclassified unless sample weights are in fact taken into account.
BaseLibLinear and derived classes no longer have a "classes_" attribute until fit() has been called.
I also just saw that a new CI build was started every time I updated my local branch. Sorry about that... I'll make sure to only push to my branch at important intervals in the future. |
I'd need to rebase my branch to fix the merge conflicts, but perhaps it would be better to close this PR for the time being until there is more interest in LibLinear+sample weights? I won't have a chance to make the required updates until a few weeks from now either way. |
I started working on this, but it's true that this merge is absolutely not straight forward. I don't want to open a new PR right now since I still have 90 tests failing (probably just 4-5 root causes) that I need to fix before. You can find my WIP at https://github.com/liquidm/scikit-learn/tree/liblinear_weights. |
|
The model is now that LinearSVC and LogisticRegression call a shared fit On 20 October 2014 20:58, Lars Buitinck notifications@github.com wrote:
|
LogisticRegression and sample_weights I am trying to work on this and have come to a problem.
-- to summarise, liblinear/libsvm and the wrapping scikit learn class end up having different number of target classes, causing eg prob calculations to fail. HapeMasks solution for SVC was A) HapeMask@7f497ef This doesn't work so well for LogisticRegression (and CV) I have been pursuing option a) but wanted to make sure this would be the accepted solution. |
Can you tell me the status of this ? Has there been any progress to update the existing LinearSVC api to add sample weights |
Hi Charles, so I am no longer actively working on this (changed company) - but I would be happy to complete. As I mentioned in my previous message, Hapemasks solution works for LinearSVC. The problem is that AFAIK, the way that the liblinear code has been turned into a function (from a class) breaks the 'design' of sklearn. ( where do all the supplementary functions go for converting labels to classes etc). So if I got some feedback from the principal developers on the approved solution I would be happy to finish. Essentially the code works in a standard setting, but the baggingtest throws up a special case [when the number of classes input specified via scikit-learn class is different from the number of 'effective' classes in liblinear (classes with whoses instances have nonzero sample weight)] |
Do you mind if I pick this up? |
@MechCoder, fine by me |
As discussed in #409, I have patched the included liblinear sources with the update to scale C for each sample individually.
I mostly just copied the libsvm python wrapper to expose the new parameter. The existing tests didn't fully cover the LinearSVC+sample_weights case, so I modified them.
I'm admittedly not familiar with the liblinear internals, but the patching was straightforward and preserved the existing sklearn modifications. The previous SVM tests still pass as well.