ENH Support sample weights in HGBT #14696
Conversation
Needs more tests, but already passes some edge case tests. @NicolasHug wanna have a look?
Thanks Adrin! Made a first pass.
Apart from the minor hessians_are_constant issue I think this looks pretty good now; we just need some tests. Some basic ones I can think of are
Oh and I just realized, we need the PDPs to take SW into account now ^^
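For concreteness, here is a minimal sketch of one such equivalence check (illustrative only; the exact tests discussed above are not listed in the thread, and the sample_weight argument assumes the API added by this PR): a weight of 1 for every sample should be equivalent to passing no weights at all.

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)

# fit once without weights and once with unit weights
clf_no_sw = HistGradientBoostingClassifier(random_state=0).fit(X, y)
clf_unit_sw = HistGradientBoostingClassifier(random_state=0).fit(
    X, y, sample_weight=np.ones(X.shape[0]))

# the two models should make identical predictions
assert np.allclose(clf_no_sw.predict_proba(X), clf_unit_sw.predict_proba(X))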
This test would fail, and I'm trying to figure out why:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

model = HistGradientBoostingClassifier(max_iter=2, max_depth=2,
                                       random_state=42,
                                       validation_fraction=None)
X, y = make_classification(n_classes=2, flip_y=.3, n_redundant=10,
                           random_state=41)
n_samples = X.shape[0]

# duplicate the first half of the samples ...
X_ = np.r_[X, X[:n_samples // 2, :]]
y_ = np.r_[y, y[:n_samples // 2]]
# ... or, equivalently, give the first half a weight of 2
sample_weight = np.ones(n_samples)
sample_weight[:n_samples // 2] = 2

no_dup_no_sw = model.fit(X, y).predict(X)
dup_no_sw = model.fit(X_, y_).predict(X)
no_dup_sw = model.fit(X, y, sample_weight=sample_weight).predict(X)

# fitting with weights of 2 should match fitting on the duplicated samples
print(np.all(dup_no_sw == no_dup_sw))
# reference: plain unweighted fit vs. fit on duplicated samples
print(np.all(no_dup_no_sw == dup_no_sw))
sample_weight = np.ones(shape=(n_samples))
sample_weight[:n_samples // 2] = 2
This seems weird to me. Instead, should it be:
sample_weight = np.zeros(shape=(n_samples,))
sample_weight[:n_samples // 2] = 1
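That would then probe a slightly different (and arguably simpler) equivalence, sketched below reusing model, X, y and n_samples from the reproducer above (again assuming the fit(X, y, sample_weight=...) API from this PR): zero-weighted samples should behave as if they had been dropped entirely.

sample_weight = np.zeros(n_samples)
sample_weight[:n_samples // 2] = 1

# a fit where the second half of the samples has weight 0 ...
sw_pred = model.fit(X, y, sample_weight=sample_weight).predict(X)
# ... should match a fit on the first half only
drop_pred = model.fit(X[:n_samples // 2], y[:n_samples // 2]).predict(X)
print(np.all(sw_pred == drop_pred))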
I would also be happy with removing support for weights in the binning (wrong but still good enough), merging the PR, and dealing with that later. The estimators are still experimental, so that's fine. I'd like to get this PR merged so we can focus on the categorical support, which I'd really like to get in for 0.23. If we want to release in May, we need to start soon.
I'd love to move forward with this so we can get started on the categorical variables. So I see three options:
Honestly I'm happy with any of these. Option 3 adds a bit more complexity, but it's not terrible, and I'd rather move forward so we can do categorical variables for 0.23.
@ogrisel did you have a chance to look into this and the weighted percentiles?
I believe @ogrisel was OK with my proposal according to #14696 (comment). If that helps @adrinjalali, you can consider resampling with replacement as a general framework for handling sample_weight. This is applicable to any estimator, not just to the binner. Again, I'm fine with reverting to the previous state where we don't handle sample_weight in the binner. I'm happy to help moving this forward if you need.
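For illustration, here is a minimal sketch of that resampling idea (my own sketch, not code from the PR, and assuming non-negative weights): draw rows with replacement with probability proportional to sample_weight, then fit without weights on the resampled data.

import numpy as np

def resample_by_weight(X, y, sample_weight, random_state=0):
    # Draw n_samples rows with replacement, each row picked with
    # probability proportional to its weight; an unweighted fit on the
    # result approximates a weighted fit on the original data.
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()
    indices = rng.choice(n_samples, size=n_samples, replace=True, p=p)
    return X[indices], y[indices]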
Thanks @adrinjalali
@ogrisel the SW support for binning has been reverted (like in LightGBM). The PR is in a solid state and I think reviewing should be reasonably easy now
doc/whats_new/v0.22.rst
- |Feature| Estimators now support :term:`sample_weight`. :pr:`14696` by
  `Adrin Jalali`_ and `Nicolas Hug`_.
This should be in 0.23 now.
Moved to 0.23
if getattr(self, '_fitted_with_sw', False):
    raise NotImplementedError("{} does not support partial dependence"
                              " plots when sample weights were given "
"with the 'recursion' method"
(for some reason I can't make suggestions on github)
Addressed comment
Maybe @thomasjpfan @glemaitre can also give it a second look
Awesomeee!
>>> gb.predict([[1, 0]])
array([1])
As you can see, the `[1, 0]` is comfortably classified as `1` since the first
Would the probability show more comfort?
gb.predict_proba([[1, 0]])[0, 1]
# 0.99...
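For context, the quoted doctest appears to come from a toy example along the following lines; the exact data is not shown in this thread, so treat the values below as a reconstruction rather than the actual documentation example. Samples with weight 0 are effectively ignored, so the remaining weight-1 sample with features [1, 0] and label 1 dominates the prediction.

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X = [[1, 0], [1, 0], [1, 0], [0, 1]]
y = [0, 0, 1, 0]
# the two conflicting [1, 0] -> 0 samples are ignored through a weight of 0
sample_weight = [0, 0, 1, 1]

gb = HistGradientBoostingClassifier(min_samples_leaf=1)
gb.fit(X, y, sample_weight=sample_weight)
print(gb.predict([[1, 0]]))              # array([1]), as quoted above
print(gb.predict_proba([[1, 0]])[0, 1])  # close to 1, per the comment above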
# gradient = sign(raw_prediction - y_true) * sample_weight
gradients[i] = sample_weight[i] * (2 *
                                   (y_true[i] - raw_predictions[i] < 0) - 1)
hessians[i] = sample_weight[i]
Does this work because sample_weight is non-negative? If so, let's leave a comment?
Does this work because sample_weight is non-negative?
No, this works because:
- accounting for SW means we need to multiply gradients and hessians by SW
- without SW, the hessian of this loss is constant and is equal to 1.
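To make that concrete, here is a NumPy rendition of what the quoted loop computes for the least-absolute-deviation loss (an illustrative sketch, not the actual Cython implementation):

import numpy as np

def lad_gradients_hessians(y_true, raw_predictions, sample_weight):
    # gradient of sample_weight * |y_true - raw| w.r.t. raw is
    # sample_weight * sign(raw - y_true)
    sign = np.where(y_true - raw_predictions < 0, 1.0, -1.0)
    gradients = sample_weight * sign
    # the unweighted hessian is the constant 1, so with weights it is
    # simply sample_weight
    hessians = np.asarray(sample_weight, dtype=float).copy()
    return gradients, hessians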
I was thinking about the math. LightGBM does the same thing with the L1 loss. When I see the derivative of sign(x), I think of the Dirac delta function: https://en.wikipedia.org/wiki/Sign_function
Ah I think I see what you did here: #13896 (comment)
Yeah. Also, if you look at the sklearn wrapper in LightGBM, all hessians and gradients are simply multiplied by SW.
if sample_weight == 'ones':
    sample_weight = np.ones(shape=n_samples, dtype=Y_DTYPE)
else:
    sample_weight = rng.normal(size=n_samples).astype(Y_DTYPE)
Do we want to include negative numbers here?
There doesn't seem to be a place where we check for non-negative weights, and I'm not sure if the math requires the weights to be positive.
For ref, this is discussed in #15531
I think we can safely ignore the topic for now. (I'm reasonably confident that negative SW would work "as expected" here, though I've no idea what it would look like with binning)
LGTM
Hi, I'm experiencing an ImportError when building the docs. Here is the full log:
sphinx-build -D plot_gallery=0 -b html -T -d _build/doctrees -j auto . _build/html/stable
Running Sphinx v2.4.1
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/cmd/build.py", line 275, in build_main
args.tags, args.verbosity, args.jobs, args.keep_going)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/application.py", line 219, in __init__
self.config = Config.read(self.confdir, confoverrides or {}, self.tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 193, in read
namespace = eval_config_file(filename, tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 358, in eval_config_file
raise ConfigError(msg % traceback.format_exc())
sphinx.errors.ConfigError: There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
Configuration error:
There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
make: *** [html-noplot] Error 2

Is anyone experiencing the same problem or am I missing something?
It's probably an issue with different scikit-learn installs in your path or something. If you start with a clean environment and install the nightly build, it should work, I suppose. @remykarem
It's probably just due to the fact that you haven't cythonized the new files. Before cleaning your env I'd suggest to just run
That works, thank you for the tip!
Fixes #14830
This PR adds sample weight support to HGBT estimators.