
ENH Support sample weights in HGBT #14696


Merged: 79 commits into scikit-learn:master from adrinjalali:hgbt/sample_weights, Feb 24, 2020

Conversation

@adrinjalali (Member) commented Aug 20, 2019

Fixes #14830

This PR adds sample weight support to HGBT estimators.

@adrinjalali adrinjalali changed the title [WIP] Support sample weights in HGBT Support sample weights in HGBT Sep 1, 2019
@adrinjalali adrinjalali marked this pull request as ready for review September 1, 2019 10:41
@adrinjalali (Member Author)

Needs more tests, but already passes some edge case tests. @NicolasHug wanna have a look?

@NicolasHug (Member) left a comment

Thanks Adrin! Made a first pass.

@NicolasHug (Member)

Apart from the minor hessians_are_constant issue, I think this looks pretty good now; we just need some tests. Some basic ones I can think of are:

  • make sure hessians_are_constant is always False when sw is not None
  • make sure the hessians array is not initialized to [1] when sw is not None
  • make sure the sum_hessian field of the histograms corresponds to the sum of the sample weights
  • make sure the gradients and hessians are multiplied by the sw

Oh, and I just realized we need the partial dependence plots (PDPs) to take SW into account now ^^
There was a PR for SW and the PDPs of the other trees: #13193

@adrinjalali (Member Author)

This test fails, and I'm trying to figure out why:

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

model = HistGradientBoostingClassifier(max_iter=2, max_depth=2,
                                       random_state=42,
                                       validation_fraction=None)
X, y = make_classification(n_classes=2, flip_y=.3, n_redundant=10,
                           random_state=41)
n_samples = X.shape[0]

# Duplicate the first half of the samples ...
X_ = np.r_[X, X[:n_samples // 2, :]]
y_ = np.r_[y, y[:n_samples // 2]]
# ... which should be equivalent to giving those samples a weight of 2.
sample_weight = np.ones(shape=(n_samples,))
sample_weight[:n_samples // 2] = 2

no_dup_no_sw = model.fit(X, y).predict(X)
dup_no_sw = model.fit(X_, y_).predict(X)
no_dup_sw = model.fit(X, y, sample_weight=sample_weight).predict(X)
print(np.all(dup_no_sw == no_dup_sw))    # duplication vs. weighting: expected True
print(np.all(no_dup_no_sw == dup_no_sw))

@glemaitre glemaitre self-requested a review September 10, 2019 07:55
@glemaitre (Member) left a comment

sample_weight = np.ones(shape=(n_samples,))
sample_weight[:n_samples // 2] = 2

This looks odd to me. Should it instead be:

sample_weight = np.zeros(shape=(n_samples,))
sample_weight[:n_samples // 2] = 1
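
For context, a minimal sketch of the property this variant would exercise, assuming zero-weighted samples should behave as if they were absent (whether this holds exactly under unweighted binning is part of the discussion):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, flip_y=.3, n_redundant=10,
                           random_state=41)
n_samples = X.shape[0]

# Give the second half of the samples a weight of zero ...
sample_weight = np.zeros(shape=(n_samples,))
sample_weight[:n_samples // 2] = 1

model = HistGradientBoostingClassifier(max_iter=2, max_depth=2,
                                       random_state=42,
                                       validation_fraction=None)
# ... which should behave like dropping those samples altogether.
pred_sw = model.fit(X, y, sample_weight=sample_weight).predict(X)
pred_sub = model.fit(X[:n_samples // 2], y[:n_samples // 2]).predict(X)
print(np.all(pred_sw == pred_sub))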

@NicolasHug NicolasHug dismissed their stale review January 28, 2020 16:27

Too long ago

@NicolasHug (Member)

I would also be happy with removing support for weights in the binning (wrong, but still good enough), merging the PR, and dealing with that later. The estimators are still experimental, so that's fine.

I'd like to get this PR merged so we can focus on the categorical support, which I'd really like to get in for 0.23. If we want to release in May, we need to start soon.

@amueller (Member) commented Feb 3, 2020

I'd love to move forward with this so we can get started on the categorical variables. So I see three options:

  1. Keep it as it was and don't support sample-weights for the binning (which I think will be fine).
  2. Do the subsampling that @NicolasHug proposes.
  3. Implement weighted percentiles.

Honestly, I'm happy with any of these. Option 3 adds a bit more complexity, but it's not terrible, and I'd rather move forward so we can do categorical variables for 0.23.
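
For illustration, a minimal numpy sketch of what option 3 could look like (the function name and interpolation choice are assumptions, not this PR's implementation):

import numpy as np

def weighted_percentile(values, weights, q):
    # Smallest value whose weighted empirical CDF reaches q / 100.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights) / weights.sum()
    idx = np.searchsorted(cdf, q / 100.0)
    return values[min(idx, len(values) - 1)]

print(weighted_percentile([1, 2, 3, 4], [1, 1, 1, 1], 50))  # 2.0
print(weighted_percentile([1, 2, 3, 4], [1, 3, 1, 1], 25))  # 2.0 (with unit weights this would be 1.0)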

@adrinjalali (Member Author)

@ogrisel did you have a chance to look into this and the weighted percentiles?

@NicolasHug (Member)

I believe @ogrisel was OK with my proposal according to #14696 (comment)

If that helps @adrinjalali, you can consider resampling with replacement as a general framework for handling sample_weight. This is applicable to any estimator, not just the binner.
It should be a very close approximation, provided that we have enough samples; in the limit of infinitely many resamples, it gives the exact result.
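
A minimal sketch of the idea (illustrative toy data, not code from this PR):

import numpy as np

rng = np.random.RandomState(0)
sample_weight = np.array([1., 2., 1., 4.])  # toy weights
n_samples = sample_weight.shape[0]

# Draw indices with probability proportional to the weights; fitting an
# estimator on X[indices], y[indices] approximates fitting with sample_weight.
proba = sample_weight / sample_weight.sum()
indices = rng.choice(n_samples, size=n_samples, replace=True, p=proba)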

Again, I'm fine with reverting to the previous state where we don't handle sample_weight in the binner.

I'm happy to help move this forward if needed.

@NicolasHug (Member) left a comment

Thanks @adrinjalali

@ogrisel, the SW support for binning has been reverted (as in LightGBM). The PR is in a solid state, and I think reviewing should be reasonably easy now.

Comment on lines 412 to 413
- |Feature| Estimators now support :term:`sample_weight`. :pr:`14696` by
`Adrin Jalali`_ and `Nicolas Hug`_.
Member

should be in 0.23 now

Member

Moved to 0.23


if getattr(self, '_fitted_with_sw', False):
    raise NotImplementedError("{} does not support partial dependence"
                              " plots when sample weights were given "
Member

"with the 'recursion' method"

(for some reason I can't make suggestions on GitHub)

Member

Addressed comment

@NicolasHug (Member)

Maybe @thomasjpfan and @glemaitre can also give it a second look.

@thomasjpfan (Member) left a comment

Awesomeee!

>>> gb.predict([[1, 0]])
array([1])

As you can see, the `[1, 0]` is comfortably classified as `1` since the first
Member

Would the probability show more comfort?

gb.predict_proba([[1, 0]])[0, 1]
# 0.99...

# gradient = sign(raw_prediction - y_true) * sample_weight
gradients[i] = sample_weight[i] * (2 *
    (y_true[i] - raw_predictions[i] < 0) - 1)
hessians[i] = sample_weight[i]
Member

Does this work because sample_weight is non-negative? If so, let's leave a comment?

Member

Does this work because sample_weight is non-negative?

No, this works because:

  • accounting for SW means we need to multiply gradients and hessians by SW
  • without SW, the hessian of this loss is constant and is equal to 1.
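
Spelling out the math for this least absolute deviation loss, writing raw_predictions as \hat{y} (a sketch of the two points above, not text from the PR):

\ell_i = w_i \, |y_i - \hat{y}_i|
g_i = \partial \ell_i / \partial \hat{y}_i = w_i \, \operatorname{sign}(\hat{y}_i - y_i)
h_i = w_i \quad \text{(the unweighted hessian surrogate is the constant 1)}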

Member

I was thinking about the math. LightGBM does the same thing with the L1 loss:

https://github.com/microsoft/LightGBM/blob/4adb9ff71f41f6b5c7a51f667a8fb9adf38cf602/src/objective/regression_objective.hpp#L225-L229

When I see the derivative of sign(x), I think of the Dirac delta function: https://en.wikipedia.org/wiki/Sign_function

Member

Ah I think I see what you did here: #13896 (comment)

Member Author

Yeah. Also, if you look at the scikit-learn wrapper in LightGBM, all gradients and hessians are simply multiplied by SW.
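
To make that concrete, a small numpy sketch with the least-squares loss (illustrative, not the wrapper's actual code):

import numpy as np

y_true = np.array([0., 1., 2.])
raw_predictions = np.array([.5, .5, .5])
sample_weight = np.array([1., 2., .5])

# Unweighted least squares: gradients = raw - y, hessians = 1 (constant).
gradients = raw_predictions - y_true
hessians = np.ones_like(y_true)

# With SW, both are simply scaled elementwise; the hessians stop being constant.
gradients *= sample_weight
hessians *= sample_weight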


if sample_weight == 'ones':
    sample_weight = np.ones(shape=n_samples, dtype=Y_DTYPE)
else:
    sample_weight = rng.normal(size=n_samples).astype(Y_DTYPE)
Member

Do we want to include negative numbers here?

Member Author

There doesn't seem to be a place where we check for non-negative weights, and I'm not sure whether the math requires the weights to be positive.

Member

For reference, this is discussed in #15531.

I think we can safely ignore the topic for now. (I'm reasonably confident that negative SW would work "as expected" here, though I've no idea what it would look like with binning)

@thomasjpfan (Member) left a comment

LGTM

@thomasjpfan thomasjpfan changed the title Support sample weights in HGBT ENH Support sample weights in HGBT Feb 24, 2020
@thomasjpfan thomasjpfan merged commit e24998f into scikit-learn:master Feb 24, 2020
@adrinjalali adrinjalali deleted the hgbt/sample_weights branch February 24, 2020 21:53
@remykarem

Hi, I'm experiencing an ImportError while building the documentation after this PR was merged:

ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' 

Here is the full log:

sphinx-build -D plot_gallery=0 -b html -T -d _build/doctrees  -j auto  . _build/html/stable
Running Sphinx v2.4.1

Traceback (most recent call last):
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
    execfile_(filename, namespace)
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
    exec(code, _globals)
  File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
  File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
    from ..ensemble._hist_gradient_boosting.gradient_boosting import (
  File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
    from .loss import _LOSSES
  File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
    from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/cmd/build.py", line 275, in build_main
    args.tags, args.verbosity, args.jobs, args.keep_going)
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/application.py", line 219, in __init__
    self.config = Config.read(self.confdir, confoverrides or {}, self.tags)
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 193, in read
    namespace = eval_config_file(filename, tags)
  File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 358, in eval_config_file
    raise ConfigError(msg % traceback.format_exc())
sphinx.errors.ConfigError: There is a programmable error in your configuration file:

make: *** [html-noplot] Error 2

Is anyone else experiencing the same problem, or am I missing something?

@adrinjalali (Member Author)

It's probably an issue with multiple scikit-learn installations in your path or something. If you start with a clean environment and install the nightly build, it should work, I suppose. @remykarem

@NicolasHug (Member)

It's probably just due to the fact that you haven't cythonized the new files. Before cleaning your env, I'd suggest just running make or pip install -e . in the repo directory; that should fix it.

@remykarem

That works, thank you for the tip!


Successfully merging this pull request may close these issues.

sample_weight support in HistGradientBoostingClassifier
7 participants