[MRG+2] Faster Gradient Boosting Decision Trees with binned features #12807


Merged: 276 commits, Apr 26, 2019
276 commits
2f0c93a
Unified type imports in types
NicolasHug Jan 13, 2019
ca4d144
Tried to parallelize prediction but it doesn't work :(
NicolasHug Jan 13, 2019
67602e5
made tests use types in types.pyx instead of hardcoded types
NicolasHug Jan 13, 2019
498fe50
lgbm tests are passing \o/
NicolasHug Jan 13, 2019
889d39f
Added binary classification support
NicolasHug Jan 13, 2019
722a982
Added multiclass classification support, all tests are passing \o/
NicolasHug Jan 14, 2019
1ea65e2
Parallelize predictions
NicolasHug Jan 14, 2019
e9c2509
removed get_threads_chunks
NicolasHug Jan 14, 2019
cf3f723
n_features param to test script
NicolasHug Jan 15, 2019
c6227cd
Specified array alignments in splitting and histogram
NicolasHug Jan 16, 2019
10520da
used const views where possible and used prange sum reduction
NicolasHug Jan 16, 2019
2a80af8
Directly pass sum_gradient and sum_hessians to find_node_split_subtra…
NicolasHug Jan 16, 2019
6fafd85
local variables to avoid python interactions
NicolasHug Jan 16, 2019
dac76a1
split_indices is now a method
NicolasHug Jan 16, 2019
3614a7e
find_node_split is now a method
NicolasHug Jan 16, 2019
8e8b927
find_node_split_subtraction is now a method
NicolasHug Jan 16, 2019
1fac60a
find_node_split_subtraction is now a method
NicolasHug Jan 16, 2019
f8500a2
Refactored SplittingContext into a proper Splitter
NicolasHug Jan 16, 2019
c4d00f0
lots of cosmetics
NicolasHug Jan 16, 2019
628ea61
fixed test segfault
NicolasHug Jan 17, 2019
5d8c21a
init file for tests
NicolasHug Jan 17, 2019
35343f2
renamed estimators
NicolasHug Jan 17, 2019
d0f73cd
made module private and estimators are available in ensemble
NicolasHug Jan 17, 2019
af23bec
pep8
NicolasHug Jan 17, 2019
2fd29e1
some comments
NicolasHug Jan 17, 2019
5a82534
checkpoint before changing scoring param
NicolasHug Jan 17, 2019
ae4640e
Fixed bug in update_raw_predictions
NicolasHug Jan 18, 2019
ec5128c
small optimization for root node splitting
NicolasHug Jan 18, 2019
565e936
numerically stable logsumexp
NicolasHug Jan 18, 2019
713d838
minimal splitter change
NicolasHug Jan 18, 2019
10affef
more sensible early stopping
NicolasHug Jan 18, 2019
1cd23f1
changed min_samples_leaf default to 5
NicolasHug Jan 18, 2019
5060aee
pass feature_idx to histogram builders to avoid python interactions
NicolasHug Jan 18, 2019
1bfde2c
some doc and attribute exposition
NicolasHug Jan 18, 2019
65ac62a
removed constant_hessian_value
NicolasHug Jan 18, 2019
27d32d6
removed f-strings
NicolasHug Jan 18, 2019
a86f0d2
Merge branch 'master' into gbm
NicolasHug Jan 18, 2019
59a7483
removed unused files
NicolasHug Jan 18, 2019
04a99c4
removed benchmark files
NicolasHug Jan 18, 2019
e4738ee
Added higgs boson benchmark and removed files
NicolasHug Jan 18, 2019
2341a04
Added another benchmark
NicolasHug Jan 18, 2019
29ffcdf
changed benchmark default learning rate
NicolasHug Jan 18, 2019
b4ba169
used custom expit function
NicolasHug Jan 18, 2019
e66fff2
doc
NicolasHug Jan 18, 2019
c75acca
Added decision_function
NicolasHug Jan 18, 2019
9ff4242
Using openmp flags from #11950
NicolasHug Jan 18, 2019
d782d02
scipy logsumexp import from misc if error
NicolasHug Jan 18, 2019
ea53299
pep8
NicolasHug Jan 18, 2019
c50f9e7
fix test_loss in 3.5
NicolasHug Jan 19, 2019
48abf28
truncate array before rank check in check_decision_proba_consistency …
NicolasHug Jan 20, 2019
f93e2a5
set random_state in second round of fit_idempotent
NicolasHug Jan 20, 2019
01098e3
revert bad changes
NicolasHug Jan 20, 2019
602802f
probing travis
NicolasHug Jan 20, 2019
a70b150
second
NicolasHug Jan 20, 2019
396b65c
...
NicolasHug Jan 20, 2019
0dbbcee
...
NicolasHug Jan 20, 2019
44c34cc
Merge branch 'master' into gbm
NicolasHug Jan 21, 2019
4614762
Revert travis probing changes
NicolasHug Jan 22, 2019
2181495
slightly change feature splitting routine
NicolasHug Jan 22, 2019
cb38816
removed unused attributes
NicolasHug Jan 22, 2019
8fc65f7
put back small optimization for small hessians
NicolasHug Jan 22, 2019
d703bf1
trying range instead of prange for summing gradients
NicolasHug Jan 22, 2019
afd48ac
cosmetics
NicolasHug Jan 22, 2019
2e5bf39
revert change in setup
NicolasHug Jan 22, 2019
ce5dff3
Added note in user guide
NicolasHug Jan 22, 2019
6e791ba
some docstrings
NicolasHug Jan 22, 2019
f543d61
convert prange argument to int instead of unsigned int to avoid cytho…
NicolasHug Jan 22, 2019
00aab5f
minor comments
NicolasHug Jan 22, 2019
ad94842
removed construction_speed
NicolasHug Jan 22, 2019
468ec14
removed throughput computation
NicolasHug Jan 22, 2019
a53de7b
lower decimal rounding for check
NicolasHug Jan 22, 2019
e06b988
set random seed in test
NicolasHug Jan 23, 2019
a92cbbd
typo
NicolasHug Jan 23, 2019
2ba66cb
Merge branch 'master' into gbm
NicolasHug Jan 23, 2019
783a399
Should fix check_fit_idempotent due to prange summing instability
NicolasHug Jan 23, 2019
e47b745
renamed start and stop into partition_start and partition_stop
NicolasHug Jan 25, 2019
39d8030
Parallelized root gradient and hessians sums
NicolasHug Jan 27, 2019
14c7d47
Used floats instead of doubles for gradients and hessians arrays
NicolasHug Jan 30, 2019
c34a054
Merge branch 'master' into gbm
NicolasHug Jan 30, 2019
92dfe9d
More explicit names for sums of gradients and hessians in SplitInfo
NicolasHug Jan 30, 2019
c338ba8
Merge branch 'master' into gbm
NicolasHug Feb 4, 2019
170a5e1
first round of comments
NicolasHug Feb 4, 2019
d16ecff
made raw_predictions C-contiguous on the n_samples dimension
NicolasHug Feb 4, 2019
9e68984
optimized gradient update for multiclass loss
NicolasHug Feb 4, 2019
96d9ea6
used 2d arrays instead of 1d for gradients and hessians
NicolasHug Feb 4, 2019
cbd9d15
added comment about tests that should be removed
NicolasHug Feb 4, 2019
e160d55
Addressed Joel's comments
NicolasHug Feb 5, 2019
3cb197e
slightly more detailed doc about when not to use new estimators
NicolasHug Feb 5, 2019
d653a54
p is now a 2d numpy array instead of malloc'ed buffer
NicolasHug Feb 5, 2019
c3e4340
typo
NicolasHug Feb 5, 2019
b1784b0
used memcpy in splitter instead of loop
NicolasHug Feb 5, 2019
2004615
removed unused n_bins parameter to histogram routines
NicolasHug Feb 6, 2019
b071efd
Merge branch 'master' into gbm
NicolasHug Feb 10, 2019
483a744
Apply suggestions from code review
thomasjpfan Feb 14, 2019
c5ccae7
removed useless nogil in function def
NicolasHug Feb 14, 2019
23f1d4f
Used timeit.default_timer instead of time.time
NicolasHug Feb 14, 2019
f935761
reverted unwanted change
NicolasHug Feb 14, 2019
9ec5a49
made n_trees_per_iter and baseline_pred private attributes
NicolasHug Feb 14, 2019
1364f43
Slightly changed logistic loss gradient computation
NicolasHug Feb 17, 2019
e512799
Addressed comments
NicolasHug Feb 17, 2019
6266c6d
Removed unused import in loss.pyx
NicolasHug Feb 17, 2019
e818f00
use check_early_stopping instead of get_scores
NicolasHug Feb 18, 2019
2d76ad3
Added XGBoost and CatBoost estimators in benchmarks
NicolasHug Feb 18, 2019
9717834
Should fix tests
NicolasHug Feb 19, 2019
a83225e
used lightgbm xgboost catboost full names
NicolasHug Feb 20, 2019
b9a151a
Addressed Adrin's comments:
NicolasHug Feb 26, 2019
0c09736
Merge branch 'master' into gbm
NicolasHug Feb 26, 2019
b7cf145
better use of _in_fit attribute
NicolasHug Feb 27, 2019
82f4ce1
changed use of estimators for predictors and iterations
NicolasHug Feb 27, 2019
5d53e5b
BinMapper now private
NicolasHug Feb 27, 2019
d79d636
renamed estimators from Fastblahblah to Histblahblah
NicolasHug Feb 27, 2019
0204a5d
Created experimental module
NicolasHug Feb 27, 2019
8045eb9
add subpackage
NicolasHug Feb 27, 2019
b3d32ba
hmmm
NicolasHug Feb 27, 2019
de051a9
added experimental in sklearn.__init__.__all__
NicolasHug Feb 27, 2019
431920d
added empty test folder
NicolasHug Feb 27, 2019
404f3ae
test
NicolasHug Feb 27, 2019
fb86030
Biggish refactoring of splitting:
NicolasHug Feb 28, 2019
69f6c4b
typo
NicolasHug Feb 28, 2019
6f5c93f
histogram are returned, not passed as OUT variables
NicolasHug Feb 28, 2019
9262d45
Merge branch 'master' into gbm
NicolasHug Feb 28, 2019
796183f
renaming and comments
NicolasHug Mar 1, 2019
f04f4d8
use regular class instead of cdef class for SplitInfo
NicolasHug Mar 1, 2019
7fcf760
Created HistogramBuilder class
NicolasHug Mar 1, 2019
8de4e4f
Added compute_hist_time for verbose output
NicolasHug Mar 1, 2019
c76dcd4
some cleaning
NicolasHug Mar 1, 2019
ee96ac3
Fixed constant hessian issue
NicolasHug Mar 1, 2019
c08ca89
Update sklearn/_fast_gradient_boosting/_binning.pyx
adrinjalali Mar 15, 2019
bc0d805
Removed wrapper functions in loss updates
NicolasHug Mar 15, 2019
fcfbf64
Addressed comments from Adrin
NicolasHug Mar 15, 2019
c4f3985
Merge branch 'master' into gbm
NicolasHug Mar 15, 2019
2d2c081
removed __all__ from _fast.../__init__.py
NicolasHug Mar 15, 2019
2af2504
optional ( instead of optional(
NicolasHug Mar 16, 2019
cec180e
moved _fast_.. into sklearn/ensemble/ and renamed *fast* into *hist*
NicolasHug Mar 16, 2019
f79763e
typo
NicolasHug Mar 16, 2019
930c4d6
removed unnecessary estimator check change?
NicolasHug Mar 16, 2019
8df021e
windows fix?
NicolasHug Mar 16, 2019
d6df35f
Addressing comments
NicolasHug Mar 25, 2019
e8d3554
more addressing
NicolasHug Mar 25, 2019
33e8374
added notes about unwrapping
NicolasHug Mar 25, 2019
fa38f02
renamed n_bins_per_feature to actual_n_bins
NicolasHug Mar 25, 2019
c3702a5
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Mar 25, 2019
27f6481
more pythonic empty list checking
NicolasHug Mar 31, 2019
357a283
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 1, 2019
e4d67f7
Benchmark now using AUC from predict_proba
NicolasHug Apr 4, 2019
3f94a32
lgbm -> lightgbm, xgb -> xgboost, etc.
NicolasHug Apr 4, 2019
a4d5c9b
Apply suggestions from code review
glemaitre Apr 4, 2019
da1174c
Addressed comments
NicolasHug Apr 4, 2019
fc1a399
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 4, 2019
e2319be
Flake8
NicolasHug Apr 4, 2019
1fc79af
subsampling without replacement
NicolasHug Apr 4, 2019
86a8496
Apply suggestions from code review
glemaitre Apr 5, 2019
3c5f922
Addressed comments
NicolasHug Apr 5, 2019
491e14c
Make sure score time runs on n_samples
ogrisel Apr 5, 2019
04f0e86
Small improvement to benchmark script
ogrisel Apr 5, 2019
bdfacb1
scipy/scipy#9608 seems to be fixed in 1.2.1
ogrisel Apr 5, 2019
2644cb3
Better coverage and error message for binary_crossentropy on multicla…
ogrisel Apr 5, 2019
2416cb7
Cosmetic
ogrisel Apr 5, 2019
b0ba1d6
Cosmetic
ogrisel Apr 5, 2019
ae0d101
Make the least squares loss slightly less surprising
ogrisel Apr 5, 2019
f53a2e0
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 5, 2019
beb0e31
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 5, 2019
9c3c450
update Note text
NicolasHug Apr 5, 2019
5b40ffd
print loss instead of neg loss
NicolasHug Apr 5, 2019
47a72da
n_trees_per_iteration_ is now a public attribute
NicolasHug Apr 5, 2019
01ec7d6
Optimized early stopping when computed on the loss
NicolasHug Apr 8, 2019
7813e96
forgot to amend changes
NicolasHug Apr 8, 2019
f1c1c3d
Apply suggestions from code review
glemaitre Apr 11, 2019
c563977
Addressed Guillaume's comments
NicolasHug Apr 11, 2019
6d1b606
Apply suggestions from code review
glemaitre Apr 11, 2019
b33ebad
Addressed comments
NicolasHug Apr 11, 2019
946823f
Apply suggestions from code review
glemaitre Apr 11, 2019
1934c56
Added shape for samples_indices
NicolasHug Apr 11, 2019
cf2d832
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 12, 2019
9a93692
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 13, 2019
5ab23cd
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 15, 2019
81a51c9
Update sklearn/ensemble/_hist_gradient_boosting/grower.py
glemaitre Apr 15, 2019
2e86b3c
Update sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
glemaitre Apr 15, 2019
ccde666
Update sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
glemaitre Apr 15, 2019
726f0e6
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 15, 2019
903b522
Added explicit scheduling and chunksizes for prange
NicolasHug Apr 15, 2019
a120db2
assert baseline_prediction has the same dtype as y_train
NicolasHug Apr 15, 2019
4c4a05a
removed default values for SplitInfo
NicolasHug Apr 15, 2019
8b70c5d
removed check_estimators
NicolasHug Apr 15, 2019
82428f0
Apply suggestions from code review
glemaitre Apr 15, 2019
063fdea
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 15, 2019
bd72a4b
pep8
NicolasHug Apr 15, 2019
2e24b71
minor docstring
NicolasHug Apr 15, 2019
a7766fa
removed explicit type conversion and copy=False not supported in all …
NicolasHug Apr 15, 2019
1536120
Update sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py
glemaitre Apr 15, 2019
7e4a88b
changed min_samples_leaf default back to 20, and updated set_checking…
NicolasHug Apr 15, 2019
b4ce890
added check for bin mapper for wrong n_features at transform
NicolasHug Apr 15, 2019
c272fd0
Adjusted early stopping tests now that min_samples_leaf default has c…
NicolasHug Apr 15, 2019
22ce4fa
changed confusing should_stop test
NicolasHug Apr 15, 2019
11f5573
fixed again should_stop test
NicolasHug Apr 16, 2019
0bb5a9f
Addressed comments
NicolasHug Apr 16, 2019
2c461d6
forces max_depth and max_leaf_nodes >= 2 and added max_depth test
NicolasHug Apr 16, 2019
45a1d05
addressed comments
NicolasHug Apr 16, 2019
dcce26b
Addressed comments
NicolasHug Apr 16, 2019
f4ac929
use from sklearn.experimental import enable_hist_gradient_boosting
NicolasHug Apr 17, 2019
062ec75
noqa for whole file test_enable_hist_gradient_boosting.py
NicolasHug Apr 17, 2019
72d48b9
flake8
NicolasHug Apr 17, 2019
6d70978
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 17, 2019
505d409
protected omp_get_max_threads()
NicolasHug Apr 17, 2019
b8b73e6
trying without module deletion hack
NicolasHug Apr 17, 2019
acfcce5
deleted test_enable file: impossible to do properly
NicolasHug Apr 17, 2019
fab5cc2
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 18, 2019
8b1f603
test enable_experimental with assert_run_python_script from cloud_pickle
NicolasHug Apr 18, 2019
ea14a84
Addressed comments
NicolasHug Apr 18, 2019
69f127c
put back min_samples_leaf=5 for checks of HistGradientBoostingRegressor
NicolasHug Apr 18, 2019
83bc17a
removed one line so that the PR is 5555 lines
NicolasHug Apr 18, 2019
b79876b
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 19, 2019
c4b22bf
Moved utility into utils.testing and updated docstring
NicolasHug Apr 19, 2019
6553f72
pep8
NicolasHug Apr 19, 2019
4cb5da4
added comment for min_samples_leaf
NicolasHug Apr 19, 2019
a8a4ce0
doc
NicolasHug Apr 19, 2019
6109620
docstring params
NicolasHug Apr 20, 2019
3ef0212
no idea what's going on?
NicolasHug Apr 20, 2019
d493fe4
remove coverage?
NicolasHug Apr 20, 2019
1da9941
put back helper in experimental/test_ :/
NicolasHug Apr 20, 2019
5623288
hmm
NicolasHug Apr 20, 2019
442593a
changed cwd and env
NicolasHug Apr 20, 2019
172abaa
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 22, 2019
4755ba7
specify --cov-file
NicolasHug Apr 22, 2019
058ae94
rcfile instead of -cov-file
NicolasHug Apr 22, 2019
5cbabf8
noideawatimdoing
NicolasHug Apr 22, 2019
42dda67
revert
NicolasHug Apr 22, 2019
6c9f03e
Trying with parallel = True in coveragerc
NicolasHug Apr 24, 2019
6d62f92
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 24, 2019
cc980a7
using --cov-config??
NicolasHug Apr 24, 2019
6685244
Small improvements to coverage config
ogrisel Apr 24, 2019
49ca471
removed include to avoid warning
NicolasHug Apr 24, 2019
f0d477c
Merge branch 'gbm' of github.com:NicolasHug/scikit-learn into gbm
NicolasHug Apr 24, 2019
37e2742
Merge branch 'master' of github.com:scikit-learn/scikit-learn into gbm
NicolasHug Apr 24, 2019
8bffe2c
put back parallel = True
NicolasHug Apr 24, 2019
e1deb05
trying to pass --rcfile to coverage
NicolasHug Apr 24, 2019
dfbea1d
magic
NicolasHug Apr 24, 2019
94b814c
revert magic
NicolasHug Apr 24, 2019
66d1376
magic again
NicolasHug Apr 25, 2019
7bc7f6e
Update test_pytest_soft_dependency.sh
glemaitre Apr 25, 2019
6f6fa51
Trigger CI??
NicolasHug Apr 25, 2019
ac40e4d
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 26, 2019
55152cd
Merge remote-tracking branch 'upstream/master' into gbm
NicolasHug Apr 26, 2019
962c5e4
MAINT coverage config for test_pytest_soft_dependency.sh
ogrisel Apr 26, 2019
10cb5be
Try to omit any setup.py file from the coverage report
ogrisel Apr 26, 2019
8adb9f0
TEST_DIR is not a subfolder of BUILD_SOURCESDIRECTORY
ogrisel Apr 26, 2019
406cec1
One more try
ogrisel Apr 26, 2019
9d8269a
coverage combine in TEST_DIR
ogrisel Apr 26, 2019
d63d9db
remove useless pass
ogrisel Apr 26, 2019
280c487
omit */setup.py
ogrisel Apr 26, 2019
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,7 +1,7 @@
 [run]
 branch = True
 source = sklearn
-include = */sklearn/*
+parallel = True
 omit =
     */sklearn/externals/*
     */benchmarks/*
241 changes: 241 additions & 0 deletions benchmarks/bench_hist_gradient_boosting.py
@@ -0,0 +1,241 @@
from time import time
import argparse

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
from sklearn.ensemble._hist_gradient_boosting.utils import (
    get_equivalent_estimator)


parser = argparse.ArgumentParser()
parser.add_argument('--n-leaf-nodes', type=int, default=31)
parser.add_argument('--n-trees', type=int, default=10)
parser.add_argument('--lightgbm', action="store_true", default=False,
                    help='also plot lightgbm')
parser.add_argument('--xgboost', action="store_true", default=False,
                    help='also plot xgboost')
parser.add_argument('--catboost', action="store_true", default=False,
                    help='also plot catboost')
parser.add_argument('--learning-rate', type=float, default=.1)
parser.add_argument('--problem', type=str, default='classification',
                    choices=['classification', 'regression'])
parser.add_argument('--n-classes', type=int, default=2)
parser.add_argument('--n-samples-max', type=int, default=int(1e6))
parser.add_argument('--n-features', type=int, default=20)
parser.add_argument('--max-bins', type=int, default=255)
args = parser.parse_args()

n_leaf_nodes = args.n_leaf_nodes
n_trees = args.n_trees
lr = args.learning_rate
max_bins = args.max_bins


def get_estimator_and_data():
    if args.problem == 'classification':
        X, y = make_classification(args.n_samples_max * 2,
                                   n_features=args.n_features,
                                   n_classes=args.n_classes,
                                   n_clusters_per_class=1,
                                   random_state=0)
        return X, y, HistGradientBoostingClassifier
    elif args.problem == 'regression':
        X, y = make_regression(args.n_samples_max * 2,
                               n_features=args.n_features, random_state=0)
        return X, y, HistGradientBoostingRegressor


X, y, Estimator = get_estimator_and_data()
X_train_, X_test_, y_train_, y_test_ = train_test_split(
    X, y, test_size=0.5, random_state=0)


def one_run(n_samples):
    X_train = X_train_[:n_samples]
    X_test = X_test_[:n_samples]
    y_train = y_train_[:n_samples]
    y_test = y_test_[:n_samples]
    assert X_train.shape[0] == n_samples
    assert X_test.shape[0] == n_samples
    print("Data size: %d samples train, %d samples test."
          % (n_samples, n_samples))
    print("Fitting a sklearn model...")
    tic = time()
    est = Estimator(learning_rate=lr,
                    max_iter=n_trees,
                    max_bins=max_bins,
                    max_leaf_nodes=n_leaf_nodes,
                    n_iter_no_change=None,
                    random_state=0,
                    verbose=0)
    est.fit(X_train, y_train)
    sklearn_fit_duration = time() - tic
    tic = time()
    sklearn_score = est.score(X_test, y_test)
    sklearn_score_duration = time() - tic
    print("score: {:.4f}".format(sklearn_score))
    print("fit duration: {:.3f}s,".format(sklearn_fit_duration))
    print("score duration: {:.3f}s,".format(sklearn_score_duration))

    lightgbm_score = None
    lightgbm_fit_duration = None
    lightgbm_score_duration = None
    if args.lightgbm:
        print("Fitting a LightGBM model...")
        # get_equivalent_estimator does not accept loss='auto'
        if args.problem == 'classification':
            loss = 'binary_crossentropy' if args.n_classes == 2 else \
                'categorical_crossentropy'
            est.set_params(loss=loss)
        lightgbm_est = get_equivalent_estimator(est, lib='lightgbm')

        tic = time()
        lightgbm_est.fit(X_train, y_train)
        lightgbm_fit_duration = time() - tic
        tic = time()
        lightgbm_score = lightgbm_est.score(X_test, y_test)
        lightgbm_score_duration = time() - tic
        print("score: {:.4f}".format(lightgbm_score))
        print("fit duration: {:.3f}s,".format(lightgbm_fit_duration))
        print("score duration: {:.3f}s,".format(lightgbm_score_duration))

    xgb_score = None
    xgb_fit_duration = None
    xgb_score_duration = None
    if args.xgboost:
        print("Fitting an XGBoost model...")
        # get_equivalent_estimator does not accept loss='auto'
        if args.problem == 'classification':
            loss = 'binary_crossentropy' if args.n_classes == 2 else \
                'categorical_crossentropy'
            est.set_params(loss=loss)
        xgb_est = get_equivalent_estimator(est, lib='xgboost')

        tic = time()
        xgb_est.fit(X_train, y_train)
        xgb_fit_duration = time() - tic
        tic = time()
        xgb_score = xgb_est.score(X_test, y_test)
        xgb_score_duration = time() - tic
        print("score: {:.4f}".format(xgb_score))
        print("fit duration: {:.3f}s,".format(xgb_fit_duration))
        print("score duration: {:.3f}s,".format(xgb_score_duration))

    cat_score = None
    cat_fit_duration = None
    cat_score_duration = None
    if args.catboost:
        print("Fitting a CatBoost model...")
        # get_equivalent_estimator does not accept loss='auto'
        if args.problem == 'classification':
            loss = 'binary_crossentropy' if args.n_classes == 2 else \
                'categorical_crossentropy'
            est.set_params(loss=loss)
        cat_est = get_equivalent_estimator(est, lib='catboost')

        tic = time()
        cat_est.fit(X_train, y_train)
        cat_fit_duration = time() - tic
        tic = time()
        cat_score = cat_est.score(X_test, y_test)
        cat_score_duration = time() - tic
        print("score: {:.4f}".format(cat_score))
        print("fit duration: {:.3f}s,".format(cat_fit_duration))
        print("score duration: {:.3f}s,".format(cat_score_duration))

    return (sklearn_score, sklearn_fit_duration, sklearn_score_duration,
            lightgbm_score, lightgbm_fit_duration, lightgbm_score_duration,
            xgb_score, xgb_fit_duration, xgb_score_duration,
            cat_score, cat_fit_duration, cat_score_duration)


n_samples_list = [1000, 10000, 100000, 500000, 1000000, 5000000, 10000000]
n_samples_list = [n_samples for n_samples in n_samples_list
                  if n_samples <= args.n_samples_max]

sklearn_scores = []
sklearn_fit_durations = []
sklearn_score_durations = []
lightgbm_scores = []
lightgbm_fit_durations = []
lightgbm_score_durations = []
xgb_scores = []
xgb_fit_durations = []
xgb_score_durations = []
cat_scores = []
cat_fit_durations = []
cat_score_durations = []

for n_samples in n_samples_list:
    (sklearn_score,
     sklearn_fit_duration,
     sklearn_score_duration,
     lightgbm_score,
     lightgbm_fit_duration,
     lightgbm_score_duration,
     xgb_score,
     xgb_fit_duration,
     xgb_score_duration,
     cat_score,
     cat_fit_duration,
     cat_score_duration) = one_run(n_samples)

    for scores, score in (
            (sklearn_scores, sklearn_score),
            (sklearn_fit_durations, sklearn_fit_duration),
            (sklearn_score_durations, sklearn_score_duration),
            (lightgbm_scores, lightgbm_score),
            (lightgbm_fit_durations, lightgbm_fit_duration),
            (lightgbm_score_durations, lightgbm_score_duration),
            (xgb_scores, xgb_score),
            (xgb_fit_durations, xgb_fit_duration),
            (xgb_score_durations, xgb_score_duration),
            (cat_scores, cat_score),
            (cat_fit_durations, cat_fit_duration),
            (cat_score_durations, cat_score_duration)):
        scores.append(score)

fig, axs = plt.subplots(3, sharex=True)

axs[0].plot(n_samples_list, sklearn_scores, label='sklearn')
axs[1].plot(n_samples_list, sklearn_fit_durations, label='sklearn')
axs[2].plot(n_samples_list, sklearn_score_durations, label='sklearn')

if args.lightgbm:
    axs[0].plot(n_samples_list, lightgbm_scores, label='lightgbm')
    axs[1].plot(n_samples_list, lightgbm_fit_durations, label='lightgbm')
    axs[2].plot(n_samples_list, lightgbm_score_durations, label='lightgbm')

if args.xgboost:
    axs[0].plot(n_samples_list, xgb_scores, label='XGBoost')
    axs[1].plot(n_samples_list, xgb_fit_durations, label='XGBoost')
    axs[2].plot(n_samples_list, xgb_score_durations, label='XGBoost')

if args.catboost:
    axs[0].plot(n_samples_list, cat_scores, label='CatBoost')
    axs[1].plot(n_samples_list, cat_fit_durations, label='CatBoost')
    axs[2].plot(n_samples_list, cat_score_durations, label='CatBoost')

for ax in axs:
    ax.set_xscale('log')
    ax.legend(loc='best')
    ax.set_xlabel('n_samples')

axs[0].set_title('scores')
axs[1].set_title('fit duration (s)')
axs[2].set_title('score duration (s)')

title = args.problem
if args.problem == 'classification':
    title += ' n_classes = {}'.format(args.n_classes)
fig.suptitle(title)


plt.tight_layout()
plt.show()
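
As the script above shows, the new estimators are gated behind an explicit experimental import. A minimal, self-contained usage sketch (assuming a scikit-learn build that includes this PR; the parameter values are illustrative, not tuned):

# Minimal usage sketch: the enable_hist_gradient_boosting import must run
# before the estimators can be imported from sklearn.ensemble.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=31,
                                     learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy: {:.4f}".format(clf.score(X_test, y_test)))
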
123 changes: 123 additions & 0 deletions benchmarks/bench_hist_gradient_boosting_higgsboson.py
@@ -0,0 +1,123 @@
from urllib.request import urlretrieve
import os
from gzip import GzipFile
from time import time
import argparse

import numpy as np
import pandas as pd
from joblib import Memory
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.utils import (
    get_equivalent_estimator)


parser = argparse.ArgumentParser()
parser.add_argument('--n-leaf-nodes', type=int, default=31)
parser.add_argument('--n-trees', type=int, default=10)
parser.add_argument('--lightgbm', action="store_true", default=False)
parser.add_argument('--xgboost', action="store_true", default=False)
parser.add_argument('--catboost', action="store_true", default=False)
parser.add_argument('--learning-rate', type=float, default=1.)
parser.add_argument('--subsample', type=int, default=None)
parser.add_argument('--max-bins', type=int, default=255)
args = parser.parse_args()

HERE = os.path.dirname(__file__)
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00280/"
       "HIGGS.csv.gz")
m = Memory(location='/tmp', mmap_mode='r')

n_leaf_nodes = args.n_leaf_nodes
n_trees = args.n_trees
subsample = args.subsample
lr = args.learning_rate
max_bins = args.max_bins


@m.cache
def load_data():
    filename = os.path.join(HERE, URL.rsplit('/', 1)[-1])
    if not os.path.exists(filename):
        print(f"Downloading {URL} to {filename} (2.6 GB)...")
        urlretrieve(URL, filename)
        print("done.")

    print(f"Parsing {filename}...")
    tic = time()
    with GzipFile(filename) as f:
        df = pd.read_csv(f, header=None, dtype=np.float32)
    toc = time()
    print(f"Loaded {df.values.nbytes / 1e9:0.3f} GB in {toc - tic:0.3f}s")
    return df


df = load_data()
target = df.values[:, 0]
data = np.ascontiguousarray(df.values[:, 1:])
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=.2, random_state=0)

if subsample is not None:
    data_train, target_train = data_train[:subsample], target_train[:subsample]

n_samples, n_features = data_train.shape
print(f"Training set with {n_samples} records with {n_features} features.")

print("Fitting a sklearn model...")
tic = time()
est = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                     learning_rate=lr,
                                     max_iter=n_trees,
                                     max_bins=max_bins,
                                     max_leaf_nodes=n_leaf_nodes,
                                     n_iter_no_change=None,
                                     random_state=0,
                                     verbose=1)
est.fit(data_train, target_train)
toc = time()
predicted_test = est.predict(data_test)
predicted_proba_test = est.predict_proba(data_test)
roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
acc = accuracy_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")

if args.lightgbm:
    print("Fitting a LightGBM model...")
    tic = time()
    lightgbm_est = get_equivalent_estimator(est, lib='lightgbm')
    lightgbm_est.fit(data_train, target_train)
    toc = time()
    predicted_test = lightgbm_est.predict(data_test)
    predicted_proba_test = lightgbm_est.predict_proba(data_test)
    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
    acc = accuracy_score(target_test, predicted_test)
    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc:.4f}")

if args.xgboost:
    print("Fitting an XGBoost model...")
    tic = time()
    xgboost_est = get_equivalent_estimator(est, lib='xgboost')
    xgboost_est.fit(data_train, target_train)
    toc = time()
    predicted_test = xgboost_est.predict(data_test)
    predicted_proba_test = xgboost_est.predict_proba(data_test)
    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
    acc = accuracy_score(target_test, predicted_test)
    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc:.4f}")

if args.catboost:
    print("Fitting a CatBoost model...")
    tic = time()
    catboost_est = get_equivalent_estimator(est, lib='catboost')
    catboost_est.fit(data_train, target_train)
    toc = time()
    predicted_test = catboost_est.predict(data_test)
    predicted_proba_test = catboost_est.predict_proba(data_test)
    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
    acc = accuracy_score(target_test, predicted_test)
    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc:.4f}")
6 changes: 4 additions & 2 deletions build_tools/azure/test_pytest_soft_dependency.sh
@@ -8,8 +8,10 @@ conda remove -y py pytest || pip uninstall -y py pytest

if [[ "$COVERAGE" == "true" ]]; then
# Need to append the coverage to the existing .coverage generated by
# running the tests
CMD="coverage run --append"
# running the tests. Make sure to reuse the same coverage
# configuration as the one used by the main pytest run to be
# able to combine the results.
CMD="coverage run --rcfile=$BUILD_SOURCESDIRECTORY/.coveragerc"
else
CMD="python"
fi