
[WIP] Alternative architecture for regression tree #8458

Status: Closed · wants to merge 64 commits

Commits (64)
ab30b43
implement new criterion and splitter
Feb 7, 2017
001fd30
Making it all work together
raghavrv Feb 8, 2017
49d9bf3
Expose _add_node to python level
raghavrv Feb 8, 2017
bf7a850
Debugging fixes
raghavrv Feb 8, 2017
95b8343
Update init
raghavrv Feb 8, 2017
3ad0349
fix issue propagating stats to child nodes
Feb 8, 2017
7d7f25f
Finish refactoring before debugging
Feb 9, 2017
8e028a2
remove temporary property and correct init
Feb 9, 2017
013abdb
fixed the impurity improvement bug and iadd
Feb 9, 2017
7b4703f
Do not re-check the samples in a leaf
Feb 10, 2017
05e2db7
Clarify the deepcopy
Feb 10, 2017
2ccb28f
fix bug samples mapping
Feb 10, 2017
7269591
add the possibility to select shuffled features
Feb 10, 2017
293d953
remove useless comment
Feb 10, 2017
9f18d6b
fix bug update X_nid
Feb 12, 2017
d4cc0d0
clean the printing
Feb 12, 2017
48405c5
Use FEATURE_THRESHOLD to detect constant features
ogrisel Feb 13, 2017
02c012a
Merge pull request #1 from ogrisel/newtree-fixes
glemaitre Feb 13, 2017
71e5468
Move the tree in a new module for testing purposes
Feb 16, 2017
e3a3efb
fitting running
Feb 16, 2017
015528a
EXA tree
Feb 22, 2017
4fea914
Make it python3 compatible
raghavrv Feb 22, 2017
b090c98
Update the node value for all the parent nodes
raghavrv Feb 22, 2017
040771f
FIX solve issue with the prediction
Feb 22, 2017
cb3f6bc
TST test stats node class
Feb 22, 2017
0ddf44a
FIX rename residuals to y
Feb 22, 2017
3a9d940
revert __init__ in tree module
Feb 25, 2017
3dc03c3
TST testing split_record
Feb 26, 2017
1747fd7
FIX remove backup files
raghavrv Feb 26, 2017
aed1be9
TST criterion and split_record tests
Feb 26, 2017
6206f6e
TST test for splitter
Feb 27, 2017
0125a23
PEP8
Feb 27, 2017
4ef497b
TST/FIX solve issue for prediction of the tree - identical to Decisio…
Feb 27, 2017
f8d2d52
ENH Use copy_to instead of deepcopy (speedup +50%)
raghavrv Mar 1, 2017
6e625ac
FIX remove useless test
Mar 7, 2017
9abbafd
FIX use sklearn.utils.random.choice instead of numpy
Mar 7, 2017
be5724b
FIX take into account min_samples_leaf and min_samples_split
Mar 8, 2017
8f0da72
FIX resize tree nodes after fitting
Mar 8, 2017
53d7bd8
FIX solve bug inequality of max_samples_split ... _leaf
Mar 8, 2017
ad28f9f
FIX avoid computation for constant feature
Mar 10, 2017
4228377
FIX/TST correct depth count
Mar 10, 2017
e195ce9
PEP8/FIX solve bug if no feature visited
Mar 10, 2017
1b4b9ac
FIX/TST ensure right behaviour of constant feature
Mar 10, 2017
6535dbe
cythonizing subclasses
glemaitre Sep 9, 2017
f28a244
FIX diverse bugs before further cythonization
glemaitre Sep 11, 2017
adf4787
make cdef function instead of cpdef
glemaitre Sep 12, 2017
aefbe3f
stats_node C struct
glemaitre Sep 13, 2017
c897b53
Create structure instead of cython class
glemaitre Sep 14, 2017
6ea1c61
Cythonize every component
glemaitre Sep 22, 2017
fb16e65
inline some functions called in loops
glemaitre Sep 22, 2017
36b1d7b
compute proxy in one function
glemaitre Sep 23, 2017
dae3057
Working for the moment
glemaitre Sep 24, 2017
7f12bdf
Improve the reassignment
glemaitre Sep 24, 2017
9637b70
Refactor code
glemaitre Sep 24, 2017
54662b0
Fix bug when max_depth is not None
glemaitre Sep 25, 2017
5db33c8
make it float32
glemaitre Sep 28, 2017
dd3a0c9
add missing folder
glemaitre Sep 28, 2017
e4b4c07
Merge remote-tracking branch 'origin/master' into refactor_tree
glemaitre Sep 28, 2017
f30f14a
Modify Gradient Boosting for benchmark
glemaitre Sep 28, 2017
2f53f03
iter
glemaitre Sep 28, 2017
560787b
iter
glemaitre Sep 28, 2017
a857b92
change gradient tree
glemaitre Sep 28, 2017
c9ab96e
Merge remote-tracking branch 'origin' into refactor_tree
glemaitre Nov 17, 2017
9cae475
iter
glemaitre Nov 20, 2017
Files changed
3 changes: 2 additions & 1 deletion sklearn/__init__.py
@@ -143,7 +143,8 @@ def config_context(**new_config):
'mixture', 'model_selection', 'multiclass', 'multioutput',
'naive_bayes', 'neighbors', 'neural_network', 'pipeline',
'preprocessing', 'random_projection', 'semi_supervised',
- 'svm', 'tree', 'discriminant_analysis',
+ 'svm', 'tree', 'regression_tree', 'discriminant_analysis',
+ 'exact_tree',
# Non-modules:
'clone']

2 changes: 1 addition & 1 deletion sklearn/_build_utils/__init__.py
@@ -81,4 +81,4 @@ def maybe_cythonize_extensions(top_path, config):
exc.args += (message,)
raise

- config.ext_modules = cythonize(config.ext_modules)
+ config.ext_modules = cythonize(config.ext_modules, annotate=True)
6 changes: 3 additions & 3 deletions sklearn/ensemble/_gradient_boosting.pyx
@@ -18,12 +18,12 @@ np.import_array()
from scipy.sparse import issparse
from scipy.sparse import csr_matrix

- from sklearn.tree._tree cimport Node
- from sklearn.tree._tree cimport Tree
+ from sklearn.exact_tree._tree cimport Node
+ from sklearn.exact_tree._tree cimport Tree
from sklearn.tree._tree cimport DTYPE_t
from sklearn.tree._tree cimport SIZE_t
from sklearn.tree._tree cimport INT32_t
- from sklearn.tree._utils cimport safe_realloc
+ from sklearn.exact_tree._utils cimport safe_realloc

ctypedef np.int32_t int32
ctypedef np.float64_t float64
59 changes: 39 additions & 20 deletions sklearn/ensemble/gradient_boosting.py
@@ -29,6 +29,7 @@
from .base import BaseEnsemble
from ..base import ClassifierMixin
from ..base import RegressorMixin
+ from ..base import clone
from ..externals import six

from ._gradient_boosting import predict_stages
@@ -47,8 +48,10 @@
from time import time
from ..model_selection import train_test_split
from ..tree.tree import DecisionTreeRegressor
- from ..tree._tree import DTYPE
- from ..tree._tree import TREE_LEAF
+ # from ..tree._tree import DTYPE
+ # from ..tree._tree import TREE_LEAF
+ DTYPE = np.float32
+ TREE_LEAF = -1

from ..utils import check_random_state
from ..utils import check_array
@@ -726,7 +729,7 @@ def __init__(self, loss, learning_rate, n_estimators, criterion,
random_state, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto',
validation_fraction=0.1, n_iter_no_change=None,
- tol=1e-4):
+ tol=1e-4, estimator=None):

self.n_estimators = n_estimators
self.learning_rate = learning_rate
@@ -750,6 +753,7 @@ def __init__(self, loss, learning_rate, n_estimators, criterion,
self.validation_fraction = validation_fraction
self.n_iter_no_change = n_iter_no_change
self.tol = tol
+ self.estimator = estimator

def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
random_state, X_idx_sorted, X_csc=None, X_csr=None):
@@ -767,19 +771,34 @@ def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
sample_weight=sample_weight)

# induce regression tree on residuals
-        tree = DecisionTreeRegressor(
-            criterion=self.criterion,
-            splitter='best',
-            max_depth=self.max_depth,
-            min_samples_split=self.min_samples_split,
-            min_samples_leaf=self.min_samples_leaf,
-            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
-            min_impurity_decrease=self.min_impurity_decrease,
-            min_impurity_split=self.min_impurity_split,
-            max_features=self.max_features,
-            max_leaf_nodes=self.max_leaf_nodes,
-            random_state=random_state,
-            presort=self.presort)
+        if self.estimator is None:
+            tree = DecisionTreeRegressor(
+                criterion=self.criterion,
+                splitter='best',
+                max_depth=self.max_depth,
+                min_samples_split=self.min_samples_split,
+                min_samples_leaf=self.min_samples_leaf,
+                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
+                min_impurity_decrease=self.min_impurity_decrease,
+                min_impurity_split=self.min_impurity_split,
+                max_features=self.max_features,
+                max_leaf_nodes=self.max_leaf_nodes,
+                random_state=random_state,
+                presort=self.presort)
+        else:
+            tree = clone(self.estimator)
+            tree.set_params(
+                criterion=self.criterion,
+                splitter='best',
+                max_depth=self.max_depth,
+                min_samples_split=self.min_samples_split,
+                min_samples_leaf=self.min_samples_leaf,
+                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
+                min_impurity_decrease=self.min_impurity_decrease,
+                min_impurity_split=self.min_impurity_split,
+                max_features=self.max_features,
+                max_leaf_nodes=self.max_leaf_nodes,
+                random_state=random_state)

if self.subsample < 1.0:
# no inplace multiplication!
@@ -1526,7 +1545,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
random_state=None, max_features=None, verbose=0,
max_leaf_nodes=None, warm_start=False,
presort='auto', validation_fraction=0.1,
- n_iter_no_change=None, tol=1e-4):
+ n_iter_no_change=None, tol=1e-4, estimator=None):

super(GradientBoostingClassifier, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
@@ -1541,7 +1560,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
min_impurity_split=min_impurity_split,
warm_start=warm_start, presort=presort,
validation_fraction=validation_fraction,
- n_iter_no_change=n_iter_no_change, tol=tol)
+ n_iter_no_change=n_iter_no_change, tol=tol, estimator=estimator)

def _validate_y(self, y):
check_classification_targets(y)
@@ -1969,7 +1988,7 @@ def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
min_impurity_split=None, init=None, random_state=None,
max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto', validation_fraction=0.1,
- n_iter_no_change=None, tol=1e-4):
+ n_iter_no_change=None, tol=1e-4, estimator=None):

super(GradientBoostingRegressor, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
@@ -1983,7 +2002,7 @@ def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
random_state=random_state, alpha=alpha, verbose=verbose,
max_leaf_nodes=max_leaf_nodes, warm_start=warm_start,
presort=presort, validation_fraction=validation_fraction,
- n_iter_no_change=n_iter_no_change, tol=tol)
+ n_iter_no_change=n_iter_no_change, tol=tol, estimator=estimator)

def predict(self, X):
"""Predict regression target for X.
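The net effect of the gradient_boosting.py changes is an experimental estimator keyword: when it is left as None, each boosting stage fits the stock DecisionTreeRegressor as before; otherwise each stage clones the supplied estimator and refits it on the residuals. A minimal sketch of how this branch could be exercised (the estimator keyword and sklearn.exact_tree exist only on this WIP branch; RegressionTree's constructor is assumed to take no required arguments):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.exact_tree import RegressionTree  # WIP branch only

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# estimator=None (the default): each stage fits a DecisionTreeRegressor.
gbr_default = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                        random_state=0)

# With the new keyword, each stage clones and refits the supplied tree.
gbr_exact = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                      random_state=0,
                                      estimator=RegressionTree())

for model in (gbr_default, gbr_exact):
    model.fit(X, y)
    print(model.estimator, model.train_score_[-1])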
4 changes: 4 additions & 0 deletions sklearn/exact_tree/__init__.py
@@ -0,0 +1,4 @@
from .tree import RegressionTree


__all__ = ['RegressionTree']
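The new module exposes a single public estimator. The test commits above ("TST/FIX solve issue for prediction of the tree - identical to Decisio…") suggest it is meant to be interchangeable with DecisionTreeRegressor, so standalone use would plausibly look like the following sketch (constructor arguments are assumed, not confirmed by this diff):

from sklearn.datasets import make_regression
from sklearn.exact_tree import RegressionTree  # WIP branch only

X, y = make_regression(n_samples=200, n_features=5, random_state=42)

# max_depth is assumed to be accepted, mirroring DecisionTreeRegressor.
reg = RegressionTree(max_depth=4)
reg.fit(X, y)
print(reg.predict(X[:3]))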
38 changes: 38 additions & 0 deletions sklearn/exact_tree/_criterion.pxd
@@ -0,0 +1,38 @@
#cython: cdivision=True
from ._stats_node cimport StatsNode


cdef inline double _impurity_mse(StatsNode* stats_node):
cdef double impurity
impurity = (stats_node[0].sum_sq_y /
stats_node[0].sum_weighted_samples)
impurity -= ((stats_node[0].sum_y /
stats_node[0].sum_weighted_samples) ** 2)

return impurity


cdef inline double impurity_improvement(StatsNode* c_stats,
StatsNode* l_stats,
StatsNode* r_stats,
double sum_total_weighted_samples):
# FIXME: only using MSE for the moment
c_impurity = _impurity_mse(c_stats)
l_impurity = _impurity_mse(l_stats)
r_impurity = _impurity_mse(r_stats)

return ((c_stats[0].sum_weighted_samples /
sum_total_weighted_samples) *
(c_impurity -
(l_stats[0].sum_weighted_samples /
sum_total_weighted_samples * l_impurity) -
(r_stats[0].sum_weighted_samples /
sum_total_weighted_samples * r_impurity)))


cdef inline double proxy_impurity_improvement(StatsNode* l_stats,
StatsNode* r_stats):
return ((l_stats[0].sum_y * l_stats[0].sum_y) /
l_stats[0].sum_weighted_samples +
(r_stats[0].sum_y * r_stats[0].sum_y) /
r_stats[0].sum_weighted_samples)
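_impurity_mse is the weighted variance of y in a node computed from running sums: impurity = sum_sq_y / W - (sum_y / W)**2, with W the node's total sample weight. impurity_improvement is the weighted impurity decrease achieved by a split (note that the children's impurities here appear to be scaled by the total weight of the data rather than by the parent node's weight, which differs from the formula in mainline sklearn.tree._criterion). proxy_impurity_improvement keeps only the split-dependent part: for a fixed parent node, sum_sq_y and the constant terms do not change with the split position, so maximizing sum_y_L**2 / W_L + sum_y_R**2 / W_R ranks candidate thresholds identically while being cheaper per candidate. A plain NumPy illustration of these formulas (not the PR's code):

import numpy as np

def impurity_mse(y, w):
    """Weighted MSE impurity, matching _impurity_mse:
    sum_sq_y / W - (sum_y / W)**2 with W = sum(w)."""
    W = w.sum()
    return (w * y**2).sum() / W - ((w * y).sum() / W) ** 2

def improvement(y, w, mask, total_weight):
    """Mirrors impurity_improvement: weighted impurity decrease of
    splitting the node into y[mask] (left) and y[~mask] (right)."""
    W_c, W_l, W_r = w.sum(), w[mask].sum(), w[~mask].sum()
    return (W_c / total_weight) * (
        impurity_mse(y, w)
        - W_l / total_weight * impurity_mse(y[mask], w[mask])
        - W_r / total_weight * impurity_mse(y[~mask], w[~mask]))

def proxy(y, w, mask):
    """Mirrors proxy_impurity_improvement: the split-dependent part only,
    which ranks candidate splits of one node the same as improvement()."""
    s_l, s_r = (w[mask] * y[mask]).sum(), (w[~mask] * y[~mask]).sum()
    return s_l**2 / w[mask].sum() + s_r**2 / w[~mask].sum()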
Empty file.
60 changes: 60 additions & 0 deletions sklearn/exact_tree/_split_record.pxd
@@ -0,0 +1,60 @@
from libc.math cimport NAN, INFINITY

from ._stats_node cimport StatsNode
from ._stats_node cimport stats_node_copy_to
from ._stats_node cimport stats_node_clear

from ._criterion cimport _impurity_mse


cdef struct SplitRecord:
int feature
int pos
float threshold
float impurity
float impurity_improvement
int nid
StatsNode c_stats
StatsNode l_stats
StatsNode r_stats


cdef void split_record_reset(SplitRecord* split_record, int feature,
int pos, float threshold,
float impurity,
float impurity_improvement, int nid,
StatsNode* c_stats, StatsNode* l_stats,
StatsNode* r_stats)


cdef inline void split_record_clear(SplitRecord* split_record):
split_record[0].feature = 0
split_record[0].pos = 0
split_record[0].threshold = NAN
split_record[0].impurity = INFINITY
split_record[0].impurity_improvement = -INFINITY
split_record[0].nid = 0

stats_node_clear(&split_record.c_stats)
stats_node_clear(&split_record.l_stats)
stats_node_clear(&split_record.r_stats)


cdef inline void split_record_expand_record(SplitRecord* split_record,
SplitRecord* left_split_record,
SplitRecord* right_split_record):
split_record_clear(left_split_record)
stats_node_copy_to(&split_record[0].l_stats, &left_split_record[0].c_stats)
stats_node_copy_to(&split_record[0].l_stats, &left_split_record[0].r_stats)
left_split_record[0].impurity = _impurity_mse(
&left_split_record[0].c_stats)

split_record_clear(right_split_record)
stats_node_copy_to(&split_record.r_stats, &right_split_record[0].c_stats)
stats_node_copy_to(&split_record.r_stats, &right_split_record[0].r_stats)
right_split_record[0].impurity = _impurity_mse(
&right_split_record[0].c_stats)


cdef void split_record_copy_to(SplitRecord* src_split_record,
SplitRecord* dst_split_record)
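split_record_expand_record prepares the two child records once a node's best split is final: each child's c_stats (its own node statistics) is one side of the parent's split, its r_stats starts as a copy of those same statistics, and l_stats is left cleared. In other words, the next scan of a child node starts with every sample "on the right" and moves samples left one candidate threshold at a time. A simplified pure-Python rendering of that bookkeeping (StatsNode is reduced to the three sums the criterion uses; illustration only):

from dataclasses import dataclass, field, replace

@dataclass
class StatsNode:
    sum_y: float = 0.0
    sum_sq_y: float = 0.0
    sum_weighted_samples: float = 0.0

@dataclass
class SplitRecord:
    feature: int = 0
    pos: int = 0
    threshold: float = float("nan")
    impurity: float = float("inf")
    impurity_improvement: float = float("-inf")
    nid: int = 0
    c_stats: StatsNode = field(default_factory=StatsNode)
    l_stats: StatsNode = field(default_factory=StatsNode)
    r_stats: StatsNode = field(default_factory=StatsNode)

def expand(parent):
    """Mirror of split_record_expand_record: build the child records.
    Each child's node stats come from one side of the parent's split;
    r_stats is seeded with the same sums (everything starts 'right' of
    the next candidate threshold) and l_stats stays empty."""
    left = SplitRecord(c_stats=replace(parent.l_stats),
                       r_stats=replace(parent.l_stats))
    right = SplitRecord(c_stats=replace(parent.r_stats),
                        r_stats=replace(parent.r_stats))
    # (The Cython version also recomputes each child's impurity here
    # via _impurity_mse.)
    return left, right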
32 changes: 32 additions & 0 deletions sklearn/exact_tree/_split_record.pyx
@@ -0,0 +1,32 @@
cdef void split_record_reset(SplitRecord* split_record, int feature,
int pos, float threshold,
float impurity,
float impurity_improvement, int nid,
StatsNode* c_stats, StatsNode* l_stats,
StatsNode* r_stats):
split_record[0].feature = feature
split_record[0].pos = pos
split_record[0].threshold = threshold
split_record[0].impurity = impurity
split_record[0].impurity_improvement = impurity_improvement

stats_node_copy_to(c_stats, &split_record.c_stats)
stats_node_copy_to(l_stats, &split_record.l_stats)
stats_node_copy_to(r_stats, &split_record.r_stats)


cdef inline void split_record_copy_to(SplitRecord* src_split_record,
SplitRecord* dst_split_record):
dst_split_record[0].feature = src_split_record[0].feature
dst_split_record[0].pos = src_split_record[0].pos
dst_split_record[0].threshold = src_split_record[0].threshold
dst_split_record[0].impurity = src_split_record[0].impurity
dst_split_record[0].impurity_improvement = src_split_record[0].impurity_improvement
dst_split_record[0].nid = src_split_record[0].nid

stats_node_copy_to(&src_split_record[0].c_stats,
&dst_split_record[0].c_stats)
stats_node_copy_to(&src_split_record[0].l_stats,
&dst_split_record[0].l_stats)
stats_node_copy_to(&src_split_record[0].r_stats,
&dst_split_record[0].r_stats)