Skip to content

[MRG+1] split tree module into several packages #5230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 9, 2015

Conversation

jmschrei
Copy link
Member

@jmschrei jmschrei commented Sep 8, 2015

This pull request handles splitting _tree.pyx into _tree.pyx with Tree and TreeBuilders, _splitter.pyx with all splitters and split records, _criterion.pyx with all criteria, and moves some functions into _utils.pyx. No code is changed, simply moved, in this PR. I almost have this working, with all unit tests passing except those which trigger a specific segfault. The segfault relates to the following function:

cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
    """Encapsulate data into a 1D numpy array of intp's."""
    cdef np.npy_intp shape[1]
    shape[0] = <np.npy_intp> size
    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data)

specifically, this call:

np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data)

In the test case, shape = 1, data = [ 2 ], and it triggers a segfault. I have been reading about PyArray_SimpleNewFromData but figured that it might be easier to ask. So, (1) do you agree with this layout, and (2) do you know why this segfault is being triggered? Thanks!

cc @glouppe @arjoly @ogrisel

@glouppe
Copy link
Contributor

glouppe commented Sep 8, 2015

In all files using Numpy, make sure that the C-API is properly initialized using cimport numpy as np and np.import_array().

See https://github.com/cython/cython/wiki/tutorials-numpy#c-api-initalization

This should fix the segfault I believe.

I'll check the structure tomorrow.

@jmschrei
Copy link
Member Author

jmschrei commented Sep 8, 2015

Solved. You're a genius. :)

@jmschrei jmschrei changed the title [WIP] split tree module into several packages [MRG] split tree module into several packages Sep 8, 2015
@glouppe
Copy link
Contributor

glouppe commented Sep 8, 2015

Solved. You're a genius. :)

I got bit by this very obscure error several times and learned from it :)

@@ -1074,6 +1074,7 @@ def _init_decision_function(self, X):
raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
self.n_features, X.shape[1]))
score = self.init_.predict(X).astype(np.float64)
print(score)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this with the first round of comments.

@glouppe
Copy link
Contributor

glouppe commented Sep 9, 2015

As a double check, could you quickly make sure that there is no regression in terms of speed? (Should be all the same, but we are never too sure)

(I am saying this because these changes add 12000 new lines of C code)

@glouppe
Copy link
Contributor

glouppe commented Sep 9, 2015

For _tree.pyx the diff is too long, but it seems you removed unwillingly some comments delimiting sections, e.g. above the definition TreeBuilder (line 79).

@glouppe
Copy link
Contributor

glouppe commented Sep 9, 2015

Other than my comments, I am happy with the code structure. I hope it will make our lives easier for the next PRs :)

@jmschrei jmschrei mentioned this pull request Sep 9, 2015
12 tasks
@@ -14,7 +14,7 @@

from ..externals import six

from . import _tree
from . import _tree, _criterion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have one import per line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better to have one import per line as it's easier to see changes whenever you use a software like git.

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

(I am saying this because these changes add 12000 new lines of C code)

The culprit is cython and this is well known. Cython adds a lot of functions to handle all sort of situations without letting you know.

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

Looks good, thanks @jmschrei

@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

All changes incorporated, and all unit tests pass on my computer. Lets see if travis/appveyor build.

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

A last nitpick if you agree, can you sort import alphabetically?

Thanks a lot for this pr.

@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

Times seem to be the same as well. All estimators use default settings and n_estimators=10 if they are an ensemble.

I'll alphabetize imports now.

BRANCH
RandomForestClassifier
madelon    0.623 0.26
Gaussian   0.914 4.072
arcene     0.76 0.02
spambase   0.941 0.046
mnist      0.948 4.902
covtypes   0.945 14.429

ExtraTreesClassifier
madelon    0.592 0.079
Gaussian   0.919 0.657
arcene     0.73 0.013
spambase   0.945 0.041
mnist      0.95 4.642
covtypes   0.943 12.037

GradientBoostingClassifier
madelon    0.74 0.657
Gaussian   0.853 8.004
arcene     0.74 0.843
spambase   0.911 0.092
mnist      0.836 165.77
covtypes   0.706 95.89

DecisionTreeClassifier
madelon    0.74 0.634
Gaussian   0.734 10.386
arcene     0.63 0.192
spambase   0.917 0.065
mnist      0.882 22.677
covtypes   0.943 10.317

GradientBoostingRegressor
boston 0.0 0.008
regression 0.0 1.778
diabetes 0.0 0.006

ExtraTreesRegressor
boston 0.02 0.019
regression 0.0 3.068
diabetes 0.0 0.015

RandomForestRegressor
boston 0.0 0.03
regression 0.0 10.94
diabetes 0.0 0.022

DecisionTreeRegressor
boston 0.02 0.004
regression 0.0 1.723
diabetes 0.0 0.003

MASTER
RandomForestClassifier
madelon    0.623 0.262
Gaussian   0.914 4.53
arcene     0.76 0.029
spambase   0.941 0.051
mnist      0.948 5.41
covtypes   0.945 14.127

ExtraTreesClassifier
madelon    0.592 0.088
Gaussian   0.919 0.75
arcene     0.73 0.014
spambase   0.945 0.045
mnist      0.95 4.926
covtypes   0.943 12.366

GradientBoostingClassifier
madelon    0.74 0.667
Gaussian   0.853 8.049
arcene     0.74 0.858
spambase   0.911 0.092
mnist      0.836 168.6
covtypes   0.706 97.512

DecisionTreeClassifier
madelon    0.74 0.637
Gaussian   0.734 10.428
arcene     0.63 0.191
spambase   0.917 0.066
mnist      0.882 22.735
covtypes   0.943 10.363

GradientBoostingRegressor
boston 0.0 0.007
regression 0.0 1.691
diabetes 0.0 0.006

ExtraTreesRegressor
boston 0.02 0.017
regression 0.0 2.888
diabetes 0.0 0.014

RandomForestRegressor
boston 0.0 0.03
regression 0.0 10.875
diabetes 0.0 0.022

DecisionTreeRegressor
boston 0.02 0.004
regression 0.0 1.762
diabetes 0.0 0.003

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

Could you run benchmarks/bench_covertype.py?

@glouppe
Copy link
Contributor

glouppe commented Sep 9, 2015

Thanks for the checks. +1 on my side

@glouppe glouppe changed the title [MRG] split tree module into several packages [MRG+1] split tree module into several packages Sep 9, 2015
@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

Windows is not happy. I think maybe I'm not allowed to import inline functions?

@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

Everything looks good, commits have been squashed. @arjoly do you give your +1?

@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

benchmark/bench_covertype

Classification performance:
===========================
Classifier   train-time test-time error-rate
--------------------------------------------
CART          12.0291s   0.0214s     0.0424  
liblinear     8.8294s    0.0124s     0.2305  
SGD           0.5443s    0.0170s     0.2315  
GaussianNB    0.2599s    0.0788s     0.4841 

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

I was thinking to this benchmark.

± make in &&  python benchmarks/bench_covertype.py --classifiers CART ExtraTrees RandomForest GBRT

# Master

Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  45.6812s   0.3655s     0.0330  
ExtraTrees    34.7320s   0.5940s     0.0372  
CART          11.7957s   0.0919s     0.0424  
GBRT         446.7187s   0.2867s     0.1777  

# this pr

Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  47.4927s   0.3775s     0.0330  
ExtraTrees    33.4654s   0.5784s     0.0372  
CART          12.3618s   0.0693s     0.0424  
GBRT         447.3703s   0.2879s     0.1777  

Overall it looks good. I am merging.

arjoly added a commit that referenced this pull request Sep 9, 2015
[MRG+1] split tree module into several packages
@arjoly arjoly merged commit 77c963d into scikit-learn:master Sep 9, 2015
@jmschrei
Copy link
Member Author

jmschrei commented Sep 9, 2015

🎺

@arjoly
Copy link
Member

arjoly commented Sep 9, 2015

thanks @jmschrei I am eager to see new speed up! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants