[MRG] Sparse input support for Gradient Boosting #3880


Closed
pprett wants to merge 13 commits

Conversation


@pprett pprett commented Nov 24, 2014

Adds sparse input support:

  • Falls back to BestSparseSplitter.
  • Holds the data in both CSC and CSR format during fit (the latter is needed only to call tree.apply for the terminal leaf assignment; see the sketch below).

Todo:

  • more & better tests
  • benchmarks
  • partial dependence

cc @arjoly @jnothman @glouppe
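
A minimal sketch of the dual-format pattern described above (illustrative only, not the actual code in this PR; X and tree stand in for the training data and a fitted tree):

import scipy.sparse as sp

# the splitter wants fast column access, so fit on CSC;
# tree.apply wants fast row access, so keep a CSR copy as well
X_csc = sp.csc_matrix(X)    # consumed by BestSparseSplitter during fit
X_csr = sp.csr_matrix(X)    # consumed by tree.apply / predict
leaves = tree.apply(X_csr)  # terminal leaf index per training sample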


pprett commented Nov 24, 2014

@arjoly this is an initial draft -- it's fully functional, but I need to run more benchmarks


pprett commented Nov 24, 2014

On covertype:

GBRT (25 trees of depth 3):

Classifier   train-time  test-time  error-rate
----------------------------------------------
dense          69.2286s    0.0894s      0.2201
sparse        105.2789s    0.1667s      0.2201

CART tree:

Classifier   train-time  test-time  error-rate
----------------------------------------------
dense          18.0742s    0.0215s      0.0423
sparse         63.5370s    0.0408s      0.0423


pprett commented Nov 24, 2014

Covertype is not very sparse -- but sparse enough that SGD gets an improvement:

Classifier   train-time  test-time  error-rate
----------------------------------------------
sparse          0.2035s    0.0042s      0.2301
dense           0.4488s    0.0235s      0.2302

for i in range(n_estimators):
    for k in range(K):
        tree = estimators[i, k].tree_
        out += scale * tree.predict(X).reshape((X.shape[0], 1))
Member Author

In the sparse case we fall back to tree.predict -- the specialized implementation above exists only to get good performance for very small X.

Member

Please put that remark as an inline comment.

Member Author

check


arjoly commented Nov 25, 2014

I am curious. Can you benchmark on 20 newsgroups? There is a benchmark script in the benchmarks directory.


arjoly commented Nov 25, 2014

If you have time, can you have a look at #3790? I would like to finish it and at least get some of the GBRT improvements from that PR in.

if sparse:
    to_sparse = sp.csc_matrix
    if order.lower() == 'c':
        to_sparse = sp.csr_matrix
Member

It's a bit hackish...

Member

I would make the sparse argument a str taking one of 'csr' / 'csc' / 'coo' / None (default).
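
A hypothetical sketch of that signature (the helper name and surrounding code are illustrative, not the actual utility touched by this PR):

import scipy.sparse as sp

def make_data(X, sparse=None):
    # sparse is one of None (keep dense), 'csr', 'csc', 'coo'
    if sparse is not None:
        to_sparse = {'csr': sp.csr_matrix,
                     'csc': sp.csc_matrix,
                     'coo': sp.coo_matrix}[sparse]
        X = to_sparse(X)
    return X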

Member Author

check


pprett commented Nov 25, 2014

I am curious. Can you benchmark on 20 newsgroups? There is a benchmark script in the benchmarks directory.

@arjoly I tried it -- it takes ages: five minutes to train a single decision stump; unbearable. I wonder whether we should create a dedicated DecisionStumpClassifier|Regressor that is optimized for sparse inputs.


pprett commented Nov 25, 2014

Strange... AdaBoost is way faster on this benchmark than GBRT, yet both use max_depth=1 and both use the BestSparseSplitter. @arjoly, do you know of any sparse-data enhancements to AdaBoost?


pprett commented Nov 25, 2014

@arjoly never mind -- stupid me... stupid me... stupid me...

Thou shalt not use gradient boosting for multi-class problems


pprett commented Nov 25, 2014

So -- here we go -- of course we try to distinguish between alt.atheism and soc.religion.christian:

Classifier           train-time  test-time  Accuracy
-----------------------------------------------------
dummy                   0.0003s    0.0006s    0.4658
naive_bayes             0.0207s    0.0032s    0.6750
cart                    0.3463s    0.0014s    0.7406
logistic_regression     0.1091s    0.0010s    0.8563
adaboost               31.0893s    0.1319s    0.9219
gbrt                   24.4430s    0.0965s    0.9247

AdaBoost and GBRT both use 100 stumps.


pprett commented Nov 25, 2014

The time difference between GBRT and AdaBoost is the difference between optimizing MSE and Gini, respectively.


arjoly commented Nov 25, 2014

Thanks for the benchmark. I will try to find time tomorrow to read further into this PR.

@mblondel

Holds the data in both CSC and CSR format during fit

Would be nice to find a way around this.


pprett commented Nov 25, 2014

Would be nice to find a way around this.

Yes, we currently need CSR to make predictions (to find which training example ended up in which leaf) and CSC for fitting the trees. If we recorded during tree fitting which training examples ended up in which leaf, we wouldn't need to run tree.apply; however, we would still need it for the out-of-bag examples (subsample < 1).

To make matters worse: our trees require CSR format for predictions (a sensible assumption), so if you want to make predictions with your model you need your data in CSR anyway. For example, in a grid search over a sparse matrix, each grid point will convert your X anyway (to either CSC or CSR).
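
For reference, the step that forces the CSR copy looks roughly like this (a sketch, not the exact code in this PR; tree_ stands in for a fitted tree):

X_csr = X.tocsr()            # one-time conversion during fit
leaves = tree_.apply(X_csr)  # terminal leaf index for every training sample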

@pprett pprett changed the title from "Sparse input support for Gradient Boosting" to "[MRG] Sparse input support for Gradient Boosting" on Nov 26, 2014

arjoly commented Nov 27, 2014

Making an apply for CSC matrices could be an option. It would require coding the same algorithm as for the depth-first tree builder.

    Each estimator in the stage is scaled by ``scale`` before
    its prediction is added to ``out``.
    """
    return predict_stages_sparse(estimators[stage:stage + 1], X, scale, out)
Member

Why introduce the pair of functions predict_stage_sparse / predict_stage_dense instead of checking sp.issparse(X) to dispatch to the right predict_stages_xxx method?
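
For illustration, the suggested dispatch might look like this (a sketch; only the predict_stages_sparse / predict_stages_dense names come from the PR):

import scipy.sparse as sp

def predict_stages(estimators, X, scale, out):
    # route to the sparse or dense implementation based on the input type
    if sp.issparse(X):
        return predict_stages_sparse(estimators, X, scale, out)
    return predict_stages_dense(estimators, X, scale, out)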

Member Author

check


ogrisel commented Nov 27, 2014

I think it's ok to hold data under both CSR and CSC during fit for now. +1 for merge on my side.


pprett commented Nov 27, 2014

I forgot to push the commits addressing the review.

@mblondel

Making an apply for CSC matrices could be an option. It would require coding the same algorithm as for the depth-first tree builder.

I haven't thought about it deeply but it should be possible to implement apply directly on a CSC matrix using a binary search to retrieve feature j in sample i.
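
A sketch of that lookup, assuming sorted row indices within each column (csc_value is a hypothetical helper, not code from this PR):

import numpy as np

def csc_value(X, i, j):
    # explicit entries of column j live in X.data[start:end], with
    # their row indices sorted in X.indices[start:end]
    start, end = X.indptr[j], X.indptr[j + 1]
    k = start + np.searchsorted(X.indices[start:end], i)
    if k < end and X.indices[k] == i:
        return X.data[k]
    return 0.0  # implicit zero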


ogrisel commented Nov 28, 2014

I haven't thought about it deeply but it should be possible to implement apply directly on a CSC matrix using a binary search to retrieve feature j in sample i.

That would be great. However, I don't want to delay this PR too much if it's not easy to do. We could add an inline comment where X_apply is pre-computed, noting that it would be better to add native CSC support to the apply method of trees to spare the in-memory copy of the input data.

@mblondel

I don't oppose merging but CSC support would be nice to have before the next release :)

@@ -20,6 +21,16 @@
from .gradient_boosting import BaseGradientBoosting


def _csc_col_minmax(X):
Member

Why not use min_max_axis from sklearn.utils.sparsefuncs?

Member Author

Good point -- I'll check min_max_axis. I still need a utility function, though, because I need the same return value as mquantile.
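
For reference, a sketch of the suggested utility (assuming X is a CSR or CSC matrix; min_max_axis returns both extrema in one pass):

from sklearn.utils.sparsefuncs import min_max_axis

# per-feature minima and maxima, computed without densifying X
mins, maxs = min_max_axis(X, axis=0)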


arjoly commented Jan 19, 2015

I don't oppose merging but CSC support would be nice to have before the next release :)

I have just implemented a CSC version of apply (see this branch). Unfortunately, it is considerably slower than the CSR implementation (benchmark here).

@mblondel

@arjoly Any clue why this is so slow? This seems like an algorithmic issue.


amueller commented Jun 8, 2015

@pprett didn't you say at the sprint this was merged? hum :-/


arjoly commented Sep 29, 2015

Done and superseded by #5252, thanks to @jmschrei.

@arjoly arjoly closed this Sep 29, 2015