
GBRT API consistency #1088


Closed
pprett wants to merge 14 commits
Conversation

@pprett (Member) commented Aug 30, 2012

This is a cherry-pick of #1036 that addresses #1085 and some other API issues of the GBRT module.

Summary:

  • Better docstrings
  • GradientBoostingClassifier now has `staged_decision_function`, `staged_predict_proba`, and `staged_predict`

These methods are important for efficient model selection for GBRT models.
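To see why, here is a minimal sketch (mine, not from the PR) of using a staged method to pick the number of boosting stages from a single fit; the dataset, the holdout split, and the parameter values are purely illustrative:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # illustrative synthetic data and a simple holdout split
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_valid = X[:800], X[800:]
    y_train, y_valid = y[:800], y[800:]

    clf = GradientBoostingClassifier(n_estimators=200)
    clf.fit(X_train, y_train)

    # staged_predict yields the model's predictions after each boosting
    # stage, so a single fit scores every candidate value of n_estimators
    scores = [np.mean(pred == y_valid) for pred in clf.staged_predict(X_valid)]
    best_n = int(np.argmax(scores)) + 1
    print("best number of stages: %d" % best_n)

Without the staged variants, the same selection would require refitting the model once per candidate value of n_estimators.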

"call `fit` before `feature_importances_`.")
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for stage in self.estimators_:
stage_sum = sum(tree.compute_feature_importances(method='gini')
@pprett (Member, Author):
This is important: GBRT now uses 'gini' for feature importances. This was a bug in the current code that @glouppe pointed out a while ago, but I thought it was ok - sorry about that. The feature importances don't differ much, though.
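As a usage illustration (again mine, with made-up data), the importances are read off a fitted model via the feature_importances_ property shown in the diff:

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    # in make_friedman1 only the first five features carry signal
    X, y = make_friedman1(n_samples=500, random_state=0)
    est = GradientBoostingRegressor(n_estimators=100).fit(X, y)

    # one value per feature, accumulated over the trees of every stage;
    # calling this before fit() raises the ValueError from the hunk above
    print(est.feature_importances_)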

@glouppe (Contributor):

I am curious, what made you change your mind? :)

@pprett (Member, Author):

On 2012/9/3, Gilles Louppe (notifications@github.com) wrote:

In sklearn/ensemble/gradient_boosting.py:

         for i in range(self.n_estimators):
             predict_stage(self.estimators_, i, X, self.learn_rate, score)
             yield score

    +    @property
    +    def feature_importances_(self):
    +        if self.estimators_ is None or len(self.estimators_) == 0:
    +            raise ValueError("Estimator not fitted, "
    +                             "call `fit` before `feature_importances_`.")
    +        total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    +        for stage in self.estimators_:
    +            stage_sum = sum(tree.compute_feature_importances(method='gini')

> I am curious, what made you change your mind? :)

The fact that you were right and I was wrong :-) - I had misread/misinterpreted the prose in ESL: "improvement in squared error risk over that for a constant fit in the entire region" - the original publication is much more specific about that.

I've noticed, though, that the feature rankings were remarkably stable despite the weighting by n_left and n_right.



Peter Prettenhofer
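For reference, this is the tree variable-importance definition pprett alludes to, as given in ESL (my transcription, so treat the notation as approximate): for a single tree T with J leaves,

    \mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}\left(v(t) = \ell\right),
    \qquad
    \mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m),

where the first sum runs over the J - 1 internal nodes, \hat{i}_t^2 is the improvement in squared-error risk produced by the split at node t, v(t) is the variable split on at that node, and the second formula averages the per-tree importances over the M boosted trees. The point of the fix above is that the old code added an extra weighting by n_left and n_right that is not part of this definition, though, as pprett notes, the resulting rankings barely changed.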

@ogrisel (Member) commented Aug 30, 2012

LGTM, +1 for API consistency. Can you document the changes in what's new?

@pprett (Member, Author) commented Aug 30, 2012

Sure - I'll finish that tomorrow.

@ogrisel (Member) commented Sep 3, 2012

@pprett thanks for the whatsnew fix. 👍 for merging by rebase on top of master.

@pprett (Member, Author) commented Sep 4, 2012

Thanks @ogrisel - I'll wait for another pair of eyes before I merge.

@glouppe (Contributor) commented Sep 4, 2012

I am +1 for merge.

@pprett (Member, Author) commented Sep 4, 2012

thx @glouppe - then I'll merge (by rebase)

@glouppe (Contributor) commented Sep 4, 2012

(By the way, how do you merge by rebase? And what are the advantages with respect to a usual merge? I never understood that :s)

@pprett (Member, Author) commented Sep 4, 2012

The only advantage I can see is that rebase keeps the history more consistent: when you work on a branch for quite a while and then merge to master, chances are that the relevant commits end up scattered all over the master history.

The drawback of rebasing is that your commits get new sha1 hashes, which causes pain when other people work on your branch. According to http://git-scm.com/book/en/Git-Branching-Rebasing you should not rebase once you have pushed your branch to a public repo (e.g. a github PR) - I totally agree.

Apart from that I hardly know git rebase, so take this with a grain of salt.

@pprett (Member, Author) commented Sep 4, 2012

Merged by rebase

@pprett closed this Sep 4, 2012
@GaelVaroquaux (Member) commented Sep 4, 2012

> (By the way, how do you merge by rebase?

  1. Update master,
  2. Pull and check out the corresponding branch,
  3. `git rebase master`,
  4. `git checkout master`,
  5. `git merge the_branch`,
  6. `git push`.

It's a bit tedious. If I get a lot of rebase conflicts, I switch to a merge, which can be easier.

> And what are the advantages with respect to a usual merge? I never understood that :s)

It gives you a linear history in which each patch is applied one after the other. That makes bisecting bugs much easier.

@GaelVaroquaux (Member) commented Sep 4, 2012

> Merged by rebase

Great, thanks.

@glouppe (Contributor) commented Sep 4, 2012

Didn't know that, thanks :)

