[MRG] Gradient Boosting Classifier CV #5689

Closed
wants to merge 7 commits

Conversation

vighneshbirodkar
Contributor

A reincarnation of #1036 with early stopping. I will add the documentation and an example soon.
The early stopping API is inspired by that of xgboost.
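For context, this is roughly how xgboost exposes early stopping through its training API; the dataset construction and parameter values below are made up for illustration:

```python
# Illustrative only: xgboost's early stopping via its training API.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, 200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
dval = xgb.DMatrix(X[150:], label=y[150:])

booster = xgb.train(
    {'objective': 'binary:logistic'},
    dtrain,
    num_boost_round=500,
    evals=[(dval, 'validation')],
    # Stop once the validation metric has not improved for 10 rounds.
    early_stopping_rounds=10,
)
print(booster.best_iteration)
```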

@amueller
Member

amueller commented Nov 2, 2015

Awesome! I have to work on the release at the moment, but I'll review shortly.
Maybe @pprett @arjoly or @glouppe are interested.

@glouppe
Contributor

glouppe commented Nov 3, 2015

Thanks for working on this! This is indeed a very welcome addition :)

@glouppe
Contributor

glouppe commented Nov 3, 2015

Ping us when you want us to start reviewing.

@vighneshbirodkar
Contributor Author

@glouppe @arjoly @pprett @amueller
I think this is ready for an initial review; I want to get the structure right before I add the example and the GradientBoostingRegressionCV class. Here are some things that still need to be taken care of:

  • Support sparse data
  • For each set of parameters, I am currently returning the mean of n_estimators from the cross-validation folds. Is there a better way to do this? Maybe max? (See the sketch after this list.)
  • Should there be another class for regression? I could handle both cases by taking a base_model parameter in the constructor.
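A minimal sketch of the aggregation question in the second bullet, assuming the best n_estimators has already been found on each fold (the per-fold values are made up):

```python
import numpy as np

# Hypothetical best n_estimators found on each of five CV folds.
best_n_per_fold = np.array([87, 112, 95, 140, 103])

# Current behaviour described above: average across folds.
n_mean = int(np.mean(best_n_per_fold))  # 107

# The alternative raised in the bullet: take the maximum, which never
# truncates the fold that needed the most rounds, but risks overfitting
# relative to the folds that stopped earlier.
n_max = int(np.max(best_n_per_fold))  # 140
```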

@@ -146,7 +146,7 @@

 symbols, names = np.array(list(symbol_dict.items())).T

-quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)
+quotes = [finance.quotes_historical_yahoo_ohlc(symbol, d1, d2, asobject=True)
Member

I don't think this line should be modified.

@arjoly
Member

arjoly commented Nov 6, 2015

You don't seem to use the warm_start option, which is crucial for good speed performance.

parameter `n_estimators` for the gradient boosting model
"""

params['warm_start'] = True
Contributor Author

@arjoly the warm_start parameter is set here.

Member

Should we set this to True by default?

Member

clarified IRL
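For reference, a minimal sketch of the warm_start pattern under discussion, written against the public GradientBoostingClassifier API rather than this PR's code; the data, split, and step size are illustrative:

```python
# Instead of refitting from scratch for each candidate n_estimators,
# grow the same ensemble incrementally and score it along the way.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=10, warm_start=True)
scores = []
for n in range(10, 210, 10):
    gb.set_params(n_estimators=n)
    gb.fit(X_train, y_train)  # with warm_start=True, only new trees are fitted
    scores.append(gb.score(X_val, y_val))
```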

@amueller
Member

I'm surprised this didn't fail the tests with kwargs

@vighneshbirodkar
Contributor Author

@glouppe @arjoly @pprett @amueller
I don't know if these tests are sufficient. `predict` from the new classifier gives the same results as GridSearchCV, but `predict_proba` does not.

@vighneshbirodkar
Contributor Author

I have absolutely no idea why these tests are failing on Python 3.5. I haven't removed or changed any part of the code, only added to it. Moreover, I ran the tests locally in a Python 3.5 conda environment and all tests passed.

Any ideas @MechCoder @amueller ?

@vighneshbirodkar
Contributor Author

Here is the plot generated by the example
[Figure: plot generated by the example]

@raghavrv
Member

I'm gonna give a shot at reviewing this!

@vighneshbirodkar
Contributor Author

@rvraghav93 Please, be my guest.

@raghavrv
Member

> Should there be another class for regression? I could handle both cases by taking a base_model parameter in the constructor.

I think separate classes would be the way to go.

@raghavrv
Member

> `predict` from the new classifier gives the same results as GridSearchCV, but `predict_proba` does not

I fail to understand how. Could you explain a bit?

cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold

best_score_tuple_ : named tuple
Member

raghavrv May 12, 2016


Why is this different from GridSearchCV? (i.e., why not just best_score_?)

Contributor Author

Because, apart from the score, I thought of adding attributes telling the user how the score varies with the train/test split. For example, if it varies significantly, something might be wrong.

@vighneshbirodkar
Contributor Author

@rvraghav93
I was referring to the tests. What I meant is that the result of `predict` from grid search and my class is the same, but the result of `predict_proba` is not.

@raghavrv
Member

Ah okay, that makes sense now... But the `predict_proba` result is still expected to be the same, IIUC?

BTW are you in the middle of exams? If so I could ping you later.

@vighneshbirodkar
Contributor Author

@rvraghav93
No, I'm not. What's IIUC?

@raghavrv
Member

If I Understand Correctly.

@vighneshbirodkar
Contributor Author

@rvraghav93 I am not sure, because the two classes use significantly different numbers of estimators.
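A small illustration of that point: two boosted models with different numbers of estimators can agree on the argmax (`predict`) while their `predict_proba` outputs differ. The data here is made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

small = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
large = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X, y)

# The argmax of the probabilities can agree across the two models even
# though the probabilities themselves have drifted apart.
print(np.array_equal(small.predict(X), large.predict(X)))           # may be True
print(np.allclose(small.predict_proba(X), large.predict_proba(X)))  # typically False
```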

Parameters
----------
n_stop_rounds : int, optional, default=10
If the score on the test set rounded off to `score_precision` decimal
Member

I would rather use "validation set" instead of "test set" throughout this class. This class is about cross-validation for model selection. We are not allowed to use the "final" test set here, as our interest is not model evaluation per se but model selection.

@ogrisel
Member

ogrisel commented Jun 13, 2016

Also, could you please add GradientBoostingRegressorCV? I see no reason to include only classification and not regression in this PR.

@raghavrv
Member

@vighneshbirodkar Are you planning to continue this anytime soon? Or, if it's okay with you, I could lend a hand (either by cherry-picking your commits or with push access to this branch?). I have a couple of weeks to work on this.

@vighneshbirodkar
Contributor Author

@raghavrv I don't think I will be working on this anytime soon. Which of these two ways is more convenient for you? Most of Olivier's comments are minor refactoring changes. About the benchmarking for the threading backend, do you plan to do it now?

@raghavrv
Member

If it's okay with you, I can cherry-pick into a new PR?

> About the benchmarking for the threading backend, do you plan to do it now?

Yes...

@raghavrv
Member

@jnothman Should we also use a dict of NumPy (masked) arrays here, like we did in GridSearchCV, instead of using the _CVScoreTuple?

@raghavrv
Member

Also pinging @amueller regarding this...
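For concreteness, a sketch of the suggested alternative: a dict of (possibly masked) NumPy arrays keyed by result name, in the style of GridSearchCV's cv_results_. The keys and values below are illustrative:

```python
import numpy as np

# Hypothetical results for three candidate learning rates, keyed the way
# GridSearchCV's cv_results_ keys its arrays.
cv_results = {
    'param_learning_rate': np.ma.MaskedArray([0.01, 0.1, 1.0], mask=[False] * 3),
    'mean_test_score': np.array([0.91, 0.94, 0.88]),
    'std_test_score': np.array([0.02, 0.01, 0.05]),
    'rank_test_score': np.array([2, 1, 3]),
}
best_index = int(np.argmin(cv_results['rank_test_score']))  # 1
```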

params['warm_start'] = True
gb = estimator(**params)
scorer = check_scoring(estimator, scoring=scoring)
scores = np.ones((stop_rounds,))
Member

Shouldn't this be np.full((stop_rounds,), -np.inf)?
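One plausible reading of why the seed value matters, sketched as a stand-alone loop; the variable names follow the snippet above and the validation scores are made up:

```python
import numpy as np

stop_rounds = 10       # n_stop_rounds in the docstring above
score_precision = 3

# Ring buffer of the last `stop_rounds` validation scores. Seeding it
# with -inf guarantees the first real scores always count as progress;
# np.ones would make the loop compare against a fictitious score of 1.0
# and could stop on the very first round.
scores = np.full((stop_rounds,), -np.inf)

validation_scores = [0.70, 0.74, 0.76, 0.76, 0.75, 0.76] * 4  # made-up scores

for i, score in enumerate(validation_scores):
    rounded = round(score, score_precision)
    oldest = scores[i % stop_rounds]   # score from `stop_rounds` rounds ago
    if rounded <= oldest:              # no net improvement over the window
        break
    scores[i % stop_rounds] = rounded
```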

@Zhuifeng414

Wonderful! This gives me lots of help, thank you.

@amueller
Member

Replaced by #7071.

amueller closed this Sep 27, 2018