MNT deprecating presort #14907
Conversation
And we need an entry in the what's new noting that this is deprecated.

Code for a quick benchmark:

```python
import matplotlib.pyplot as plt
import neurtu
import numpy as np
import pandas as pd


def full_bench():
    from itertools import product

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import ParameterGrid

    dict_grid = {
        'presort': [True, False],
        'max_depth': [3],
    }
    param_grid = ParameterGrid(dict_grid)

    all_samples = [1000, 5000, 10000, 50000]
    all_features = [10, 30, 50]
    for n_samples, n_features in product(all_samples, all_features):
        X, y = make_classification(
            n_samples=n_samples, n_features=n_features, random_state=42
        )
        for params in param_grid:
            clf = GradientBoostingClassifier(**params, random_state=42)
            tags = params.copy()
            tags['n_samples'] = n_samples
            tags['n_features'] = n_features
            print(f"tags={tags}")
            yield neurtu.delayed(clf, tags=tags).fit(X, y)


bench = neurtu.Benchmark(
    wall_time=True, cpu_time=False, peak_memory=False, repeat=3
)
results = bench(full_bench())
```

I'll post the plot soon.
Awaiting the benchmark; the code change is OK.

So we can expect a slowdown of roughly 2.5x-3x.
Thanks for the benchmark @glemaitre, what are the actual times?

Oops, the results are on my other computer. Let me launch the benchmark again. I will add
Thanks for the benchmark. So basically this makes the code slower, but mostly in cases where the new Hist version would be much better anyway.
Maybe we can recommend using HistGradientBoostingXXX in the warning message and in the docstring?
LGTM regardless
Indeed, it would be a nice addition.

Added the recommendation, anything else @NicolasHug @glemaitre?
Minor comments, LGTM
Feel free to merge @adrinjalali
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because it relies on node_impurity, which is not set for SurvivalTree.
Further corrections are necessary to make 'monoensemble' work with current sklearn. The main ones that need your attention:

(1) The "presort" and "X_idx_sorted" sklearn parameters have been deprecated. See, respectively: scikit-learn/scikit-learn#14907 and scikit-learn/scikit-learn#16818. Since I don't know exactly how you prefer to handle that in light of the suggestions in the first link above, in order to at least leave 'monoensemble' in a working state, the only thing I did was to comment out "presort=self.presort" on line 1540 of the file 'mono_gradient_boosting.py'. But a more definitive solution will be necessary, since right now a FutureWarning is issued every iteration due to the "X_idx_sorted" deprecation (which, besides being annoying, means the code will soon break again if "X_idx_sorted" is not eliminated from the code base).

(2) On line 436 of the file 'mono_forest.py', "_generate_unsampled_indices" throws an error because that function now takes an extra parameter, 'n_samples_bootstrap': https://github.com/scikit-learn/scikit-learn/blob/4b8cd880397f279200b8faf9c75df13801cb45b7/sklearn/ensemble/_forest.py#L123 I obviously also do not know your preference here, but given the implementation in that link, it seems safe to assume that thus far your code was operating with the equivalent of 'n_samples_bootstrap = 1'. So that is what I imposed for now on line 436 of 'mono_forest.py'.
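Until the call sites are updated, one temporary way to stop the per-iteration FutureWarning from drowning out other output is the standard-library warnings filter. This is a generic stdlib pattern offered as an assumption about a stopgap, not something 'monoensemble' ships; the `warnings.warn` call at the bottom stands in for a real fit call that still passes the deprecated argument.

```python
import warnings

# Temporary workaround: suppress only the FutureWarning about the
# deprecated X_idx_sorted parameter, leaving all other warnings visible.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message=".*X_idx_sorted.*",  # regex matched against the message
        category=FutureWarning,
    )
    # A fit call that passes X_idx_sorted would go here; this warn() is a
    # stand-in so the snippet is self-contained.
    warnings.warn("X_idx_sorted is deprecated", FutureWarning)
```

Filtering by both message and category keeps the suppression narrow, so unrelated FutureWarnings (e.g. about other deprecations) still surface.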
Related to #14711.

This PR removes presort from the tree codebase and deprecates it wherever it is public facing. It is a first step towards simplifying the tree codebase, and we can do it since presort is only used in the GradientBoosting* classes, which are superseded by HistGradientBoost*.

@glemaitre is working on benchmarks.

Also ping @ogrisel, @NicolasHug
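For readers unfamiliar with how such a deprecation is typically rolled out, here is a minimal sketch of the common sentinel-default pattern: the parameter keeps accepting values for a release cycle, but any explicit value triggers a FutureWarning. The class and messages are hypothetical illustrations, not the actual PR diff.

```python
import warnings


class Estimator:
    # Hypothetical sketch of a parameter deprecation cycle: the default is
    # a sentinel string, so we can tell "user never touched it" apart from
    # "user explicitly passed True/False".
    def __init__(self, presort='deprecated'):
        self.presort = presort

    def fit(self, X, y):
        if self.presort != 'deprecated':
            warnings.warn(
                "The parameter 'presort' is deprecated and has no effect. "
                "It will be removed in a future release. Consider using "
                "HistGradientBoostingClassifier instead.",
                FutureWarning,
            )
        return self


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Estimator(presort=True).fit(None, None)
print(type(caught[0].message).__name__)  # -> FutureWarning
```

Users who leave the parameter at its default see no warning at all, which is what lets the deprecation period pass quietly for most code.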