[WIP] ENH Tree Splitter: 50% performance improvement with radix sort and feature ranks #24239
Conversation
Downsides: we are currently using auxiliary vectors declared in BestSplitter, and we have to hold two whole matrices, X_ranks and X_order. Further strategies to reduce memory usage are welcome.

this branch:
n_samples=1000 with 0.066 +/- 0.017
n_samples=5000 with 0.491 +/- 0.124
n_samples=10000 with 1.276 +/- 0.048
n_samples=20000 with 3.520 +/- 0.416
n_samples=50000 with 12.558 +/- 2.260

sklearn/main:
n_samples=1000 with 0.100 +/- 0.015
n_samples=5000 with 0.972 +/- 0.279
n_samples=10000 with 2.492 +/- 0.188
n_samples=20000 with 7.367 +/- 1.070
n_samples=50000 with 26.034 +/- 3.585
Regarding item 3, I'd be glad to hear opinions and ideas from anyone out there and discuss the feasibility of this PR.

Benchmark script:

```python
from argparse import ArgumentParser
from timeit import repeat

import numpy as np
from scipy.stats import rankdata
import matplotlib.pyplot as plt

argparser = ArgumentParser()
argparser.add_argument('--out', '-o', default='benchmark_results.tsv')
argparser.add_argument('--n-iter', default=8, type=int)
argparser.add_argument('--n-times', default=1, type=int)
argparser.add_argument('--min', default=10, type=int)
argparser.add_argument('--max', default=17, type=int)
argparser.add_argument('--cols', default=500, type=int)
argparser.add_argument('--repeat', '-r', default=20, type=int)
args = argparser.parse_args()

nn = np.logspace(args.min, args.max, args.n_iter, base=2, dtype=int)

outfile = open(args.out, 'w')
outfile.write('n\tnp_mean\tnp_std\tsp_mean\tsp_std\n')

np_means, np_stdevs, sp_means, sp_stdevs = [], [], [], []
for n in nn:
    print('*** n =', n)
    a = np.random.rand(n, args.cols)
    np_res = repeat('np.argsort(np.argsort(a, axis=0), axis=0)',
                    number=args.n_times, repeat=args.repeat, globals=globals())
    sp_res = repeat("rankdata(a, method='ordinal', axis=0)-1",
                    number=args.n_times, repeat=args.repeat, globals=globals())
    np_mean, np_stdev = np.mean(np_res), np.std(np_res)
    sp_mean, sp_stdev = np.mean(sp_res), np.std(sp_res)
    print(f'np.argsort: {np_mean:.4f} ({np_stdev:.4f})')
    print(f'scipy.rankdata: {sp_mean:.4f} ({sp_stdev:.4f})')
    outfile.write(f'{n}\t{np_mean}\t{np_stdev}\t{sp_mean}\t{sp_stdev}\n')
    np_means.append(np_mean)
    np_stdevs.append(np_stdev)
    sp_means.append(sp_mean)
    sp_stdevs.append(sp_stdev)
outfile.close()

# nn, np_means, np_stdevs, sp_means, sp_stdevs = np.loadtxt(args.out, skiprows=1).T
plt.figure(figsize=(10, 3))
plt.errorbar(nn, np_means, fmt='-o', yerr=np_stdevs, label='numpy.argsort')
plt.errorbar(nn, sp_means, fmt='--o', yerr=sp_stdevs, label='scipy.stats.rankdata')
plt.xlabel('Number of samples')
plt.ylabel('Duration (s)')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.grid()
plt.savefig(args.out + '.png')
```
Based on scipy's implementation of ranking: https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/stats/_stats_py.py#L9128-L9134
New implementation. New benchmarks: argsort is the previous version using
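The two rank computations compared in these benchmarks can be sketched side by side. This is a minimal check (variable names are illustrative) showing that applying `np.argsort` twice yields the same 0-based ordinal ranks as a single `scipy.stats.rankdata` call when the values are distinct:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # toy feature matrix: 200 samples, 5 features

# Double argsort: the first argsort gives the sorting permutation, the
# second argsort inverts it, yielding each element's rank in its column.
ranks_np = np.argsort(np.argsort(X, axis=0), axis=0)

# rankdata with method='ordinal' assigns distinct ranks 1..n per column;
# subtract 1 to get 0-based ranks comparable to the argsort version.
ranks_sp = rankdata(X, method='ordinal', axis=0) - 1

assert np.array_equal(ranks_np, ranks_sp)
```

With random floats, ties occur with probability zero, so the two methods agree exactly; `rankdata` only has to sort once, which is where the saving comes from.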
This looks like an interesting idea and not too much code/new complexity. What would be the next steps to move this forward? I think, as pointed out, the extra memory consumption is worth thinking about. Can we measure this for the benchmarks you performed to get a feeling for the size of the increase? I think I'd defer the work for sparse data. WDYT?
Thanks for your input, @betatim. I agree that precomputed ranks and a sparse version can wait for a future PR. I see that passing

Additionally, the radix sort implementation seems good enough to me, but it still may benefit from someone more experienced checking it out. Maybe an in-place version to drop the auxiliary arrays?

But I think the only point really preventing us from going on is the increased memory load. As suggested, I ran a similar benchmark (below) for us to get a better idea of the impact. The increase was lower than I expected given the arrays' relative sizes and that

A safe way forward is to rethink this PR as an optional change. I noticed that

Anyway, everything depends on how much of a problem the extra memory consumption would be, and I would appreciate hearing some more thoughts on that before going on and making a separate

Please also let me know if there are any further analyses and considerations I can help with.

Memory usage comparison (MB)

Memory benchmark script:

```python
import argparse
import json
from collections import defaultdict
from statistics import mean, stdev

from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

N_SAMPLES = [1_000, 5_000, 10_000, 20_000, 50_000]
N_REPEATS = 20

parser = argparse.ArgumentParser()
parser.add_argument("results")
args = parser.parse_args()

results = defaultdict(list)

def train_tree(X, y, random_state):
    tree = DecisionTreeClassifier(random_state=random_state)
    tree.fit(X, y)

for n_samples in N_SAMPLES:
    for n_repeat in range(N_REPEATS):
        X, y = make_classification(
            random_state=n_repeat, n_samples=n_samples, n_features=100
        )
        memusage = memory_usage((train_tree, (X, y, n_repeat), {}))
        results[n_samples].append(max(memusage))
    results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
    print(f"n_samples={n_samples} with {results_mean:.3f} +/- {results_stdev:.3f}")

with open(args.results, "w") as out:
    json.dump(results, out)
```
When I run your script I get the following results for the current main branch:
and these numbers when I run it for this PR:
I had to make one change to the script, but I think that it doesn't change the results:

```python
import argparse
import json
from collections import defaultdict
from statistics import mean, stdev

from memory_profiler import memory_usage
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

N_SAMPLES = [1_000, 5_000, 10_000, 20_000, 50_000]
N_REPEATS = 20

parser = argparse.ArgumentParser()
parser.add_argument("results")
args = parser.parse_args()

results = defaultdict(list)

def train_tree(X, y, random_state):
    tree = DecisionTreeClassifier(random_state=random_state)
    tree.fit(X, y)

if __name__ == "__main__":
    for n_samples in N_SAMPLES:
        for n_repeat in range(N_REPEATS):
            X, y = make_classification(
                random_state=n_repeat, n_samples=n_samples, n_features=100
            )
            memusage = memory_usage((train_tree, (X, y, n_repeat), {}))
            results[n_samples].append(max(memusage))
        results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
        print(f"n_samples={n_samples} with {results_mean:.3f} +/- {results_stdev:.3f}")

    with open(args.results, "w") as out:
        json.dump(results, out)
```

I also ran the script using
and on this PR the script has been running on the first value of

I think something to investigate is what happens if you use the decision trees in something like a random forest (with different values of

The overall question is: how much extra memory are you allowed to use for a given speed-up? I don't know the answer to that. Maybe we need a core dev to help with that.
I've modified the memory benchmarking script to use filprofiler, in order to track allocations instead of resident memory. Results are more consistent, showing the expected 3-fold increase in memory usage (we are additionally storing an

We should be able to get a 2-fold instead of 3-fold increase by simply using

Instructions

Set the PR and 1.1 environments (named
Notice that additional command-line options are possible, for example:
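For intuition on the 2-fold vs 3-fold figures discussed in this thread, here is a back-of-the-envelope sketch. The shapes and dtypes are my assumptions for illustration (scikit-learn trees store `X` as float32, and I assume same-shaped int32 rank and order matrices), not necessarily what this PR actually allocates:

```python
import numpy as np

n_samples, n_features = 10_000, 100

# Hypothetical arrays mirroring the discussion: the feature matrix plus
# same-shaped rank and order matrices, all with a 4-byte itemsize.
X = np.zeros((n_samples, n_features), dtype=np.float32)  # feature values
X_ranks = np.zeros(X.shape, dtype=np.int32)              # per-column ranks
X_order = np.zeros(X.shape, dtype=np.int32)              # sorting permutation

base = X.nbytes
with_both = X.nbytes + X_ranks.nbytes + X_order.nbytes
with_ranks_only = X.nbytes + X_ranks.nbytes

print(with_both / base)        # 3.0 -> 3-fold when both auxiliary matrices are kept
print(with_ranks_only / base)  # 2.0 -> 2-fold if X_order could be dropped
```

Under these assumptions the 3-fold allocation figure falls out directly: three equally sized matrices instead of one, and dropping one auxiliary matrix recovers the 2-fold figure.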
I'm closing this, as the PR seems stalled and a heavy performance regression is unresolved.
Since sorting on subsets of `X`'s columns is performed multiple times when searching for the best split, we can achieve around a 50% reduction in split search duration by computing the ranks of each feature and using them for future sorts instead of the actual feature values in `Xf`, given that integer sorting algorithms such as radix sort can run in linear time.

Benchmarks (script from #22868)

Test durations in seconds comparing this PR to sklearn 1.1:

Pending discussion topics

- A `FasterBestSplitter` class could be created, allowing users to opt for the faster version using the `splitter` parameter of decision tree-based estimators. I haven't found a good way to implement this, though, avoiding copy-pasting the whole `BestSplitter.node_split()`. A `BestSplitter._sort()` method maybe could be created, but wrapping the two sorting procedures (radix and the current introsort) in the same method seems a little complicated;
- Passing a precomputed `X_ranks` array to the Splitter to avoid redundant calculation in ensembles or cross-validation;
- ~~In this implementation, I used `numpy.argsort` twice to compute the ranks. Is there any better option? Numpy already uses radix sort for integer types;~~ Found a way of sorting once, based on Scipy's `rankdata`. I don't see much room for improvement, but ideas are always appreciated;
- ~~In the first tree node, when the ranks are sorted, the results would be the sorted indices (output from `np.argsort`). Since these sorted indices would naturally be obtained when determining the ranks, is it possible to reuse them in the root node?~~ I imagine this would require storing not only the ranks but also the sorted indices, which adds a lot of memory load for very small gains. An alternative could be calculating the ranks in the root node, but I don't see any intuitive way of doing so;
- Some algorithms claim to be able to perform in-place sorting with similar time complexity. I did not explore those;
- ~~Also implement a sparse splitter version.~~ Delegated to future PR.

The branch passes all `tree/tests/test_tree.py` tests.
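As background on why ranks enable this speedup: once a feature column is reduced to integer ranks in 0..n-1, a radix sort can order any subset of it in time linear in the subset size, which comparison sorts cannot. Below is a minimal NumPy sketch of an LSD (least-significant-digit) radix sort over byte digits; it is illustrative only, not the splitter code from this PR:

```python
import numpy as np

def radix_sort(a):
    """LSD radix sort for non-negative integers, one byte per pass.

    Each pass is a stable counting sort on one byte, so total work is
    O(n_passes * (n + 256)) -- linear in n for fixed-width keys such as
    feature ranks in 0..n-1.
    """
    a = np.asarray(a, dtype=np.int64)
    if a.size == 0:
        return a
    n_passes = max(1, (int(a.max()).bit_length() + 7) // 8)
    for p in range(n_passes):
        digits = (a >> (8 * p)) & 0xFF
        counts = np.bincount(digits, minlength=256)          # histogram of digits
        starts = np.concatenate(([0], np.cumsum(counts)[:-1]))  # bucket offsets
        out = np.empty_like(a)
        for value, d in zip(a, digits):  # stable placement keeps earlier passes' order
            out[starts[d]] = value
            starts[d] += 1
        a = out
    return a

ranks = np.array([5, 300, 2, 1024, 7, 0, 300])
print(radix_sort(ranks).tolist())  # [0, 2, 5, 7, 300, 300, 1024]
```

Stability of each counting-sort pass is what makes the multi-pass scheme correct: after sorting on a higher byte, ties within that byte keep the order established by the lower-byte passes.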