
ENH Use simultaneous sort in tree splitter #22868

Merged: 7 commits merged into scikit-learn:main on Mar 21, 2022

Conversation

@thomasjpfan (Member) commented Mar 16, 2022

This PR replaces the use of sort in the tree splitter with simultaneous_sort. I ran the following benchmark script:

Benchmark
import argparse
from time import perf_counter
import json
from statistics import mean, stdev

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import defaultdict


N_SAMPLES = [1_000, 5_000, 10_000, 20_000, 50_000]
N_REPEATS = 20

parser = argparse.ArgumentParser()
parser.add_argument("results", type=argparse.FileType("w"))
args = parser.parse_args()

results = defaultdict(list)

for n_samples in N_SAMPLES:
    for n_repeat in range(N_REPEATS):
        X, y = make_classification(
            random_state=n_repeat, n_samples=n_samples, n_features=100
        )
        tree = DecisionTreeClassifier(random_state=n_repeat)
        start = perf_counter()
        tree.fit(X, y)
        duration = perf_counter() - start
        results[n_samples].append(duration)
    results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
    print(f"n_samples={n_samples} with {results_mean:.3f} +/- {results_stdev:.3f}")

json.dump(results, args.results)
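
For reference, the script dumps the per-size fit times to the JSON file given on the command line. A small follow-up snippet (the filenames and this comparison are my own illustration, not part of the PR) can compare a run on main against a run on this branch:

import json
from statistics import mean

# Filenames are hypothetical: one results file per branch.
with open("results_main.json") as f:
    main_results = json.load(f)
with open("results_pr.json") as f:
    pr_results = json.load(f)

for n_samples, timings in main_results.items():  # JSON keys are strings
    speedup = mean(timings) / mean(pr_results[n_samples])
    print(f"n_samples={n_samples}: {speedup:.2f}x speedup")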

Decision Tree Benchmarks

I see performance benefits for this PR compared to main:

[figure: tree_sort_compare]

RandomForest Benchmarks

[figure: forst_tree_sort_compare]

GradientBoosting Benchmarks

n_features=20, and fewer samples since the overall runtime is longer:

[figure: gb_tree_sort_compare]

CC @jjerphan

@jjerphan (Member) left a comment

Thank you, @thomasjpfan!

Better performance while removing a custom implementation: everything maintainers love!

Before clicking "✅ Approve": do simultaneous_sort, introsort, and heap_sort share the same stability?


Edit: I think addressing #22827 for sklearn.tree (and probably sklearn.ensemble as well) would be preferable before any further refactoring of this module. What do you think?

@lorentzenchr (Member) left a comment

@thomasjpfan already got a maintainability award in #22365, and most of the work here was prepared by @jjerphan, so the jury needs to think about what to do.

@thomasjpfan (Member, Author) commented Mar 18, 2022

Before clicking "✅ Approve": do simultaneous_sort, introsort, and heap_sort share the same stability?

They do not share the same stability: the two sorts order tied values differently. I ran an experiment with a small Python interface:

# Cython test harness exposing both sorts to Python; assumed to be compiled
# inside sklearn/tree/_splitter.pyx, where the internal `sort` is defined.
# The cimport locations below are assumptions.
from cython cimport floating
from sklearn.tree._tree cimport DTYPE_t, SIZE_t
from sklearn.utils._typedefs cimport ITYPE_t
from sklearn.utils._sorting cimport simultaneous_sort  # assumed location

# tree sort
def py_sort(DTYPE_t[:] Xf, SIZE_t[:] samples):
    cdef SIZE_t n = Xf.shape[0]
    sort(&Xf[0], &samples[0], n)

# simultaneous sort
def py_simultaneous_sort(floating[:] values, ITYPE_t[:] indices):
    cdef ITYPE_t n = values.shape[0]
    simultaneous_sort(&values[0], &indices[0], n)

with:

import numpy as np

values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
indices = np.arange(5)

# tree sort result indices
array([4, 1, 2, 3, 0])

# simultaneous sort result indices
array([2, 4, 1, 3, 0])

Both implementations are unstable: the sorted values are the same, but the resulting indices differ when there are ties. I also observe that simultaneous_sort is roughly 15% faster across various input sizes.
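
For comparison, a stable sort keeps tied values in their original relative order. A quick NumPy check on the same input (my own illustration, not part of the experiment above):

import numpy as np

values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
# kind="stable" preserves the original order of tied entries
print(np.argsort(values, kind="stable"))
# -> [1 2 4 3 0]: the three -0.5 values keep indices 1, 2, 4 in order

Neither result above matches this stable order, which confirms both sorts are unstable.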

I updated the original message with benchmarks using other tree-based models and observe similar performance improvements.

Since a model trained on the same data may differ because of this PR, I added an entry to the "Changed models" section here: e9ba146
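
As an aside, here is a minimal sketch (the helper name and the fields compared are my own choice) of how one could check whether two fitted trees are structurally identical, e.g. a model trained before and after this PR:

import numpy as np

def trees_equal(a, b):
    """Compare two fitted DecisionTree* estimators node by node."""
    ta, tb = a.tree_, b.tree_
    return (
        ta.node_count == tb.node_count
        and np.array_equal(ta.feature, tb.feature)
        and np.allclose(ta.threshold, tb.threshold)
    )

On data with tied feature values, this check may fail across the two versions even though both trees are valid fits.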

I think addressing #22827 for sklearn.tree (and probably sklearn.ensemble as well) would be preferable before any further refactoring of this module. What do you think?

It would be nice to have it before this PR, but I do not see it as a strict requirement.

@jjerphan (Member) left a comment

LGTM. Thank you, @thomasjpfan

It would be nice to have it before this PR, but I do not see it as a strict requirement.

I agree now, since you have provided experiments regarding the two algorithms' stability.

Should we wait for another maintainer to merge this PR, given the model changes?

@ogrisel (Member) left a comment

Since both the old and the new algorithms are unstable, I think this is good to go.

@lorentzenchr changed the title from "MAINT Use simultaneous sort in tree splitter" to "ENH Use simultaneous sort in tree splitter" on Mar 21, 2022
@lorentzenchr merged commit 4cf932d into scikit-learn:main on Mar 21, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request May 18, 2022
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request May 18, 2022