
ENH Use simultaneous sort in tree splitter #22868

Merged: 7 commits merged into scikit-learn:main on Mar 21, 2022

Conversation

@thomasjpfan (Member) commented Mar 16, 2022

This PR replaces the use of sort in the tree splitter with simultaneous_sort. I ran the following benchmark script:

Benchmark
import argparse
from time import perf_counter
import json
from statistics import mean, stdev

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import defaultdict


N_SAMPLES = [1_000, 5_000, 10_000, 20_000, 50_000]
N_REPEATS = 20

parser = argparse.ArgumentParser()
parser.add_argument("results", type=argparse.FileType("w"))
args = parser.parse_args()

results = defaultdict(list)

for n_samples in N_SAMPLES:
    for n_repeat in range(N_REPEATS):
        X, y = make_classification(
            random_state=n_repeat, n_samples=n_samples, n_features=100
        )
        tree = DecisionTreeClassifier(random_state=n_repeat)
        start = perf_counter()
        tree.fit(X, y)
        duration = perf_counter() - start
        results[n_samples].append(duration)
    results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
    print(f"n_samples={n_samples} with {results_mean:.3f} +/- {results_stdev:.3f}")

json.dump(results, args.results)
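
For reference, the script dumps the per-size fit times to the JSON file given on the command line. A small follow-up snippet (the filenames and this comparison are my own illustration, not part of the PR) can compare a run on main against a run on this branch:

import json
from statistics import mean

# Filenames are hypothetical: one results file per branch.
with open("results_main.json") as f:
    main_results = json.load(f)
with open("results_pr.json") as f:
    pr_results = json.load(f)

for n_samples, timings in main_results.items():  # JSON keys are strings
    speedup = mean(timings) / mean(pr_results[n_samples])
    print(f"n_samples={n_samples}: {speedup:.2f}x speedup")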

Decision Tree Benchmarks

I see performance benefits for this PR compared to main:

[figure: tree_sort_compare]

RandomForest Benchmarks

[figure: forst_tree_sort_compare]

GradientBoosting Benchmarks

n_features=20, and fewer samples since the overall runtime is longer:

[figure: gb_tree_sort_compare]

CC @jjerphan

@jjerphan (Member) left a comment

Thank you, @thomasjpfan!

Better performance while removing a custom implementation: everything maintainers love!

Before clicking "✅ Approve": do simultaneous_sort, introsort, and heap_sort share the same stability?


Edit: I think addressing #22827 for sklearn.tree (and probably sklearn.ensemble as well) would be preferable before any further refactoring of this module. What do you think?

@lorentzenchr (Member) left a comment

@thomasjpfan already got a maintainability award in #22365, and most of the work here was prepared by @jjerphan, so the jury needs to think about what to do.

@thomasjpfan (Member, Author) commented Mar 18, 2022

Before clicking "✅ Approve": do simultaneous_sort, introsort, and heap_sort share the same stability?

They do not share the same stability: the two sorts order tied values differently. I ran an experiment with a small Python interface:

# Cython test harness exposing both sorts to Python; assumed to be compiled
# inside sklearn/tree/_splitter.pyx, where the internal `sort` is defined.
# The cimport locations below are assumptions.
from cython cimport floating
from sklearn.tree._tree cimport DTYPE_t, SIZE_t
from sklearn.utils._typedefs cimport ITYPE_t
from sklearn.utils._sorting cimport simultaneous_sort  # assumed location

# tree sort
def py_sort(DTYPE_t[:] Xf, SIZE_t[:] samples):
    cdef SIZE_t n = Xf.shape[0]
    sort(&Xf[0], &samples[0], n)

# simultaneous sort
def py_simultaneous_sort(floating[:] values, ITYPE_t[:] indices):
    cdef ITYPE_t n = values.shape[0]
    simultaneous_sort(&values[0], &indices[0], n)

with:

import numpy as np

values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
indices = np.arange(5)

# tree sort result indices
array([4, 1, 2, 3, 0])

# simultaneous sort result indices
array([2, 4, 1, 3, 0])

Both implementations are unstable: the sorted values are the same, but the resulting indices differ when there are ties. I also observe that simultaneous_sort is roughly 15% faster across various input sizes.
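
For comparison, a stable sort keeps tied values in their original relative order. A quick NumPy check on the same input (my own illustration, not part of the experiment above):

import numpy as np

values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
# kind="stable" preserves the original order of tied entries
print(np.argsort(values, kind="stable"))
# -> [1 2 4 3 0]: the three -0.5 values keep indices 1, 2, 4 in order

Neither result above matches this stable order, which confirms both sorts are unstable.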

I updated the original message with benchmarks using other tree-based models and observe similar performance improvements.

Since a model trained on the same data may differ because of this PR, I added an entry to the "Changed models" section here: e9ba146
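
As an aside, here is a minimal sketch (the helper name and the fields compared are my own choice) of how one could check whether two fitted trees are structurally identical, e.g. a model trained before and after this PR:

import numpy as np

def trees_equal(a, b):
    """Compare two fitted DecisionTree* estimators node by node."""
    ta, tb = a.tree_, b.tree_
    return (
        ta.node_count == tb.node_count
        and np.array_equal(ta.feature, tb.feature)
        and np.allclose(ta.threshold, tb.threshold)
    )

On data with tied feature values, this check may fail across the two versions even though both trees are valid fits.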

I think addressing #22827 for sklearn.tree (and probably sklearn.ensemble as well) would be preferable before any further refactoring of this module. What do you think?

It would be nice to have it before this PR, but I do not see it as a strict requirement.

@jjerphan (Member) left a comment

LGTM. Thank you, @thomasjpfan

It would be nice to have it before this PR, but I do not see it as a strict requirement.

I agree now, since you have provided experiments regarding the two algorithms' stability.

Should we wait for another maintainer to merge this PR, given the model changes?

@ogrisel (Member) left a comment

Since both the old and the new algorithms are unstable, I think this is good to go.

@lorentzenchr changed the title from "MAINT Use simultaneous sort in tree splitter" to "ENH Use simultaneous sort in tree splitter" on Mar 21, 2022
@lorentzenchr merged commit 4cf932d into scikit-learn:main on Mar 21, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request May 18, 2022
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request May 18, 2022