FIX fix performance regression in trees with low-cardinality features #23410

lesteve · 2022-05-18T15:28:57Z

A more conservative alternative to #23404. This reverts #22868 and fixes the conflicts.

With the following benchmark script, I get a similar performance in this PR and in 1.0.2:

n_samples=50000, n_features=10: 0.087 +/- 0.002

# /tmp/test.py
import numpy as np
from time import perf_counter

from statistics import mean, stdev
from collections import defaultdict

from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 10

tree = DecisionTreeClassifier()

N_REPEATS = 5
results = defaultdict(list)


def make_data(random_state):
    rng = np.random.RandomState(random_state)
    X = rng.choice([0, 1, 2], size=(n_samples, n_features))
    y = rng.choice([0, 1], size=n_samples)
    return X, y


for n_repeat in range(N_REPEATS):
    X, y = make_data(n_repeat)
    tree = DecisionTreeClassifier(random_state=n_repeat)
    start = perf_counter()
    tree.fit(X, y)
    duration = perf_counter() - start
    results[n_samples].append(duration)

results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
print(
    f"n_samples={n_samples}, n_features={n_features}: {results_mean:.3f} +/- {results_stdev:.3f}"
)

ogrisel

+1 for merging this a regression fix and explore in a later PR to main how to consolidate the 2 introsort/quicksort implementation with more extensive benchmarking for various data distributions.

ogrisel · 2022-05-18T16:21:59Z

Actually I spoke too quickly, we now have:

IndexError: Out of bounds on buffer access (axis 0)

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42268&view=logs&j=c8afde5f-ef70-5983-62e8-c6b665ad6161&t=7fc5d80c-d67c-560e-65a2-6822419a270a

that was previously fixed in main.

ogrisel · 2022-05-18T16:22:34Z

sklearn/tree/_splitter.pyx

-                simultaneous_sort(&Xf[start_positive], &samples[start_positive], end - start_positive)
+            sort(&Xf[start], &samples[start], end_negative - start)
+            sort(&Xf[start_positive], &samples[start_positive],
+                 end - start_positive)


We might need to re-introduce the if start_positive < end: condition here.

ogrisel · 2022-05-18T16:23:10Z

We also need to document this fix in the 1.1.1 changelog similar to https://github.com/scikit-learn/scikit-learn/pull/23404/files#diff-5a8a66bfe0792b4a0e50eb7c8cc929f2a8e0d4d88ac2d0084e07b70d069ced89R39

sklearn/tree/_splitter.pyx

glemaitre · 2022-05-18T16:28:48Z

how to consolidate the 2 introsort/quicksort implementation with more extensive benchmarking for various data distributions.

I actually looked at the std::sort that should be an introsort and available in Cython.
However, since we want to do a simultaneous sort the indices and from values, it makes it quite hard to use. Indeed, you want a comparator that needs to rely on a closure (that is in a Cython class furthermore). I could find the following information that shows the pattern: https://stackoverflow.com/a/63800579/6513708

glemaitre

+1 once the bug is fixed and we acknowledge the bug in what's new.

ogrisel · 2022-05-18T16:53:16Z

Indeed, you want a comparator that needs to rely on a closure (that is in a Cython class furthermore). I could find the following information that shows the pattern: https://stackoverflow.com/a/63800579/6513708

Nice trick! +1 for a follow-up PR for 1.2.

thomasjpfan · 2022-05-18T17:45:09Z

To use the comparator, we would need to represent the values and indices as an "array of structures". Currently, it is an "structure of arrays".

As for this PR, I am +1 on it as well. Backporting this fix should be easier, since the version on 1.1.X still uses pointers.

glemaitre · 2022-05-18T19:02:49Z

To use the comparator, we would need to represent the values and indices as an "array of structures". Currently, it is an "structure of arrays".

Here we only have an array of values (feature values) and an array of indices (sample indices) to sort, isn't it?

thomasjpfan · 2022-05-18T19:14:26Z

Here we only have an array of values (feature values) and an array of indices (sample indices) to sort, isn't it?

This representation is like a "structure of arrays". Currently, the code is similar to this:

struct MyArrays:
    double* values_array
    int* indices_array

and dual_swap has the index to follow the value.

To use a comparator, the values and indices need to be it's own structure:

struct ValueIndex:
    double value
    int indices

ValueIndex* array_to_sort

This way the index will follow the value when value gets sorted. (The comparator will only consider value for sorting purposes)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

thomasjpfan

+1 once this has a change log entry under 1.1.1. (As suggested in #23410 (review))

Backporting will be slightly more involved because the 1.1.X branch still users pointers for Xf, but it should not be too difficult.

sklearn/tree/_splitter.pyx

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

ogrisel · 2022-05-19T07:42:14Z

@thomasjpfan the code snippet in https://stackoverflow.com/a/63800579/6513708 shows how to do it directly with a structure of arrays without materializing it as an array of index-value pairs, no? Although I am not sure how it works.

EDIT: actually this solution would not inplace-sort the values, only the indices. We would need an extra step to re-shuffle the values based on the sorted indices which would require a temporary copy. Not necessarily a big deal but we will need to keep that temp buffer around to avoid reallocating it each time. Not sure it's worth it and I am not sure about the CPU cache impact of this extra shuffle w.r.t. the existing inplace implementation.

…scikit-learn#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

…#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

…scikit-learn#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

…#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Revert use of simultaneous_sort

1622f73

github-actions bot added cython module:tree labels May 18, 2022

lesteve mentioned this pull request May 18, 2022

FIX Fixes performance regression in trees #23404

Closed

lesteve changed the title ~~Revert use of simultaneous_sort~~ Revert use of simultaneous_sort in trees May 18, 2022

ogrisel approved these changes May 18, 2022

View reviewed changes

ogrisel reviewed May 18, 2022

View reviewed changes

glemaitre reviewed May 18, 2022

View reviewed changes

sklearn/tree/_splitter.pyx Outdated Show resolved Hide resolved

glemaitre approved these changes May 18, 2022

View reviewed changes

thomasjpfan added this to the 1.1.1 milestone May 18, 2022

thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label May 18, 2022

Update sklearn/tree/_splitter.pyx

6af149a

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

thomasjpfan approved these changes May 19, 2022

View reviewed changes

sklearn/tree/_splitter.pyx Outdated Show resolved Hide resolved

lesteve and others added 2 commits May 19, 2022 08:41

Update sklearn/tree/_splitter.pyx

7e78c02

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Add changelog entry

d8b5343

lesteve changed the title ~~Revert use of simultaneous_sort in trees~~ FIX fix performance regression in trees for low-cardinality features May 19, 2022

lesteve changed the title ~~FIX fix performance regression in trees for low-cardinality features~~ FIX fix performance regression in trees with low-cardinality features May 19, 2022

ogrisel merged commit c217527 into scikit-learn:main May 19, 2022

lesteve deleted the revert-simultaneous-sort branch May 19, 2022 08:13

glemaitre mentioned this pull request May 19, 2022

DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397

Closed

glemaitre added a commit that referenced this pull request May 19, 2022

FIX fix performance regression in trees with low-cardinality features (…

0dfcaee

…#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

glemaitre added a commit that referenced this pull request Aug 5, 2022

FIX fix performance regression in trees with low-cardinality features (…

84dbab2

…#23410) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX fix performance regression in trees with low-cardinality features #23410

FIX fix performance regression in trees with low-cardinality features #23410

lesteve commented May 18, 2022 •

edited

Loading

ogrisel left a comment

ogrisel commented May 18, 2022 •

edited

Loading

ogrisel May 18, 2022 •

edited

Loading

ogrisel commented May 18, 2022

glemaitre commented May 18, 2022 •

edited

Loading

glemaitre left a comment

ogrisel commented May 18, 2022

thomasjpfan commented May 18, 2022 •

edited

Loading

glemaitre commented May 18, 2022 •

edited

Loading

thomasjpfan commented May 18, 2022 •

edited

Loading

thomasjpfan left a comment •

edited

Loading

ogrisel commented May 19, 2022 •

edited

Loading

FIX fix performance regression in trees with low-cardinality features #23410

FIX fix performance regression in trees with low-cardinality features #23410

Conversation

lesteve commented May 18, 2022 • edited Loading

ogrisel left a comment

Choose a reason for hiding this comment

ogrisel commented May 18, 2022 • edited Loading

ogrisel May 18, 2022 • edited Loading

Choose a reason for hiding this comment

ogrisel commented May 18, 2022

glemaitre commented May 18, 2022 • edited Loading

glemaitre left a comment

Choose a reason for hiding this comment

ogrisel commented May 18, 2022

thomasjpfan commented May 18, 2022 • edited Loading

glemaitre commented May 18, 2022 • edited Loading

thomasjpfan commented May 18, 2022 • edited Loading

thomasjpfan left a comment • edited Loading

Choose a reason for hiding this comment

ogrisel commented May 19, 2022 • edited Loading

lesteve commented May 18, 2022 •

edited

Loading

ogrisel commented May 18, 2022 •

edited

Loading

ogrisel May 18, 2022 •

edited

Loading

glemaitre commented May 18, 2022 •

edited

Loading

thomasjpfan commented May 18, 2022 •

edited

Loading

glemaitre commented May 18, 2022 •

edited

Loading

thomasjpfan commented May 18, 2022 •

edited

Loading

thomasjpfan left a comment •

edited

Loading

ogrisel commented May 19, 2022 •

edited

Loading