ENH: use the sparse-sparse backend for computing pairwise distance #28191

Open · glemaitre opened this issue Jan 19, 2024 · 7 comments

@glemaitre (Member) commented Jan 19, 2024

First reported in: scikit-learn-contrib/imbalanced-learn#1056

We have a performance regression in kneighbors with sparse matrices from 1.1.X to 1.3.X.
A code sample to reproduce:

# %%
import sklearn
sklearn.__version__

# %%
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, 10_000
X = sparse.random(n_samples, n_features, density=0.01, format="csr", random_state=0)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=n_samples)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# %%
%%timeit
knn.kneighbors(X, return_distance=False)

1.1.X: 21.5 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

main: 1.16 s ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Small benchmark

(figure: elapsed time vs. n_features for versions 1.1.3 and 1.5.dev0)

# %%
import sklearn
sklearn.__version__

# %%
import time
from collections import defaultdict
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, [500, 1_000, 2_500, 5_000, 7_500, 10_000, 25_000, 50_000]

results = defaultdict(list)
for nf in n_features:
    X = sparse.random(n_samples, nf, density=0.01, format="csr", random_state=0)
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=n_samples)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    start = time.time()
    knn.kneighbors(X, return_distance=False)
    elapsed_time = time.time() - start
    results["version"].append(sklearn.__version__)
    results["n_features"].append(nf)
    results["elapsed_time"].append(elapsed_time)

# %%
import pandas as pd

results = pd.DataFrame(results)
results.to_csv(f"bench_{sklearn.__version__}.csv", index=False)
# %%
import pandas as pd

results_main = pd.read_csv("bench_1.5.dev0.csv")
results_1_1 = pd.read_csv("bench_1.1.3.csv")
results = pd.concat([results_main, results_1_1], axis=0)
results

# %%
import seaborn as sns

sns.set_context("talk")

# %%
sns.relplot(
    data=results,
    x="n_features",
    y="elapsed_time",
    hue="version",
    kind="line",
    height=6,
    aspect=1.5,
    legend=True,
)
sns.despine(offset=10, trim=True)

Debug profiling

From my first investigation, it looks like we are spending all our time in the following function:

cdef void _middle_term_sparse_sparse_64(
    const float64_t[:] X_data,
    const int32_t[:] X_indices,
    const int32_t[:] X_indptr,
    intp_t X_start,
    intp_t X_end,
    const float64_t[:] Y_data,
    const int32_t[:] Y_indices,
    const int32_t[:] Y_indptr,
    intp_t Y_start,
    intp_t Y_end,
    float64_t * D,
) noexcept nogil:
    # This routine assumes that D points to the first element of a
    # zeroed buffer of length at least equal to n_X × n_Y, conceptually
    # representing a 2-d C-ordered array.
    cdef:
        intp_t i, j, k
        intp_t n_X = X_end - X_start
        intp_t n_Y = Y_end - Y_start
        intp_t x_col, x_ptr, y_col, y_ptr

    for i in range(n_X):
        for x_ptr in range(X_indptr[X_start+i], X_indptr[X_start+i+1]):
            x_col = X_indices[x_ptr]
            for j in range(n_Y):
                k = i * n_Y + j
                for y_ptr in range(Y_indptr[Y_start+j], Y_indptr[Y_start+j+1]):
                    y_col = Y_indices[y_ptr]
                    if x_col == y_col:
                        D[k] += -2 * X_data[x_ptr] * Y_data[y_ptr]
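
For scale: this kernel compares every stored value of a row of X with every stored value of every row of Y, so its cost grows as n_X × nnz_per_X_row × n_Y × nnz_per_Y_row. With the reproducer's shapes (1 000 samples and about 100 nonzeros per row on each side), that is on the order of 10^10 innermost iterations. Since CSR column indices are sorted, each (i, j) pair could instead be resolved with a two-pointer merge in O(nnz_x + nnz_y). A pure-Python illustration of that idea, not the library code:

import numpy as np
from scipy import sparse

def middle_term_merge(X, Y):
    # Compute -2 * X @ Y.T pair by pair, merging the sorted index lists.
    X.sort_indices()  # the merge below requires sorted column indices
    Y.sort_indices()
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        x_start, x_end = X.indptr[i], X.indptr[i + 1]
        for j in range(Y.shape[0]):
            x_ptr, y_ptr = x_start, Y.indptr[j]
            y_end = Y.indptr[j + 1]
            acc = 0.0
            while x_ptr < x_end and y_ptr < y_end:
                if X.indices[x_ptr] == Y.indices[y_ptr]:
                    acc += X.data[x_ptr] * Y.data[y_ptr]
                    x_ptr += 1
                    y_ptr += 1
                elif X.indices[x_ptr] < Y.indices[y_ptr]:
                    x_ptr += 1
                else:
                    y_ptr += 1
            D[i, j] = -2.0 * acc
    return D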

But I'm a bit rusty with profiling native code. I need a bit more time to investigate.
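
One option for profiling through the compiled code is py-spy with native stack sampling (a tooling suggestion, not necessarily what was used in this thread); reproducer.py is a hypothetical name for a file holding the snippet above:

py-spy record --native -o profile.svg -- python reproducer.py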

@glemaitre (Member Author)

I'm under the impression that a large number of features was not taken into account when benchmarking in #24556.

@jjerphan (Member)

I confirm that nearly all the time (>99%) is spent in _middle_term_sparse_sparse_*.

@glemaitre (Member Author)

FYI, I'm running a larger benchmark over different n_features, n_samples, and densities of the sparse matrix, to see whether we should keep a single backend or switch in a smart manner to the previous dispatching mechanism.

@glemaitre (Member Author)

I made an additional benchmark to check the influence of the density of the matrix:

(figure: elapsed time as a function of the matrix density)

Density is therefore another important variable for deciding when to make the switch.
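
A minimal sketch of such a density benchmark (the actual grid of densities behind the figure is not given here, so the values below are illustrative):

import time
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, 10_000
for density in [0.001, 0.005, 0.01, 0.05, 0.1]:
    X = sparse.random(n_samples, n_features, density=density, format="csr", random_state=0)
    y = np.random.default_rng(0).integers(0, 2, size=n_samples)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    start = time.time()
    knn.kneighbors(X, return_distance=False)
    print(f"density={density}: {time.time() - start:.3f} s")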

@adrinjalali (Member)

How likely is it that this gets fixed pretty quickly? I wonder how realistic it is to include it in 1.4.1.

@jjerphan (Member)

A quick fix is to change the definition of is_usable_for to exclude this case.
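
Something along these lines, as a hypothetical sketch (the real check is the is_usable_for classmethod of the dispatcher in sklearn.metrics._pairwise_distances_reduction, which takes more arguments and performs more checks):

from scipy.sparse import issparse

def is_usable_for(X, Y, metric):
    # Hypothetical simplification: refuse the sparse-sparse case so that
    # the dispatcher falls back to the previous (SciPy-based) code path.
    if issparse(X) and issparse(Y):
        return False
    # ... the existing dtype / metric / shape checks would remain here ...
    return True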

@jjerphan (Member)

csr_matmat from SciPy was called before and likely has a better algorithm.
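
For context, the previous code path reduces to the expansion ||x − y||² = ||x||² − 2 x·y + ||y||², where the −2 x·y middle term is a sparse-sparse matrix product handled by SciPy (csr_matmat under the hood). A rough sketch of that computation, not the exact scikit-learn code:

import numpy as np
from scipy import sparse

def squared_euclidean_sparse(X, Y):
    # Squared L2 norms of each row.
    XX = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    YY = np.asarray(Y.multiply(Y).sum(axis=1)).ravel()
    # The middle term; SciPy's csr_matmat does the heavy lifting.
    middle = (X @ Y.T).toarray()
    return XX[:, np.newaxis] - 2.0 * middle + YY[np.newaxis, :]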

@glemaitre glemaitre changed the title Regression in k-neighbours with sparse data EHN: use the sparse-sparse backend for computing pairwise distance Jan 26, 2024
@lesteve lesteve changed the title EHN: use the sparse-sparse backend for computing pairwise distance ENH: use the sparse-sparse backend for computing pairwise distance Feb 1, 2024
@jeremiedbb jeremiedbb removed this from the 1.4.1 milestone Feb 7, 2024