ENH: use the sparse-sparse backend for computing pairwise distance #28191

Open · glemaitre opened this issue Jan 19, 2024 · 7 comments

@glemaitre (Member) commented Jan 19, 2024

First reported in: scikit-learn-contrib/imbalanced-learn#1056

We have a performance regression in kneighbors with sparse matrices from 1.1.X to 1.3.X.
A code sample to reproduce:

# %%
import sklearn
sklearn.__version__

# %%
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, 10_000
X = sparse.random(n_samples, n_features, density=0.01, format="csr", random_state=0)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=n_samples)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# %%
%%timeit
knn.kneighbors(X, return_distance=False)

1.1.X: 21.5 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

main: 1.16 s ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Small benchmark

(figure: elapsed time vs. n_features for versions 1.1.3 and 1.5.dev0)

# %%
import sklearn
sklearn.__version__

# %%
import time
from collections import defaultdict
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, [500, 1_000, 2_500, 5_000, 7_500, 10_000, 25_000, 50_000]

results = defaultdict(list)
for nf in n_features:
    X = sparse.random(n_samples, nf, density=0.01, format="csr", random_state=0)
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=n_samples)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    start = time.time()
    knn.kneighbors(X, return_distance=False)
    elapsed_time = time.time() - start
    results["version"].append(sklearn.__version__)
    results["n_features"].append(nf)
    results["elapsed_time"].append(elapsed_time)

# %%
import pandas as pd

results = pd.DataFrame(results)
results.to_csv(f"bench_{sklearn.__version__}.csv", index=False)
# %%
import pandas as pd

results_main = pd.read_csv("bench_1.5.dev0.csv")
results_1_1 = pd.read_csv("bench_1.1.3.csv")
results = pd.concat([results_main, results_1_1], axis=0)
results

# %%
import seaborn as sns

sns.set_context("talk")

# %%
sns.relplot(
    data=results,
    x="n_features",
    y="elapsed_time",
    hue="version",
    kind="line",
    height=6,
    aspect=1.5,
    legend=True,
)
sns.despine(offset=10, trim=True)

Debug profiling

From my first investigation, it looks like we are spending all our time in the following function:

cdef void _middle_term_sparse_sparse_64(
    const float64_t[:] X_data,
    const int32_t[:] X_indices,
    const int32_t[:] X_indptr,
    intp_t X_start,
    intp_t X_end,
    const float64_t[:] Y_data,
    const int32_t[:] Y_indices,
    const int32_t[:] Y_indptr,
    intp_t Y_start,
    intp_t Y_end,
    float64_t * D,
) noexcept nogil:
    # This routine assumes that D points to the first element of a
    # zeroed buffer of length at least equal to n_X × n_Y, conceptually
    # representing a 2-d C-ordered array.
    cdef:
        intp_t i, j, k
        intp_t n_X = X_end - X_start
        intp_t n_Y = Y_end - Y_start
        intp_t x_col, x_ptr, y_col, y_ptr

    for i in range(n_X):
        for x_ptr in range(X_indptr[X_start+i], X_indptr[X_start+i+1]):
            x_col = X_indices[x_ptr]
            for j in range(n_Y):
                k = i * n_Y + j
                for y_ptr in range(Y_indptr[Y_start+j], Y_indptr[Y_start+j+1]):
                    y_col = Y_indices[y_ptr]
                    if x_col == y_col:
                        D[k] += -2 * X_data[x_ptr] * Y_data[y_ptr]
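
For scale: this kernel compares every stored value of a row of X with every stored value of every row of Y, so its cost grows as n_X × nnz_per_X_row × n_Y × nnz_per_Y_row. With the reproducer's shapes (1 000 samples and about 100 nonzeros per row on each side), that is on the order of 10^10 innermost iterations. Since CSR column indices are sorted, each (i, j) pair could instead be resolved with a two-pointer merge in O(nnz_x + nnz_y). A pure-Python illustration of that idea, not the library code:

import numpy as np
from scipy import sparse

def middle_term_merge(X, Y):
    # Compute -2 * X @ Y.T pair by pair, merging the sorted index lists.
    X.sort_indices()  # the merge below requires sorted column indices
    Y.sort_indices()
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        x_start, x_end = X.indptr[i], X.indptr[i + 1]
        for j in range(Y.shape[0]):
            x_ptr, y_ptr = x_start, Y.indptr[j]
            y_end = Y.indptr[j + 1]
            acc = 0.0
            while x_ptr < x_end and y_ptr < y_end:
                if X.indices[x_ptr] == Y.indices[y_ptr]:
                    acc += X.data[x_ptr] * Y.data[y_ptr]
                    x_ptr += 1
                    y_ptr += 1
                elif X.indices[x_ptr] < Y.indices[y_ptr]:
                    x_ptr += 1
                else:
                    y_ptr += 1
            D[i, j] = -2.0 * acc
    return D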

But I'm a bit rusty with profiling native code. I need a bit more time to investigate.
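
One option for profiling through the compiled code is py-spy with native stack sampling (a tooling suggestion, not necessarily what was used in this thread); reproducer.py is a hypothetical name for a file holding the snippet above:

py-spy record --native -o profile.svg -- python reproducer.py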

@glemaitre (Member Author)

I'm under the impression that a large number of features was not taken into account when benchmarking in #24556.

@jjerphan (Member)

I confirm that nearly all the time (>99%) is spent in _middle_term_sparse_sparse_*.

@glemaitre (Member Author)

FYI, I'm running a larger benchmark over different n_features, n_samples, and densities of the sparse matrix, to see whether we should keep a single backend or switch in a smart manner to the previous dispatching mechanism.

@glemaitre (Member Author)

I made an additional benchmark to check the influence of the density of the matrix:

(figure: elapsed time as a function of the matrix density)

Density is therefore another important variable for deciding when to make the switch.
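
A minimal sketch of such a density benchmark (the actual grid of densities behind the figure is not given here, so the values below are illustrative):

import time
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

n_samples, n_features = 1_000, 10_000
for density in [0.001, 0.005, 0.01, 0.05, 0.1]:
    X = sparse.random(n_samples, n_features, density=density, format="csr", random_state=0)
    y = np.random.default_rng(0).integers(0, 2, size=n_samples)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    start = time.time()
    knn.kneighbors(X, return_distance=False)
    print(f"density={density}: {time.time() - start:.3f} s")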

@adrinjalali (Member)

How likely is it that this gets fixed pretty quickly? I wonder how realistic it is to include it in 1.4.1.

@jjerphan (Member)

A quick fix is to change the definition of is_usable_for to exclude this case.
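
Something along these lines, as a hypothetical sketch (the real check is the is_usable_for classmethod of the dispatcher in sklearn.metrics._pairwise_distances_reduction, which takes more arguments and performs more checks):

from scipy.sparse import issparse

def is_usable_for(X, Y, metric):
    # Hypothetical simplification: refuse the sparse-sparse case so that
    # the dispatcher falls back to the previous (SciPy-based) code path.
    if issparse(X) and issparse(Y):
        return False
    # ... the existing dtype / metric / shape checks would remain here ...
    return True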

@jjerphan (Member)

csr_matmat from SciPy was called before and likely has a better algorithm.
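
For context, the previous code path reduces to the expansion ||x − y||² = ||x||² − 2 x·y + ||y||², where the −2 x·y middle term is a sparse-sparse matrix product handled by SciPy (csr_matmat under the hood). A rough sketch of that computation, not the exact scikit-learn code:

import numpy as np
from scipy import sparse

def squared_euclidean_sparse(X, Y):
    # Squared L2 norms of each row.
    XX = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    YY = np.asarray(Y.multiply(Y).sum(axis=1)).ravel()
    # The middle term; SciPy's csr_matmat does the heavy lifting.
    middle = (X @ Y.T).toarray()
    return XX[:, np.newaxis] - 2.0 * middle + YY[np.newaxis, :]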

@glemaitre glemaitre changed the title Regression in k-neighbours with sparse data EHN: use the sparse-sparse backend for computing pairwise distance Jan 26, 2024
@lesteve lesteve changed the title EHN: use the sparse-sparse backend for computing pairwise distance ENH: use the sparse-sparse backend for computing pairwise distance Feb 1, 2024
@jeremiedbb jeremiedbb removed this from the 1.4.1 milestone Feb 7, 2024