Skip to content

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented Mar 25, 2024

Fix #28625. More context in scipy/scipy#20294.

It is easy to add a test, but for now we only have a single Windows build using MKL. The main question is do we want to add another Windows build with OpenBLAS?

Copy link

github-actions bot commented Mar 25, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 13c6fdb. Link to the linter CI: here

@@ -36,8 +37,9 @@ cdef float64_t[::1] _sqeuclidean_row_norms64_dense(
intp_t d = X.shape[1]
float64_t[::1] squared_row_norms = np.empty(n, dtype=np.float64)

for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)
with threadpool_limits(limits=1, user_api='blas'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here because it's surprising that it's necessary while _dot is already single threaded. Maybe with a link to the issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_dot is already single threaded.

Actually it looks like it's not. see scipy/scipy#20294 (comment). It's weird cause every time I tested it, I only saw a single thread running, no matter the length of the arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just use nrm2 as suggested here scipy/scipy#20294 (comment), even though it implies take a square root and then square back. From an efficiency point of view, it should not have a visible impact since the computation of the norms is not the most costly part of the pairwise distance reduction compuations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know this but ddot is not necessarily single-threaded, see scipy/scipy#20294 (comment)

@ogrisel
Copy link
Member

ogrisel commented Mar 25, 2024

Instead of calling threadpool_limits inside _sqeuclidean_row_norms64_dense which is itself called by the constructor of EuclideanArgKmin64 which is itself called by the ArgKmin64.compute class method, I would prefer to move the existing call to threadpool_limits in ArgKmin64.compute to include both the call to the EuclideanArgKmin64 and the _parallel_on_*:

if metric in ("euclidean", "sqeuclidean"):
# Specialized implementation of ArgKmin for the Euclidean distance
# for the dense-dense and sparse-sparse cases.
# This implementation computes the distances by chunk using
# a decomposition of the Squared Euclidean distance.
# This specialisation has an improved arithmetic intensity for both
# the dense and sparse settings, allowing in most case speed-ups of
# several orders of magnitude compared to the generic ArgKmin
# implementation.
# For more information see MiddleTermComputer.
use_squared_distances = metric == "sqeuclidean"
pda = EuclideanArgKmin{{name_suffix}}(
X=X, Y=Y, k=k,
use_squared_distances=use_squared_distances,
chunk_size=chunk_size,
strategy=strategy,
metric_kwargs=metric_kwargs,
)
else:
# Fall back on a generic implementation that handles most scipy
# metrics by computing the distances between 2 vectors at a time.
pda = ArgKmin{{name_suffix}}(
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, metric, metric_kwargs),
k=k,
chunk_size=chunk_size,
strategy=strategy,
)
# Limit the number of threads in second level of nested parallelism for BLAS
# to avoid threads over-subscription (in GEMM for instance).
with threadpool_limits(limits=1, user_api="blas"):
if pda.execute_in_parallel_on_Y:
pda._parallel_on_Y()
else:
pda._parallel_on_X()

More precisely, move the with threadpool_limits(limits=1, user_api="blas") to start at line 59 instead of 89.

This will this will reduce the number of successive calls to openblas_set_num_threads from 4 to 2 when calling ArgKmin64.compute. And this will also make threadpool_limits(limits=1, user_api="blas") always happen at a consistent API level with the pairwise distance reduction codebase.

Note that: in addition to working around the thread-safety problem of recent OpenBLAS on Windows, this change can also protect against oversubscription problems for all platforms.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel
Copy link
Member

ogrisel commented Mar 26, 2024

Not sure if we need a changelog entry or not for this. Probably we should (probably as a fix).

lesteve and others added 2 commits March 26, 2024 11:56
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@lesteve
Copy link
Member Author

lesteve commented Mar 26, 2024

I added a short changelog entry. Not sure if it is worth trying to mention all the estimators that can be affected by this.

Also another question: should this be back-ported to 1.4.2?

@jeremiedbb
Copy link
Member

Not sure if it is worth trying to mention all the estimators that can be affected by this.

If you want to keep it simple, you can just say "neighbor based algorithms" :)

Also another question: should this be back-ported to 1.4.2?

I think so since 1.4 users are impacted

@jeremiedbb jeremiedbb added this to the 1.4.2 milestone Mar 26, 2024
@jeremiedbb jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Mar 26, 2024
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the changelog entry to target 1.4.2. LGTM, thanks.

@jeremiedbb jeremiedbb enabled auto-merge (squash) March 26, 2024 13:19
@jeremiedbb jeremiedbb merged commit 1bbb228 into scikit-learn:main Mar 26, 2024
@lesteve lesteve deleted the avoid-openblas-windows-hang branch March 26, 2024 14:25
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 8, 2024
…istances calculation (scikit-learn#28692)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
@jeremiedbb jeremiedbb mentioned this pull request Apr 8, 2024
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cython module:metrics To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ArgKmin64 on Windows with scipy 1.13rc1 or 1.14.dev times out
3 participants