FEA Introduce `PairwiseDistances`, a generic back-end for `pairwise_distances` #25561
Conversation
Hi @Vincent-Maladiere, it looks like you are on the right track. 👍

Do you think you can rerun benchmarks on other cases and provide results? 🙂 Especially:

- on `metric="manhattan"` solely
- on the dense-dense and csr-csr combinations
- on `np.float64` and `np.float32` data
- on 1 thread and on 16 threads
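For concreteness, a minimal benchmarking sketch covering those combinations (hypothetical shapes; it assumes `threadpoolctl` caps the OpenMP threads used by the back-end, otherwise `OMP_NUM_THREADS` can be set before launching Python):

```python
# Hedged sketch: time pairwise_distances for manhattan over
# {dense-dense, csr-csr} x {float64, float32} x {1, 16} threads.
from time import perf_counter

import numpy as np
import scipy.sparse as sp
from sklearn.metrics import pairwise_distances
from threadpoolctl import threadpool_limits

rng = np.random.RandomState(0)

for dtype in (np.float64, np.float32):
    X = rng.uniform(size=(2000, 100)).astype(dtype)
    for case, data in (("dense-dense", X), ("csr-csr", sp.csr_matrix(X))):
        for n_threads in (1, 16):
            with threadpool_limits(limits=n_threads):
                tic = perf_counter()
                pairwise_distances(data, data, metric="manhattan")
                duration = perf_counter() - tic
            print(f"{case}, {np.dtype(dtype).name}, {n_threads} threads: {duration:.3f}s")
```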
Also, as discussed in 1:1, I think we'd better skip the `n_jobs=1` case for now and work on it as part of a second PR.
Moreover, this PR needs a whatsnew entry for 1.3.
Finally, depending on other maintainers' opinions, we might want to use a feature branch (as recently chosen by @Vincent-Maladiere with `feature/PairwiseDistances`). In our opinion, this would allow validating parts of the parameters' combinations independently from one another, easing review and integration while avoiding partial work being integrated in `main`.
What do other reviewers think?
```
@@ -33,79 +29,3 @@ def _chi2_kernel_fast(floating[:, :] X,
            if nom != 0:
                res += denom * denom / nom
        result[i, j] = -res


def _sparse_manhattan(
```
I think this function is the simplest and most frugal implementation for computing the pairwise Manhattan distances on a pair of CSR datasets. Yet I think supporting all of the combinations (of metrics, {dense, sparse}², float32 and float64, etc.) with such hand-written implementations is hardly maintainable.

The abstractions that we develop with Cython have a cost (which we can estimate against this implementation), but I think they ease future extensions.

What do people think of this? Are the Cython abstractions reasonable?
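For reference, here is a naive NumPy/SciPy sketch of what this function computes; it is only a readable reference for checking results, not the frugal implementation under discussion:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import manhattan_distances

def manhattan_csr_reference(X, Y):
    """Naive O(n_X * n_Y * n_features) pairwise Manhattan distances."""
    Y_dense = Y.toarray()
    D = np.empty((X.shape[0], Y.shape[0]), dtype=np.float64)
    for i in range(X.shape[0]):
        # Densify a single row of X at a time to keep memory bounded.
        row = X[i].toarray().ravel()
        D[i, :] = np.abs(row - Y_dense).sum(axis=1)
    return D

X = sp.random(50, 20, density=0.3, format="csr", random_state=0)
Y = sp.random(40, 20, density=0.3, format="csr", random_state=1)
np.testing.assert_allclose(manhattan_csr_reference(X, Y), manhattan_distances(X, Y))
```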
As you said elsewhere, I think we need a few comparative benchmarks (dense and sparse) to answer that question.

I agree the Tempita Cython-class-oriented code offers more flexibility to support all combinations of memory representation / dtypes / metrics, and I would lean towards removing special cases if the performance impact is OK.
I'm not sure how I feel about merging a PR that regresses the single-threaded case, though. It's true that on the positive side, most workloads are run with at least 2 (or 4) threads per node nowadays. So maybe nobody would notice the performance regression in practice. But still...
As explained below, let's focus this PR on the non-Euclidean case so that we can merge it without implementing the GEMM trick / Euclidean specialization, without risking a performance regression for the single-threaded case, and without having to handle the deprecation of the precomputed `X_norm_squared` / `Y_norm_squared` parameters.

Then we will study the Euclidean case in a follow-up PR.

I am not 100% convinced the Euclidean specialization is needed if we are careful to benchmark equivalent things by correctly controlling the impact of `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS`, both in `main` (or 1.2.1) and in the Cython PR with the Euclidean case enabled. But let's delay that for later and finalize the non-Euclidean cases first.
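Pinning those environment variables has to happen before NumPy/OpenBLAS initialize their thread pools, so a benchmark script would do it first thing (a sketch; the values shown are just an example):

```python
import os

# Must be set before importing numpy/sklearn so that main and this branch
# are measured under identical threading settings.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402
from sklearn.metrics import pairwise_distances  # noqa: E402
```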
```python
    )
    with pytest.raises(AssertionError):
        assert_allclose(wrong_D, D1)
```
This would no longer pass, because `X_norm_squared` and `Y_norm_squared` are no longer used as long as we do not implement the Euclidean specialization / GEMM trick, right?

It seems like a potentially problematic silent change. If we really decide not to do the Euclidean specialization, we should deprecate those `X_norm_squared` and `Y_norm_squared` parameters everywhere in the public API.

If we plan to re-introduce the Euclidean specialization in the Cython code, then we should just comment out this assertion and explain that it should be re-enabled once we implement the Euclidean specialization in Cython.

Or even better, we could make this PR focus on the non-Euclidean cases only, and make sure that the Cython code is not "usable for" `metric="(sq)euclidean"` / `metric="minkowski"` with `p=2` for now, to keep the existing numpy-based code with the GEMM trick for the time being.

This way we would not introduce a performance regression for the single-threaded case for now.
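For context, a minimal NumPy sketch of the GEMM trick mentioned above (the parameter names mirror the public `X_norm_squared` / `Y_norm_squared` API; the function itself is illustrative):

```python
import numpy as np

def squared_euclidean_gemm(X, Y, X_norm_squared=None, Y_norm_squared=None):
    """||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, with X @ Y.T as a single GEMM."""
    if X_norm_squared is None:
        X_norm_squared = np.einsum("ij,ij->i", X, X)
    if Y_norm_squared is None:
        Y_norm_squared = np.einsum("ij,ij->i", Y, Y)
    D = X_norm_squared[:, np.newaxis] - 2.0 * (X @ Y.T) + Y_norm_squared[np.newaxis, :]
    # Floating-point cancellation can leave tiny negative values; clip them.
    np.maximum(D, 0.0, out=D)
    return D
```

The speed comes from delegating the dominant `X @ Y.T` term to multithreaded BLAS, which is also why precomputed norms are only useful on this code path.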
To answer this remark in #25561 (review):
As proposed in #25561 (review), we can simultaneously:
A few comments to complete @ogrisel's remark.

Running this asv benchmark with this script on 1 and 16 cores confirms our intuition that we get a performance increase for the dense-dense case. As discussed IRL with @jjerphan, we have a performance regression for the sparse-sparse, dense-sparse, and sparse-dense cases, which could pave the way for the next PRs on this feature branch.

Edit: after an IRL discussion with @jjerphan, we think this PR brings value because:
Following the remark given by @Vincent-Maladiere in #25561 (comment) (see the edit), I think this PR has more value than I initially thought, since it enlarges the feature scope to support fused sparse and dense dataset pairs. Moreover, I think we can accept a performance regression for computing Manhattan distances on pairs of CSR datasets.

To me, the following items have to be addressed so that this PR can be merged in `feature/PairwiseDistances` (or alternatively in `main`):

- rename `PairwiseDistance` to `PairwiseDistances`
- modify `PairwiseDistances.is_usable_for` and `PairwiseDistances.valid_metrics` so that the following cases aren't treated yet:
  - `n_jobs=1` / `effective_n_threads=1`, as pointed out in #25561 (comment)
  - `metric="sqeuclidean"`
- treat the remaining comments, e.g.:
  - adapt docstrings as hinted by https://github.com/scikit-learn/scikit-learn/pull/25561/files#r1098634886 and https://github.com/scikit-learn/scikit-learn/pull/25561/files#r1098634652
  - revert some unrelated changes to `ArgKmin`
  - others
- add one changelog entry in `whats_new/v1.3.rst` for performance improvements for `pairwise_distances` on the dense-dense case
- add one changelog entry in `whats_new/v1.3.rst` for the support of the CSR-CSR, dense-CSR, CSR-dense cases for `pairwise_distances`

What do others think?
This does not seem to always be the case in the benchmark results: there are dense-dense cases with a 2x slowdown. How do you explain this?

Edit: those are the sequential cases. Could you please re-run the benchmarks with 2 threads?

Edit: actually, this is already visible in the second plot at the beginning of the PR.
+1 for updating the changelog as quickly as possible in PRs. It helps reviewers better understand the user-facing scope of a PR by making it explicit.
I suppose you mean making `is_usable_for` reject the single-threaded case. If so, +1. This means that at least in the short term, we keep on using the scipy metrics implementation in the dense-dense, single-threaded case and would use the new Cython infrastructure for all the other cases.
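A hypothetical sketch of that gating (illustrative helper, not the actual `is_usable_for` implementation):

```python
from scipy.sparse import issparse

def use_new_backend(X, Y, metric, effective_n_threads):
    """Route to the new Cython back-end only where it is expected to win."""
    if metric in ("euclidean", "sqeuclidean"):
        # Keep the existing GEMM-based NumPy code for the Euclidean case.
        return False
    if effective_n_threads == 1 and not issparse(X) and not issparse(Y):
        # Sequential dense-dense: the existing SciPy-based code stays faster.
        return False
    return True
```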
Our new benchmark is consistent with what we expected. Note that we don't use `PairwiseDistances` for the Euclidean distance yet.
A few last remarks and suggestions to complete #25561 (review), which I corrected after @ogrisel's remark in #25561 (comment).
Of course -- no rush either. I just wanted to make sure the project wasn't entirely set aside :)
#26471 (which has recently been merged in `main`) introduced some conflicts with this PR. Feel free to let us know if you need help resolving them. :)
On it, let's try to have something working by the end of the week :)
It seems that the merge history here got messed up, @Vincent-Maladiere.
@Vincent-Maladiere: if your reflog still contains the previous state, you can find the commit of the previous state using the reflog and reset the branch to it:

```bash
# Get the latest changes from the base branch (which is up-to-date)
git checkout feature/PairwiseDistances
git pull upstream feature/PairwiseDistances

# Check out the previous commit
git checkout feat/pairwise_distances-pdr-backend
git reflog  # find the commit
git checkout <commit>

# Force the branch of this PR back to this commit
git branch -f feat/pairwise_distances-pdr-backend

# Check out back to this branch
git checkout feat/pairwise_distances-pdr-backend

# Update the branch; you might have conflicts to solve
git merge feature/PairwiseDistances

# After fixing conflicts, run isort on the files that you
# have modified via pre-commit
pre-commit install
pre-commit run

# Finish the merge
git merge --continue

# Inspect the diff with the base branch
git diff feature/PairwiseDistances...

# If the diff is clear, force-push the branch
git push -f fork feat/pairwise_distances-pdr-backend
```
Hey! Yes, sorry about that: after rebasing on main, it seems that feature/PairwiseDistances hadn't been synced for a while. @glemaitre has rebased feature/PairwiseDistances on main, so after fixing the conflicts, hopefully the large diffs will vanish. Thanks @jjerphan for the instructions!
✔️ Linting Passed. All linting checks passed. Your pull request is in excellent shape! ☀️ Generated for commit: 54600aa
I'm running the benchmark again on {Dense, CSR} x {Dense, CSR} -- {manhattan, euclidean} distance -- {1, 2, 4, 6} threads, to make sure our conclusion is the same.
Blue is main, orange is this branch.

We obtain a slight speed-up for Euclidean distances in the Dense x CSR case (why?) and no difference otherwise, which matches our expectations, since we're not using PairwiseDistances for the Euclidean distance yet.

[benchmark plots: Euclidean distance on Dense x Dense, Dense x CSR, CSR x CSR]

However, the Manhattan distance is significantly worse when using Dense x CSR or CSR x CSR.

[benchmark plots: Manhattan distance on Dense x Dense, Dense x CSR, CSR x CSR]

The CSR implementations we use are: main: [link]; branch: [link]. The main difference between these implementations is where the parallelization happens. This branch makes more function calls to the distance routine, and this might explain the decrease.

Do you have any other ideas to explain the performance decrease? WDYT?
By the way, our tests are failing because of some connection error to conda-forge.
@Vincent-Maladiere could you generate a speedscope-compatible profile report for both methods? It'll help us spot any less-than-obvious costs, as well as help us confirm valid assumptions. You can see this comment for instructions: #26316 (comment)
Okay, let's profile this. TBH, I was hoping that you or @jjerphan knew what might have impaired the benchmark, so that I wouldn't need to run py-spy and speedscope 😅
Okay, after some initial profiling, it looks like we're running into the same problems we had with the refactor attempts for the entire backend: we forfeit a lot of performance due to indirection and call overhead with this approach. I didn't expect it to be this dramatic, but alas.

This can be partially mitigated by only passing memoryviews by reference rather than by value (dropping peripheral information like shape) wherever call overhead is a significant factor (e.g. in tight loops).

I'm not sure that there is a way around this without copy-and-pasting entire swathes of the implementations. Options like code generation and templating could solve this, but at the cost of build time and binary size, along with the obvious maintenance burden. Abstraction is directly in contest with performance here.

With that being said, this is really only a problem for the cases where the benchmarks above show regressions.
@Vincent-Maladiere would you like to keep working on this PR, or would you prefer I take over efforts here and let you focus on other things? :)
Superseded by #26983. Please feel free to leave comments/suggestions on the new PR. Thank you @Vincent-Maladiere for all the work you've done so far!
Reference Issues/PRs
Towards #23958
What does this implement/fix? Explain your changes.
This simplifies the original implementation of `PairwiseDistance` by @jjerphan, with the following differences:

- `PairwiseDistance{32,64}` doesn't subclass `BaseDistancesReduction{32,64}` anymore.
- The `_parallel_on_{X,Y}` methods move to `PairwiseDistance{32,64}`, since these methods are decorated with `@final` in `BaseDistancesReduction{32,64}` and thus can't be overridden.
- `chunk_size = 1`, as proposed by @ogrisel in this comment.

Following this benchmark, we found that this PR yields a significant performance regression when `n_jobs = 1` and an improvement when `n_jobs > 1`, for both euclidean and manhattan distances.

Any other comments?
As suggested by @jjerphan, decorating `DistanceMetric` subclasses with `@final` could alleviate some of the overhead and make this implementation competitive with `main` when `n_jobs=1`.
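To illustrate that suggestion, a hedged sketch in Cython's pure-Python mode (the class name and body are hypothetical stand-ins, not scikit-learn's `DistanceMetric`):

```python
import cython

@cython.final   # No subclass can override dist(), so when compiled, calls to
@cython.cclass  # it can be resolved statically instead of via virtual dispatch.
class IllustrativeManhattan:
    def dist(self, x, y):
        # Per-pair Manhattan distance; the real metric does this in typed C code.
        return sum(abs(a - b) for a, b in zip(x, y))
```

When run as plain Python, these decorators are no-ops; the devirtualization benefit only materializes once the module is compiled with Cython.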