
FEA Introduce PairwiseDistances, a generic back-end for pairwise_distances #25561


Conversation

Vincent-Maladiere
Contributor

@Vincent-Maladiere Vincent-Maladiere commented Feb 7, 2023

Reference Issues/PRs

Towards #23958

What does this implement/fix? Explain your changes.

This simplifies the original implementation of PairwiseDistance by @jjerphan, with the following differences:

  • PairwiseDistance{32,64} doesn't subclass BaseDistancesReduction{32,64} anymore.
  • This makes it possible to add _parallel_on_{X,Y} methods to PairwiseDistance{32,64}: these methods are decorated with @final in BaseDistancesReduction{32,64} and thus can't be overridden by subclasses.
  • This also removes the chunked computation mechanism by considering only the case chunk_size = 1, as proposed by @ogrisel in this comment.
  • This doesn't implement the Euclidean specialization yet to make benchmarks simpler.

Following this benchmark, we found that this PR yields a significant performance regression when n_jobs = 1 and an improvement when n_jobs > 1, for both euclidean and manhattan distances:

euclidean

manhattan
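
For reference, a minimal timing sketch along the lines of this comparison (not the asv benchmark script actually used here; sizes and dtypes are illustrative) could look like:

import time

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(2_000, 10)
Y = rng.rand(2_000, 10)

for metric in ("euclidean", "manhattan"):
    for n_jobs in (1, -1):
        tic = time.perf_counter()
        pairwise_distances(X, Y, metric=metric, n_jobs=n_jobs)
        print(f"{metric} n_jobs={n_jobs}: {time.perf_counter() - tic:.3f}s")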

Any other comments?

As suggested by @jjerphan, decorating DistanceMetric subclasses with @final could alleviate some of the overhead and make this implementation competitive with main when n_jobs=1.
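
As a rough illustration of that idea, marking a distance class as final in Cython's pure-Python mode looks like the sketch below (a toy class for illustration only, not the actual DistanceMetric implementation):

import cython


@cython.final   # prevents subclassing, letting Cython devirtualize method calls
@cython.cclass  # compiled as an extension type when built with Cython
class ToyManhattanDistance:
    def dist(self, x, y) -> float:
        # naive reference body; the real implementation works on typed buffers
        return sum(abs(xi - yi) for xi, yi in zip(x, y))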

@Vincent-Maladiere Vincent-Maladiere changed the base branch from main to feature/PairwiseDistances February 7, 2023 12:32
Member

@jjerphan jjerphan left a comment

Hi @Vincent-Maladiere, it looks like you are on the right track. 👍

Do you think you can rerun the benchmarks on other cases, especially:

  • metric=manhattan solely
  • on the dense-dense and csr-csr combination
  • on np.float64 and np.float32 data
  • on 1 thread and on 16 threads

and provide results? 🙂

Also, as discussed in our 1:1, I think we had better skip the n_jobs=1 case for now and work on it as part of a second PR.

Moreover, this PR needs a whatsnew entry for 1.3.

Finally, depending on other maintainers' opinions, we might want to use a feature branch (as recently chosen by @Vincent-Maladiere with feature/PairwiseDistances). In our opinion, this would allow validating parts of the parameter combinations independently from one another, easing review and integration while avoiding partial work being integrated into main.

What do other reviewers think?

@@ -33,79 +29,3 @@ def _chi2_kernel_fast(floating[:, :] X,
if nom != 0:
res += denom * denom / nom
result[i, j] = -res


def _sparse_manhattan(
Member

I think this function is the simplest and the most frugal implementation for computing the pairwise manhattan distances on a pair of CSR datasets. Yet I think supporting all of the combinations (of metrics, {dense, sparse}², float32 and float64, etc.) by having such implementations is hardly maintainable.

The abstractions that we develop with Cython have a cost (that we can estimate against this implementation) but I think they ease future extensions.

What are people thinking of this? Are the Cython abstractions reasonable?

Member

As you said elsewhere, I think we need a few comparative benchmarks (dense and sparse) to answer that question.

I agree the tempita Cython-class oriented code offers more flexibility to support all combinations of memory representation / dtypes / metrics and would lean towards removing special cases if the performance impact is ok.

@ogrisel
Member

ogrisel commented Feb 8, 2023

Also, as discussed in our 1:1, I think we had better skip the n_jobs=1 case for now and work on it as part of a second PR.

I'm not sure how I feel about merging a PR into main with a performance regression in the single-threaded case.

It's true that, on the positive side, most workloads are run with at least 2 (or 4) threads per node nowadays. So maybe nobody would notice the performance regression in practice. But still...

Member

@ogrisel ogrisel left a comment

As explained below, let's focus this PR on the non-Euclidean case so that we can merge it without implementing the GEMM trick / Euclidean specialization, without risking a performance regression in the single-threaded case, and without changing the behavior or handling the deprecation of the precomputed X_norm_squared / Y_norm_squared parameters.

Then we can study the Euclidean case in a follow-up PR.

I am not 100% convinced the Euclidean specialization is needed if we are careful to benchmark equivalent things by correctly controlling the impact of OMP_NUM_THREADS / OPENBLAS_NUM_THREADS both in main (or 1.2.1) and in the Cython PR with the Euclidean case enabled. But let's delay that for later and finalize the non-Euclidean cases first.

)
with pytest.raises(AssertionError):
assert_allclose(wrong_D, D1)

Member

This would no longer pass because X_norm_squared and Y_norm_squared are no longer used as long as we do not implement the Euclidean specialization / GEMM trick, right?

It seems like a potentially problematic silent change. If we really decide not to do the Euclidean specialization, we should deprecate the X_norm_squared and Y_norm_squared parameters everywhere in the public API.

If we plan to re-introduce the Euclidean specialization in the Cython code, then we should just comment out this assertion and explain that it should be re-enabled once we implement the Euclidean specialization in Cython.

Or even better, we could make this PR focus on the non-Euclidean cases only, and make sure that the Cython code is not "usable for" metric="(sq)euclidean" / metric="minkowski" with p=2 for now, so as to keep the existing NumPy-based code with the GEMM trick for the time being.

This way we would not introduce a performance regression for the single-threaded case for now.

@jjerphan
Member

jjerphan commented Feb 8, 2023

To answer this remark in #25561 (review):

I'm not sure how I feel about merging a PR into main with a performance regression in the single-threaded case.

As proposed in #25561 (review), we can simultaneously:

  • use a feature branch to delay the inspection and the resolution
  • fall back on the previous implementation when n_jobs=1

Member

@jjerphan jjerphan left a comment

A few comments to complete @ogrisel's remark.

@Vincent-Maladiere
Contributor Author

Vincent-Maladiere commented Feb 8, 2023

Running this asv benchmark with this script on 1 and 16 cores confirms our intuition that performance improves for n_cores > 1, for both 32-bit and 64-bit data, on dense-dense matrices.

As discussed IRL with @jjerphan, we have performance regressions for the sparse-sparse, dense-sparse, and sparse-dense cases, which could pave the way for the next PRs on this feature branch.

       before           after         ratio
     [7917117e]       [ddc464ba]
     <main>           <pull/25561/head>
-         822+/-0ms          310+/-0ms     0.38  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'dense', 'dense', 16)
-         823+/-0ms          265+/-0ms     0.32  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'dense', 'dense', 16)
       before           after         ratio
     [7917117e]       [ddc464ba]
     <main>           <pull/25561/head>
+         1.07+/-0s          3.39+/-0s     3.17  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'sparse', 'sparse', 1)
+         1.10+/-0s          3.38+/-0s     3.08  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'sparse', 'sparse', 1)
+         1.34+/-0s          3.66+/-0s     2.73  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'sparse', 'dense', 1)
+         185+/-0ms          505+/-0ms     2.73  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'sparse', 'sparse', 16)
+         193+/-0ms          506+/-0ms     2.61  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'sparse', 'sparse', 16)
+         1.61+/-0s          3.61+/-0s     2.24  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'sparse', 'dense', 1)
+         823+/-0ms          1.76+/-0s     2.14  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'dense', 'dense', 1)
+         257+/-0ms          540+/-0ms     2.11  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'sparse', 'dense', 16)
+         822+/-0ms          1.68+/-0s     2.04  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'dense', 'dense', 1)
+         290+/-0ms          566+/-0ms     1.96  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'sparse', 'dense', 16)
+         2.38+/-0s          4.52+/-0s     1.90  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'dense', 'sparse', 1)
+         339+/-0ms          611+/-0ms     1.80  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float64'>, 'dense', 'sparse', 16)
+         2.52+/-0s          4.53+/-0s     1.80  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'dense', 'sparse', 1)
+         365+/-0ms          644+/-0ms     1.76  time_pairwise_distances(10000, 10000, 10, 'manhattan',  <class 'numpy.float32'>, 'dense', 'sparse', 16)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Edit: after an IRL discussion with @jjerphan, we think this PR brings value because:

  • Currently, on the main branch, manhattan is the only distance with sparse-sparse support. It relies on the most efficient Cython implementation, with no dispatch or abstraction cost, which puts the observed performance regression into perspective.
  • Moreover, the remaining distance metrics don't provide sparse-sparse, sparse-dense, and dense-sparse support, nor does any other library in the scientific Python ecosystem.
  • Therefore, this PR not only improves performance in the dense-dense case but also unifies the sparse-sparse, sparse-dense, and dense-sparse support for all distance metrics.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Member

@jjerphan jjerphan left a comment

Following the remark given by @Vincent-Maladiere in #25561 (comment) (see the edit), I think this PR has more value than I initially thought, since it enlarges the feature scope for the support of sparse and dense dataset pairs. Moreover, I think we can accept a performance regression for computing Manhattan distances on pairs of CSR datasets.

To me, the following items have to be addressed so that this PR can be merged into feature/PairwiseDistances (or alternatively into main):

What do others think?

@ogrisel
Member

ogrisel commented Feb 10, 2023

Therefore, this PR not only improves performance in the dense-dense case but also unifies the sparse-sparse, sparse-dense, and dense-sparse support for all distance metrics.

This does not seem to always be the case in the benchmark results: there are dense-dense cases with a 2x slowdown. How do you explain this?

Edit: those are the sequential cases.

Could you please re-run the benchmarks with 2 threads?

Edit: actually, this is already visible in the second plot at the beginning of the PR.

@ogrisel
Member

ogrisel commented Feb 10, 2023

+1 for updating the changelog as quickly as possible in PRs. It helps reviewers better understand the user-facing scope of a PR by making it explicit.

@ogrisel
Member

ogrisel commented Feb 10, 2023

modify PairwiseDistances.is_usable_for and PairwiseDistances.valid_metrics so that the following cases aren't yet treated:
n_jobs=1

I suppose you mean the effective_n_threads measured via OpenMP, not the n_jobs passed by the user to a public-facing API that would be meant to control the use of joblib?

If so, +1. This means that, at least in the short term, we keep on using the SciPy metrics implementation in the dense-dense, single-threaded case and use the new Cython infrastructure for all the other cases.
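
A rough sketch of such a guard (illustrative only: _openmp_effective_n_threads is scikit-learn's private helper, and the function below is not actual scikit-learn code) could be:

from sklearn.utils._openmp_helpers import _openmp_effective_n_threads


def use_new_backend(metric):
    # Keep the existing NumPy/GEMM-based path for (squared) Euclidean metrics and
    # for the effectively single-threaded case, to avoid the regressions discussed above.
    if metric in ("euclidean", "sqeuclidean"):
        return False
    return _openmp_effective_n_threads() > 1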

@Vincent-Maladiere
Contributor Author

Vincent-Maladiere commented Feb 11, 2023

Our new benchmark is consistent with what we expected:

euclidean

manhattan

Note that we don't use n_jobs in this benchmark anymore, only limiting threads through threadpoolctl, which alleviates some of the overhead on main that led to the slowdown for n_jobs = 2 observed in the previous benchmark.
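
For instance, capping the thread count with threadpoolctl instead of passing n_jobs looks like this (illustrative sizes, not the actual benchmark script):

import numpy as np
from threadpoolctl import threadpool_limits

from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(2_000, 10)

with threadpool_limits(limits=2):  # caps OpenMP/BLAS thread pools at 2 threads
    D = pairwise_distances(X, metric="manhattan")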

@jjerphan jjerphan changed the title FEA Introduce PairwiseDistance FEA Introduce PairwiseDistances, a general back-end for pairwise_distances Feb 13, 2023
@jjerphan jjerphan changed the title FEA Introduce PairwiseDistances, a general back-end for pairwise_distances FEA Introduce PairwiseDistances, a generic back-end for pairwise_distances Feb 13, 2023
Member

@jjerphan jjerphan left a comment

A few last remarks and suggestions to complete #25561 (review), which I corrected after @ogrisel's remark in #25561 (comment).

@Vincent-Maladiere Vincent-Maladiere force-pushed the feat/pairwise_distances-pdr-backend branch from b289d9f to a569758 February 14, 2023 17:22
@Micky774
Contributor

Hey @Micky774 ! I'm not working on it currently but I can resume this PR in the next 2 weeks if that sounds good to you?

Of course -- no rush either. I just wanted to make sure the project wasn't entirely set aside :)

@Micky774 Micky774 self-requested a review May 19, 2023 15:21
@jjerphan
Member

jjerphan commented Jun 4, 2023

Hi @Vincent-Maladiere,

#26471 (which was recently merged into main) introduced a separation between the interface of DistanceMetric and its implementations. This created conflicts with this PR.

Feel free to let us know if you need help resolving them. :)

@Vincent-Maladiere
Contributor Author

On it, let's try to have something working by the end of the week :)

@Micky774
Contributor

It seems that the merge history here got messed up @Vincent-Maladiere

@jjerphan
Member

@Vincent-Maladiere: if you want to get your branch back to its previous state, you can find the commit of the previous state using the reflog and reset the branch to it:

# Get the latest changes from the base branch (which is up-to-date)
git checkout feature/PairwiseDistances
git pull upstream feature/PairwiseDistances

# Checkout to the previous commit
git checkout feat/pairwise_distances-pdr-backend
git reflog # find the commit
git checkout <commit>

# Force the branch of this PR back to this commit
git branch -f feat/pairwise_distances-pdr-backend

# Checkout back to this branch
git checkout feat/pairwise_distances-pdr-backend

# Update the branch, you might have conflicts to solve
git merge feature/PairwiseDistances

# After fixing conflicts, run isort on the files that you
# have modified via pre-commit.
pre-commit install
pre-commit run

# Finish the merge
git merge --continue

# Inspect the diff with the base branch
git diff feature/PairwiseDistances...

# If the diff is clear, force-push the branch 
git push -f fork feat/pairwise_distances-pdr-backend

@Vincent-Maladiere
Contributor Author

Hey! Yes, sorry about that: after rebasing on main, it turned out that feature/PairwiseDistances hadn't been synced for a while. @glemaitre has rebased feature/PairwiseDistances on main. So, after fixing the conflicts, hopefully the large diffs will vanish.

Thanks @jjerphan for the git branch -f trick, I'll try to fix the conflicts and simply push to the target feature branch one more time.

@github-actions

github-actions bot commented Jun 22, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 54600aa

@Vincent-Maladiere
Contributor Author

I'm running the benchmark again on {Dense, CSR} x {Dense, CSR} -- {manhattan, euclidean} distance -- {1, 2, 4, 6} threads, to make sure our conclusion is the same.

@Vincent-Maladiere
Contributor Author

Blue is main, orange is this branch.

We obtain a slight speed-up for Euclidean distances in the Dense x CSR case (why?) and no difference otherwise, which matches our expectations, since we're not using PairwiseDistances for the Euclidean distance yet.

(benchmark plots: euclidean)

However, the Manhattan distance is significantly worse for Dense x CSR and CSR x CSR, and in the n_threads = 1 case for Dense x Dense.

(benchmark plots: manhattan)

The CSR implementations we use are:

main: metrics._pairwise_fast._sparse_manhattan

branch: metrics._dist_metrics.ManhattanDistance.dist_csr

The main difference between these implementations is where parallelization happens: within the _sparse_manhattan function on main, versus outside dist_csr (in _parallel_on_X) on this branch.

Therefore, this branch makes more function calls to the distance routine, which might explain the slowdown.

Do you have any other ideas to explain the performance decrease? WDYT?

@Vincent-Maladiere
Contributor Author

By the way, our tests are failing because of a connection error to conda-forge.

@Micky774
Contributor

Micky774 commented Jun 23, 2023

@Vincent-Maladiere could you generate a speedscope-compatible profile report for both methods? It'll help us spot any less-than-obvious costs, as well as help us confirm valid assumptions. You can see this comment for instructions: #26316 (comment)

@Vincent-Maladiere
Contributor Author

Okay, let's profile this. TBH, I was hoping that you or @jjerphan knew what might have impaired the benchmark so that I don't need to run py-spy and speedscope 😅

@Micky774
Contributor

Micky774 commented Jul 4, 2023

Okay, after some initial profiling, it looks like we're running into the same problems we had with the refactor attempts for the entire backend. We forfeit a lot of performance due to indirection and function-call overhead with this approach. I didn't expect it to be this dramatic, but alas.

When SparseSparseDataset64.dist is called, despite it being a rather thin wrapper around self.distance_metric.dist_csr, around 50% of the time is spent purely pushing the arguments to the stack, while the rest is the actual computation (a significant portion of which is also taken up by loading the arguments). Overall, only ~32% of time is actually spent on core computation.

This can be partially mitigated by only passing memoryviews by reference rather than value (dropping peripheral information like shape) wherever call overhead is a significant factor (e.g. in tight loops). Specifically here, it would mean changing DistanceMetric.{r}dist_csr to accept a pointer to self.{X, Y}_indices rather than the memoryview itself (#26765). This helps bridge the gap. The remaining performance loss is due to the inability to inline the computations.

I'm not sure that there is a way around this without copy-and-pasting entire swathes of the implementations. Options like code-generation and templating could solve this but at the cost of build time and binary size, along with the obvious maintenance burden.

Abstraction is directly in contest with performance here.

With that being said, this is really only a problem for the manhattan distance, because we have a Cython specialization of it right now. We could keep using the custom specialization as a special case (solely to avoid a performance regression) and defer to the PairwiseDistances backend for all other cases.
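
A minimal sketch of that dispatch idea (hypothetical: new_backend stands in for the PairwiseDistances back-end and is not an existing scikit-learn function):

from scipy.sparse import issparse
from sklearn.metrics.pairwise import manhattan_distances


def compute_manhattan(X, Y, new_backend=None):
    if new_backend is None or (issparse(X) and issparse(Y)):
        # keep the specialized _sparse_manhattan fast path (reached via manhattan_distances)
        return manhattan_distances(X, Y)
    return new_backend(X, Y)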

@Micky774
Contributor

@Vincent-Maladiere would you like to keep working on this PR, or would you prefer I take over efforts here and let you focus on other things? :)

@Micky774
Contributor

Micky774 commented Aug 8, 2023

Superseded by #26983

Please feel free to leave comments/suggestions on the new PR. Thank you @Vincent-Maladiere for all the work you've done so far!

@Micky774 Micky774 closed this Aug 8, 2023
@Micky774 Micky774 added the Superseded PR has been replace by a newer PR label Aug 8, 2023