Skip to content

PERF speed up confusion matrix calculation #26820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

adrinjalali
Copy link
Member

Fixes #26808

Speed up confusion_matrix calculation.

This achieves the speedup by calculating the unique values on y_true and y_pred only once, and adds the required args along the way to the inner methods.

Not sure about the names of the added args on public functions, and might need to refactor to hide them if necessary.

Also this needs to add the same for some other functions to make the tests pass, but for now this works, and brings the time from about 15s to 5s on my machine.

import numpy as np

from sklearn.metrics import classification_report

y_true = np.random.randint(0, 2, size=2**23)
y_pred = y_true.copy()
np.random.shuffle(y_pred[2**20 : 2**21])

print(
    classification_report(
        y_true=y_true,
        y_pred=y_pred,
        digits=10,
        output_dict=False,
        zero_division=0,
    )
)

@github-actions
Copy link

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 75929c9. Link to the linter CI: here

@fingoldo
Copy link

Hi, what is required to merge this change?

@adrinjalali
Copy link
Member Author

reviews, our bottleneck is usually reviewer time ;)

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, I like removing the unique calls. On the other hand, I do not like adding more to the public API. I'm overall +0.5.

Comment on lines +424 to +425
unique_y_true=None,
unique_y_pred=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

labels, unique_y_true and unique_y_pred makes the public API looks really bloated. The only alternative I see is to have a private function.

Copy link
Member

@glemaitre glemaitre Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @thomasjpfan. I see only 2 solutions:

  • Make some private function where the public one is a wrapper around
  • Implementing our own efficient unique for NumPy arrays but it could be rather complex

(third solution is to contribute upstream :))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up classification_report
4 participants