PERF speedup classification_report by attaching unique values to dtype.metadata #29738

adrinjalali · 2024-08-29T11:50:54Z

Fixes #26808
Closes #26820

This is an alternative to #26820, where we attach the unique values to the dtype.metadata of a view on y.

This gets the same speedup as reported in #26820 but is a lot cleaner IMO.
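For readers unfamiliar with the trick, here is a minimal sketch of the idea, not the code in this PR. The helper names `attach_unique_sketch` and `cached_unique_sketch` are hypothetical and only illustrate how unique values can ride along on a view's dtype while the original array stays untouched.

```python
import numpy as np


def attach_unique_sketch(y):
    """Return a view of y whose dtype carries the unique values (hypothetical helper)."""
    unique = np.unique(y)
    # A dtype can carry an arbitrary metadata mapping; the view shares y's memory.
    dtype_with_meta = np.dtype(y.dtype, metadata={"unique": unique})
    return y.view(dtype=dtype_with_meta)


def cached_unique_sketch(y):
    """Return the cached unique values if present, else compute them (hypothetical helper)."""
    metadata = getattr(y.dtype, "metadata", None)
    if metadata is not None and "unique" in metadata:
        return metadata["unique"]
    return np.unique(y)


y = np.array([2, 0, 1, 2, 0])
y_view = attach_unique_sketch(y)

# The second lookup reuses the stored array instead of calling np.unique again.
labels = cached_unique_sketch(y_view)
```

The speedup comes from every later lookup on `y_view` skipping the sort inside `np.unique`; the original `y` keeps a dtype without metadata.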

WDYT @ogrisel @glemaitre @thomasjpfan

(I'm still working on speeding up np.unique independent of this)

github-actions · 2024-08-29T11:52:12Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c6ced7c.

thomasjpfan

This is pretty cool. I did not know you could attach metadata to a dtype.

thomasjpfan · 2024-08-29T12:13:52Z

sklearn/utils/_unique.py

+    unique = xp.unique_values(y)
+    try:
+        unique_dtype = np.dtype(y.dtype, metadata={"unique": unique})
+        return y.view(dtype=unique_dtype)


When you pickle or np.save this view, does the dtype metadata get serialized as well?

pickle:

In [1]: import numpy as np
In [2]: a = np.array([1])
In [4]: m_dtype = np.dtype(a.dtype, metadata={"unique": np.unique(a)})
In [5]: b = a.view(dtype=m_dtype)
In [7]: from pickle import loads, dumps
In [8]: loads(dumps(b)).dtype.metadata
Out[8]: mappingproxy({'unique': array([1])})

However, np.save complains:

In [10]: from io import BytesIO
In [11]: f = BytesIO()
In [13]: np.save(f, b)
/home/adrin/micromamba/envs/sklearn/lib/python3.12/site-packages/numpy/lib/format.py:380: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it.
  d['descr'] = dtype_to_descr(array.dtype)

We're not gonna save these anyway, but interesting. cc @seberg

Yes, we had to put this in because of h5py; you can't store metadata in the npy format.

We are not saving it, but a user may end up saving an ndarray with the metadata attached. In this case:

If there are many unique values, the pickle size of the array would increase.

For np.save, they will get a warning that they can ignore. (But it's hard to tell whether it's safe to ignore.)
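To make the two cases above concrete, here is a small sketch that measures the pickle overhead and shows that an npy round trip drops the metadata. The array sizes are arbitrary choices for illustration, not numbers from this PR.

```python
import pickle
import warnings
from io import BytesIO

import numpy as np

# 10,000 labels drawn from 1,000 distinct values (arbitrary illustrative sizes).
y = np.arange(10_000) % 1_000
meta_dtype = np.dtype(y.dtype, metadata={"unique": np.unique(y)})
y_view = y.view(dtype=meta_dtype)

# Pickle serializes the dtype metadata, so the payload grows with the unique values.
plain_size = len(pickle.dumps(y))
view_size = len(pickle.dumps(y_view))
restored = pickle.loads(pickle.dumps(y_view))

# np.save cannot store the metadata; record any warning instead of printing it.
buffer = BytesIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.save(buffer, y_view)
buffer.seek(0)
loaded = np.load(buffer)
```

After this runs, `restored.dtype.metadata` still holds the unique values, while `loaded.dtype.metadata` is `None`; the data itself is identical in both cases.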

We are not modifying the user's arrays. We create a new view, and these views are not returned.

Looking through the code now, I see that nothing is returned. Moving forward, we'll need to be careful that the output of _check_targets is not returned by a public function.

That's a very good point, and I think it's a good idea not to modify y in any function that returns it. So I've moved attach_unique to the callers of _check_targets instead, so that in the future we don't have to worry about this point.
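The pattern described here can be sketched roughly as follows. All function names are hypothetical stand-ins rather than scikit-learn's real functions: the public entry point builds the annotated view, uses it internally, and returns only derived values.

```python
import numpy as np


def _attach_unique_sketch(y):
    # Hypothetical private helper: a view of y with its unique values cached on the dtype.
    dtype_with_meta = np.dtype(y.dtype, metadata={"unique": np.unique(y)})
    return y.view(dtype=dtype_with_meta)


def _cached_unique_sketch(y):
    # Hypothetical private helper: prefer the cached values, fall back to np.unique.
    metadata = y.dtype.metadata
    if metadata is not None and "unique" in metadata:
        return metadata["unique"]
    return np.unique(y)


def label_support_sketch(y_true):
    """Hypothetical public function: count the samples per label."""
    y_true = np.asarray(y_true)
    # The annotated view lives only inside this function.
    y_view = _attach_unique_sketch(y_true)
    labels = _cached_unique_sketch(y_view)
    counts = [int(np.sum(y_view == label)) for label in labels]
    # Only plain Python objects derived from the view are returned.
    return dict(zip(labels.tolist(), counts))


user_y = np.array([0, 1, 1, 2])
support = label_support_sketch(user_y)
```

Because the view never escapes, a user who later pickles or saves `user_y` sees no metadata on it.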

thomasjpfan

Minor comment, otherwise LGTM

thomasjpfan · 2024-09-05T16:09:48Z

sklearn/utils/_unique.py

+def attach_unique(*ys, return_tuple=False):
+    """Attach unique values of ys to ys and return the results.
+
+    The result is a view of y, and the metadata (unique) is not attached to y.


Can we include a comment here that states that the output of attach_unique should never be returned from a public function?

Yep, added a comment for this.
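As an illustration of the signature quoted in this thread, here is one way a variadic helper could look. This is a sketch, not the merged implementation: the meaning given to `return_tuple` (always return a tuple, even for a single input) and the pass-through of non-ndarray inputs are assumptions.

```python
import numpy as np


def _attach_one_sketch(y):
    # Assumption: inputs that are not NumPy arrays are passed through unchanged.
    if not isinstance(y, np.ndarray):
        return y
    dtype_with_meta = np.dtype(y.dtype, metadata={"unique": np.unique(y)})
    return y.view(dtype=dtype_with_meta)


def attach_unique_sketch(*ys, return_tuple=False):
    """Attach unique values to views of ys (hypothetical sketch).

    The output carries dtype metadata and should never be returned
    from a public function.
    """
    results = tuple(_attach_one_sketch(y) for y in ys)
    # Assumed semantics: unwrap a single result unless a tuple is requested.
    if len(results) == 1 and not return_tuple:
        return results[0]
    return results


y_true = np.array([0, 1, 1])
y_pred = np.array([1, 1, 0])

single = attach_unique_sketch(y_true)
pair = attach_unique_sketch(y_true, y_pred)
wrapped = attach_unique_sketch(y_true, return_tuple=True)
```

Unwrapping the single-argument case keeps call sites such as `y = attach_unique_sketch(y)` tidy, while `return_tuple=True` gives callers that unpack a variable number of arrays a uniform return type.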

doc/whats_new/v1.6.rst

sklearn/utils/_unique.py

glemaitre · 2024-09-05T17:23:59Z

LGTM as well.

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

PERF speedup classification_report by attaching unique values to dtyp…

6047d62

…e.metadata

github-actions bot added module:metrics module:utils labels Aug 29, 2024

thomasjpfan reviewed Aug 29, 2024

adrinjalali added 3 commits August 30, 2024 15:27

API cleanup

39339f8

Merge remote-tracking branch 'upstream/main' into unique-cache

9663448

changelog

7209447

adrinjalali marked this pull request as ready for review August 30, 2024 15:10

adrinjalali added 9 commits August 31, 2024 11:29

move caching out of _check_targets

11f4295

Merge remote-tracking branch 'upstream/main' into unique-cache

24afeae

add docstrings

fd31259

Merge remote-tracking branch 'upstream/main' into unique-cache

ee83711

Merge remote-tracking branch 'upstream/main' into unique-cache

9a51cea

use np.unique

075f918

Merge remote-tracking branch 'upstream/main' into unique-cache

45fa9f5

add tests

bb9537c

add another test

9ec3865

thomasjpfan approved these changes Sep 5, 2024

add comment

8b495cf

glemaitre requested review from ogrisel and glemaitre and removed request for ogrisel September 5, 2024 17:12

glemaitre approved these changes Sep 5, 2024

Guillaume's comments

c6ced7c

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

adrinjalali enabled auto-merge (squash) September 5, 2024 19:32

adrinjalali merged commit eb29207 into scikit-learn:main Sep 5, 2024
28 checks passed

adrinjalali deleted the unique-cache branch September 6, 2024 06:45

glemaitre mentioned this pull request Oct 18, 2024

roc_auc_score: incorrect result after merging #27412 #30079

Closed
