Skip to content

PERF speedup classification_report by attaching unique values to dtype.metadata #29738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Sep 5, 2024

Conversation

adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Aug 29, 2024

Fixes #26808
Closes #26820

This is alternative to #26820 where we attach unique values to the dtype.metadata of a view on y.

This gets the same speedup as reported in #26820 but is a lot cleaner IMO.

WDYT @ogrisel @glemaitre @thomasjpfan

(I'm still working on speeding up np.unique independent of this)

Copy link

github-actions bot commented Aug 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c6ced7c. Link to the linter CI: here

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool. I did not know you can attached metadata to a dtype.

unique = xp.unique_values(y)
try:
unique_dtype = np.dtype(y.dtype, metadata={"unique": unique})
return y.view(dtype=unique_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you pickle or np.save this view, does the dtype metadata get serialized as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pickle:

In [1]: import numpy as np

In [2]: a = np.array([1])

In [4]: m_dtype = np.dtype(a.dtype, metadata={"unique": np.unique(a)})

In [5]: b = a.view(dtype=m_dtype)

In [7]: from pickle import loads, dumps

In [8]: loads(dumps(b)).dtype.metadata
Out[8]: mappingproxy({'unique': array([1])})

However, np.save complains:

In [10]: from io import BytesIO

In [11]: f = BytesIO()

In [13]: np.save(f, b)
/home/adrin/micromamba/envs/sklearn/lib/python3.12/site-packages/numpy/lib/format.py:380: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it.
  d['descr'] = dtype_to_descr(array.dtype)

We're not gonna save these anyway, but interesting. cc @seberg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had to put this in because of h5py, you can't store metadata in npy format.

Copy link
Member

@thomasjpfan thomasjpfan Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not saving it, but a user may end up saving an ndarray with the metadata attached. In this case:

  • If the unique values end up to be long, then it would increase their pickle size.
  • For np.save, they will get a warning that they can ignore. (But it's hard to tell if it's safe to ignore)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not modifying user's arrays. We create a new view and these arrays are not returned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the code now, I see that nothing is returned. Moving forward, we'll need to be careful that the output of _check_targets is not returned by a public function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point, and I think it's a good idea to not modify y in any function that returns them. So I've moved attach_unique to the callers of _check_targets instead, so that in the future we don't have to worry about this point.

@adrinjalali adrinjalali marked this pull request as ready for review August 30, 2024 15:10
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, otherwise LGTM

def attach_unique(*ys, return_tuple=False):
"""Attach unique values of ys to ys and return the results.

The result is a view of y, and the metadata (unique) is not attached to y.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include a comment here that starts that the output of attach_unique should never be returned from a public function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, added a comment for this.

@glemaitre glemaitre requested review from ogrisel and glemaitre and removed request for ogrisel September 5, 2024 17:12
@glemaitre
Copy link
Member

LGTM as well.

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
@adrinjalali adrinjalali enabled auto-merge (squash) September 5, 2024 19:32
@adrinjalali adrinjalali merged commit eb29207 into scikit-learn:main Sep 5, 2024
28 checks passed
@adrinjalali adrinjalali deleted the unique-cache branch September 6, 2024 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up classification_report
4 participants