-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
PERF speedup classification_report by attaching unique values to dtype.metadata #29738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty cool. I did not know you can attached metadata to a dtype.
sklearn/utils/_unique.py
Outdated
unique = xp.unique_values(y) | ||
try: | ||
unique_dtype = np.dtype(y.dtype, metadata={"unique": unique}) | ||
return y.view(dtype=unique_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you pickle or np.save
this view, does the dtype metadata get serialized as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pickle:
In [1]: import numpy as np
In [2]: a = np.array([1])
In [4]: m_dtype = np.dtype(a.dtype, metadata={"unique": np.unique(a)})
In [5]: b = a.view(dtype=m_dtype)
In [7]: from pickle import loads, dumps
In [8]: loads(dumps(b)).dtype.metadata
Out[8]: mappingproxy({'unique': array([1])})
However, np.save
complains:
In [10]: from io import BytesIO
In [11]: f = BytesIO()
In [13]: np.save(f, b)
/home/adrin/micromamba/envs/sklearn/lib/python3.12/site-packages/numpy/lib/format.py:380: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it.
d['descr'] = dtype_to_descr(array.dtype)
We're not gonna save these anyway, but interesting. cc @seberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we had to put this in because of h5py, you can't store metadata in npy format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not saving it, but a user may end up saving an ndarray with the metadata attached. In this case:
- If the unique values end up to be long, then it would increase their pickle size.
- For
np.save
, they will get a warning that they can ignore. (But it's hard to tell if it's safe to ignore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not modifying user's arrays. We create a new view and these arrays are not returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code now, I see that nothing is returned. Moving forward, we'll need to be careful that the output of _check_targets
is not returned by a public function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good point, and I think it's a good idea to not modify y
in any function that returns them. So I've moved attach_unique
to the callers of _check_targets
instead, so that in the future we don't have to worry about this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, otherwise LGTM
def attach_unique(*ys, return_tuple=False): | ||
"""Attach unique values of ys to ys and return the results. | ||
|
||
The result is a view of y, and the metadata (unique) is not attached to y. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include a comment here that starts that the output of attach_unique
should never be returned from a public function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, added a comment for this.
LGTM as well. |
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Fixes #26808
Closes #26820
This is alternative to #26820 where we attach unique values to the
dtype.metadata
of a view ony
.This gets the same speedup as reported in #26820 but is a lot cleaner IMO.
WDYT @ogrisel @glemaitre @thomasjpfan
(I'm still working on speeding up
np.unique
independent of this)