Array elements are not hashable #109

Closed
StefanieSenger opened this issue Dec 11, 2024 · 14 comments

Comments

@StefanieSenger

In scikit-learn we use array values as dict keys, but trying this with array_api_strict array dtypes fails:

import array_api_strict

labels = array_api_strict.asarray([4, 5, 6])
label_to_index = {entry: idx for idx, entry in enumerate(labels)}

TypeError: unhashable type: 'Array'

Since the dtypes are all numbers or boolean, they are immutable and - I was thinking - also hashable? For numpy and torch arrays they are.

Is this a bug or on purpose for some reason?

@ev-br
Member

ev-br commented Dec 11, 2024

Is this a bug or on purpose for some reason?

Not sure about a reason, I'd guess it's because data-apis/array-api#582 did not arrive at an actionable conclusion.

For dtypes specifically, a workaround is to use str(dtype): this works today even if not mandated by the spec (so would be nice to clarify in the standard itself).

Your example though: are you trying to hash arrays themselves? this does not sound healthy since arrays are mutable.

@StefanieSenger
Author

StefanieSenger commented Dec 11, 2024

Hi @ev-br

Your example though: are you trying to hash arrays themselves? this does not sound healthy since arrays are mutable.

Actually, this code is not trying to hash arrays, only the values. This is existing code from confusion_matrix and it works fine with numpy.

Not sure why Python raises unhashable type: 'Array'. Maybe because the dtypes from array_api_strict are type Array?

a workaround is to use str(dtype)

This raises the same error. (Edit: No, it doesn't; I had just gotten the same error one line below and hadn't looked properly.) But we're looking to narrowly adhere to the array_api_strict constraints anyway. This code doesn't raise on any of the other array libraries we support.

@ev-br
Member

ev-br commented Dec 11, 2024

{str(xp.float32): 1} surely works, but your code does something else: it relies on indexing returning scalars not arrays.

Changing a dict into a list:

In [9]: labels = xp.asarray([4, 5, 6])
   ...: [(entry, idx) for idx, entry in enumerate(labels)]
Out[9]: 
[(Array(4, dtype=array_api_strict.int64), 0),
 (Array(5, dtype=array_api_strict.int64), 1),
 (Array(6, dtype=array_api_strict.int64), 2)]

so your code tries to hash 0D arrays, and that fails. It does not fail with numpy only because indexing returns "array scalars", and those do not exist anywhere else. For instance, your snippet fails on CuPy just the same:

In [7]: import numpy as np

In [8]: type(np.arange(3)[0])
Out[8]: numpy.int64

In [9]: import cupy

In [10]: type(cupy.arange(3)[0])
Out[10]: cupy.ndarray

In [11]: import array_api_strict as xp

In [12]: type(xp.arange(3)[0])
Out[12]: array_api_strict._array_object.Array

In [13]: labels = cupy.asarray([1, 2, 3])

In [14]: {entry: idx for idx, entry in enumerate(labels)}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 {entry: idx for idx, entry in enumerate(labels)}

Cell In[14], line 1, in <dictcomp>(.0)
----> 1 {entry: idx for idx, entry in enumerate(labels)}

TypeError: unhashable type: 'ndarray'

So I suppose the solution is to call int/float explicitly:

In [15]: {float(entry): idx for idx, entry in enumerate(labels)}
Out[15]: {1.0: 0, 2.0: 1, 3.0: 2}

@ev-br
Member

ev-br commented Dec 11, 2024

And if it feels that array_api_strict is not reasonable (it does sometimes!), the behavior is explicitly mandated by the spec:
the very last note of https://data-apis.org/array-api/latest/API_specification/indexing.html#return-values is explicitly calling out single-element indexing.

All that said, your OP on being able to actually use dtypes as dictionary keys makes total sense to me, so is something to consider evolving the standard for.

@StefanieSenger
Author

StefanieSenger commented Dec 12, 2024

Hi @ev-br, thanks for your timely response and your effort.

So, I understand that indexing an array_api_strict array (and other array types) would return another array:

labels = array_api_strict.asarray([4, 5, 6])
print(labels[0])
print(type(labels[0]))
Out:
4
<class 'array_api_strict._array_object.Array'>

... even if we're only indexing a single element.

I didn't expect that, but I am now a bit clearer on this, thank you.

@ev-br
Member

ev-br commented Dec 12, 2024

To circle back to the question in the title: can we use dtypes as dict keys? This was discussed at length in the Array API meeting earlier today, and the conclusion is that it's surprisingly hard and is not going to happen in 2024 at least (see the meeting notes at https://hackmd.io/zn5bvdZTQIeJmb3RW1B-8g).

So pragmatically, as of 12.12.2024 there are two options:

  • Use str(dtype) as keys. This works, but 1) it is not mandated by the spec, so a __str__ dunder may or may not exist in a given array library, and 2) there is no guarantee that two different libraries both use, say, just "float32". So if you go this route, you may want to use tuple keys, (lib_name, str(dtype)).
  • DIY a level of indirection with your own typecode mapping. In pseudocode:
_typecodes = {xp.bool: 1, xp.int8: 2, xp.uint8: 3, ....}

and use these typecodes (the values of the _typecodes dict above) as dict keys.
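
The tuple-key variant of the first option could be sketched roughly as follows. This is only an illustration under stated assumptions: dtype_key, FakeDType and fake_lib are hypothetical stand-ins (not part of any array library) so that the snippet runs without one installed, and it assumes that str(dtype) yields a usable name, which the spec does not guarantee.

```python
from types import SimpleNamespace


class FakeDType:
    """Hypothetical stand-in for a library's dtype object."""

    def __init__(self, name):
        self._name = name

    def __str__(self):
        return self._name


# Hypothetical stand-in for an array namespace such as `xp`.
fake_lib = SimpleNamespace(
    __name__="fake_lib",
    float32=FakeDType("float32"),
    int8=FakeDType("int8"),
)


def dtype_key(xp, dtype):
    """Build a (library name, dtype string) tuple to use as a dict key."""
    return (xp.__name__, str(dtype))


# A small lookup table keyed on (lib_name, str(dtype)).
kernels = {
    dtype_key(fake_lib, fake_lib.float32): "float32 kernel",
    dtype_key(fake_lib, fake_lib.int8): "int8 kernel",
}
print(kernels[("fake_lib", "float32")])
```

Including the library name in the key keeps entries from two libraries apart even if both happen to print their dtype as "float32".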

@seberg
Contributor

seberg commented Dec 12, 2024

As @betatim pointed out, we talked about hashing arr.dtype instances, but it seems the issue is actually about hashing result values, and the only way to do that is by converting them to Python scalars.

If you want another way to do it, you would have to invent some completely new API. NumPy will (often) return scalars and provides APIs to ensure you get scalars, but there are no scalars here and arrays are mutable so they cannot be hashed.

If anything, I think this issue may be asking for a .item() attribute like in NumPy?!

In hindsight, I have to say that I am not sure how much that part even matters. Do you really write codes = {xp.bool: 1} after an xp = get_array_module(arr) block?

@ev-br
Member

ev-br commented Dec 12, 2024

Yes, the title and the actual issue here are different :-).

Hashing arrays is clearly out of the question (says he who was once storing array.to_string() into an sqlite database). Making a dict keyed on dtypes is still potentially useful. Dispatching to a collection of type-specific kernels, how else would you do it?

@seberg
Contributor

seberg commented Dec 12, 2024

collection of type-specific kernels

Yeah, but do you build one for each xp library you see on the fly? Otherwise you need stable hashes across implementations. Which indeed would be a use-case for a stable/canonical representation.

@ev-br
Member

ev-br commented Dec 13, 2024

Exactly. Which is why in the meeting I was suggesting to pivot to just mandating the existence of dtype.__str__ or dtype.name attribute. Then users can use str(xp) + str(dtype) as dict keys and that'd cover all use cases I can think of.

@betatim changed the title from "array_api_strict dtypes not hashable?" to "Array elements are not hashable" on Dec 17, 2024
@betatim
Member

betatim commented Dec 17, 2024

I've changed the title to match the top comment. It was too confusing for my little brain :D

If we want to discuss hashing the dtype, maybe let's make a new issue?

@ev-br
Member

ev-br commented Dec 17, 2024

Actually, I should have checked about dtypes themselves: they are hashable today.

In [1]: import array_api_strict as xp

In [2]: {xp.int8: 1, xp.int16: 2}
Out[2]: {array_api_strict.int8: 1, array_api_strict.int16: 2}

So I guess we can close this one since there's hardly a chance that arrays themselves will get hashable.

@betatim
Member

betatim commented Dec 17, 2024

To summarise the original issue as well:

An array-api-strict array isn't hashable because it is mutable, like a list in Python. There is nothing like a tuple in array-api-strict. The entry in the code below isn't just a single value (a scalar) but a zero-dimensional array:

labels = array_api_strict.asarray([4, 5, 6])
label_to_index = {entry: idx for idx, entry in enumerate(labels)}

The way to make it work is to explicitly convert the entry to a Python scalar. You probably need a little helper for that to check if you want to int(entry) or float(entry). I thought maybe we could use entry[0] but that doesn't work.
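
A minimal sketch of such a helper, in case it is useful. build_label_index and the ZeroD class are hypothetical names, not part of array-api-strict or scikit-learn. ZeroD only imitates a 0-D array (convertible via int() or float(), but unhashable) so that the snippet runs without any array library installed. The caller picks the converter explicitly instead of the helper guessing it from the dtype.

```python
from typing import Callable, Dict, Hashable, Iterable


class ZeroD:
    """Hypothetical stand-in for a 0-D array: convertible, but unhashable."""

    __hash__ = None  # mimics a mutable array type: hash(ZeroD(...)) raises TypeError

    def __init__(self, value):
        self._value = value

    def __int__(self):
        return int(self._value)

    def __float__(self):
        return float(self._value)


def build_label_index(
    labels: Iterable, convert: Callable[[object], Hashable] = int
) -> Dict[Hashable, int]:
    """Map each label to its position, converting entries to Python scalars first."""
    return {convert(entry): idx for idx, entry in enumerate(labels)}


# Assumption: this list stands in for iterating over a 1-D array of labels.
labels = [ZeroD(4), ZeroD(5), ZeroD(6)]
label_to_index = build_label_index(labels, convert=int)
print(label_to_index)  # {4: 0, 5: 1, 6: 2}
```

With a real array of integer labels the call would presumably stay the same, and convert=float would be the choice for floating-point labels.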

@betatim closed this as completed on Dec 17, 2024
@StefanieSenger
Author

Sorry for the confusion. I had expected to get an element back when indexing a single element of an array_api_strict array, like in numpy; now I see that returning a 0-dim array instead makes even more sense. And from your discussion it also seems that there's a difference between an element and a dtype, which I had thought were the same all along.

My problem has been resolved for over a week anyway. Thanks everyone for your help.
