Skip to content

MAINT: Refactor of core/_type_aliases.py #24411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

mtsokol
Copy link
Member

@mtsokol mtsokol commented Aug 13, 2023

Hi @rgommers @seberg @ngoldbaum,

Here I'm sharing a proposition of numpy/core/_type_aliases.py file refactor.
While working on removing other scalar aliases I had some difficulties understanding how allTypes/sctypeDict dictionaries are being created. For example, I noticed that the float alias is originally present in sctypeDict, then it's removed and then added again.

I decided to try to refactor this file to decompose the creation of these dictionaries into well-defined stages.
I think the purpose of this file is straightforward: Read types from the compiled multiarray module, and expose them as Python dictionaries, while mapping C naming conventions to Python conventions (e.g. interpret float as double instead of single).

I identified which "group" of entries is present in each dictionary (could it be modified for NumPy 2.0?):

allTypes sctypeDict
words (e.g. "float", "cdouble", "int_")
words+bits (e.g. "int16", "complex64")
symbols (e.g. "p", "L")
symbols+bytes (e.g. "c8", "b1")
abstracts (e.g. "inexact", "integer")
numbers (e.g. 1, 2, 3)
aliases (e.g. "complex_", "longfloat")
extra/other aliases (e.g. "bool")
...and there are just a few custom cases for both

Additionally, here's a decomposition of words and words+bits into three columns (/ symbol means that it may vary depending on 32 or 64-bit arch):

Canonical Python API name Python API "C-like" name Actual C type
bool
int8 byte char
uint8 ubyte unsigned char
int16 short short
uint16 ushort unsigned short
int32 intc int
uint32 uintc unsigned int
intp long
uintp unsigned long
int64 longlong longlong
uint64 ulonglong unsigned longlong
float16 half
float32 single float
float64 double double
float128 longdouble long double
complex64 csingle complex float
complex128 cdouble complex double
complex256 clongdouble complex long double
object_
bytes_
str_
void void
datetime64
timedelta64

In the refactored file I build each of these groups to then join them into final dicts. I do it in a loop in def _build_dicts(), where sized int & uint aliases must take under consideration a "priority" of types - in case at least two of them are same bitsize.
After that I rename and add aliases to the created groups. In the last stage I clean entries that aren't present in the original implementation (these are custom cases).

The last, and separate, stage creates sctypes which I copied, as the implementation was clear to me.

Important to mention, this modifies a crucial piece of code that could be tricky to debug (e.g. canonical int names are aliased incorrectly on some architecture (I tested it locally only on macbook with Intel)).
I renamed the old implementation to _expired_type_aliases.py and added a test test_new_vs_expired_type_aliases that checks if dictionaries created by the new implementation are exactly the same as done by the previous code. This should flag in the CI if the new implementation fails to mimic the previous one on a specific architecture. (After some time, the original impl could be completely removed)

Please share your feedback!

@rgommers
Copy link
Member

That's a very useful analysis and clear table, thanks @mtsokol. I think what may be useful for added insight here is explicitly listing all the names for the first two rows. I'd name those "canonical names for C types" and "canonical names in Python". There will be some adjustments here already, e.g. the actual C name float is what I'd expect us to call np.single. Something like (add complex types, str/object/intp/etc. too):

Canonical C name Canonical Python API name
half float16
single float32
double float64
longdouble n/a
byte int8
...
ubyte uint8
...
n/a bool

allTypes doesn't seem to be public, so I'm not sure we should focus on it in the first instance.

@mtsokol
Copy link
Member Author

mtsokol commented Aug 16, 2023

Sure! I included such table and I filled it with values coming from sctypeDict. Both columns are present in the dictionary (so all names can be used in python code) but they should clarify relationships between specific names.

allTypes is here to help understand how proposed implementation works (the refactored file builds allTypes and sctypeDict at the same time, as original version).

@rgommers
Copy link
Member

Sorry, maybe I should have suggested 3 columns. What I meant was the Python "C-like" and "Pythonic" names. The actual type in C could be useful too. So for this one for example:

int (only in C) | int32 / intc
-- | --

in Python code we'd prefer if people wrote np.int32. And the "C-like" name we also want to keep is intc. And any other aliases are on the list for removal.

The long/int64 one doesn't look quite correct. It's longlong in Python I believe as the "C-like" alias.

intp/uintp should be separate entries probably.

@mtsokol
Copy link
Member Author

mtsokol commented Aug 17, 2023

I updated the 2nd table with three columns ("Canonical Python", "C-like python", "compatible C type") - I wanted to clarify some rows.

The long/int64 one doesn't look quite correct. It's longlong in Python I believe as the "C-like" alias.

As int / long / longlong Python counterparts may vary depending on architecture, I checked mappings on my machine (Darwin x86_64) and Emscripten wasm32, and added all combinations that I found with / in these rows. WDYT?

intp/uintp should be separate entries probably.

Sure! Moved to separate rows.

@rgommers
Copy link
Member

As int / long / longlong Python counterparts may vary depending on architecture, I checked mappings on my machine (Darwin x86_64) and Emscripten wasm32, and added all combinations that I found with / in these rows. WDYT?

I think I'd write it like this:

Canonical Python API name Python API "C-like" name Actual C type
- intc int
- uintc unsigned int
intp ? long

So for C types that vary in size are exposed in the Python API, we don't have a canonical sized alias, and there should be one preferred name only. So C int becomes intc (or int_, but I think intc would be preferred), and C long becomes intp. I admit I don't know why we don't have a np.long or np.long_.

@mtsokol
Copy link
Member Author

mtsokol commented Aug 18, 2023

I think I'd write it like this:
Canonical Python API name Python API "C-like" name Actual C type
intc int
uintc unsigned int
intp ? long

So for C types that vary in size are exposed in the Python API, we don't have a canonical sized alias, and there should be one preferred name only. So C int becomes intc (or int_, but I think intc would be preferred), and C long becomes intp. I admit I don't know why we don't have a np.long or np.long_.

I updated the table so there are no multiple options depending on architecture (thus int_ is not in the table).

@rgommers
Copy link
Member

Getting there! Still some more to do for integers I think. What I'd like to get to is for each row to be unique. C int / long / longlong are separate types, so they should never be on the same row. I think the situation is:

@mtsokol
Copy link
Member Author

mtsokol commented Aug 18, 2023

Getting there! Still some more to do for integers I think. What I'd like to get to is for each row to be unique. C int / long / longlong are separate types, so they should never be on the same row. I think the situation is: ...

My previous update accidentally didn't get saved - I updated the table one more time. It should finally have unique values in each cell.

@rgommers
Copy link
Member

This looks quite good to me now, thanks Mateusz. Last thoughts for now:

  • we need to add bool as the canonical Python name, and keep bool_ as an alias because np.bool does not exist in 1.25.x/1.26.x
  • the C types for complex types can be added: complex float, complex double, and complex long double,

@mtsokol mtsokol force-pushed the type-aliases-refactor branch from 1a495ab to 109b1aa Compare August 29, 2023 14:59
@mtsokol mtsokol self-assigned this Aug 29, 2023
@mtsokol mtsokol force-pushed the type-aliases-refactor branch from 6eae269 to bdcb9ac Compare August 29, 2023 16:06
@mtsokol
Copy link
Member Author

mtsokol commented Sep 6, 2023

Closing as #24651 has been opened to hide outdated comments.

@mtsokol mtsokol closed this Sep 6, 2023
@mtsokol mtsokol deleted the type-aliases-refactor branch April 3, 2024 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants