Skip to content

MAINT: Refactor of numpy/core/_type_aliases.py #24651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

mtsokol
Copy link
Member

@mtsokol mtsokol commented Sep 6, 2023

Hi @seberg @rgommers,

I'm starting here a new PR for reviewing as the discussion in the original one concluded.


Here I'm sharing a proposition of numpy/core/_type_aliases.py file refactor.
While working on removing other scalar aliases I had some difficulties understanding how allTypes/sctypeDict dictionaries are being created. For example, I noticed that the float alias is originally present in sctypeDict, then it's removed and then added again.

Also, as NEP 52 suggest, float96 and float128 could be removed (extended precision) and it would also cover modifying _type_aliases.py. This refactor makes it easier to introduce new changes to the available aliases list.

I decided to try to refactor this file to decompose the creation of these dictionaries into well-defined stages.
I think the purpose of this file is straightforward: Read types from the compiled multiarray module, and expose them as Python dictionaries, while mapping C naming conventions to Python conventions (e.g. interpret float as double instead of single).

I identified which "group" of entries is present in each dictionary (could it be modified for NumPy 2.0?):

allTypes sctypeDict
words (e.g. "float", "cdouble", "int_")
words+bits (e.g. "int16", "complex64")
symbols (e.g. "p", "L")
symbols+bytes (e.g. "c8", "b1")
abstracts (e.g. "inexact", "integer")
numbers (e.g. 1, 2, 3)
aliases (e.g. "complex_", "longfloat")
extra/other aliases (e.g. "bool")
...and there are just a few custom cases for both

Additionally, here's a decomposition of words and words+bits into three columns:

Canonical Python API name Python API "C-like" name Actual C type
bool
int8 byte char
uint8 ubyte unsigned char
int16 short short
uint16 ushort unsigned short
int32 intc int
uint32 uintc unsigned int
intp intptr_t
uintp uintptr_t
int32 / int64 long long
uint32 / uint64 ulong unsigned long
longlong longlong
ulonglong unsigned longlong
float16 half
float32 single float
float64 double double
float128 longdouble long double
complex64 csingle complex float
complex128 cdouble complex double
complex256 clongdouble complex long double
object_
bytes_
str_
void void
datetime64
timedelta64

In the refactored file I build each of these groups to then join them into final dicts. I do it in a loop in def _build_dicts(), where sized int & uint aliases must take under consideration a "priority" of types - in case at least two of them are same bitsize.
After that I rename and add aliases to the created groups. In the last stage I clean entries that aren't present in the original implementation (these are custom cases).

The last, and separate, stage creates sctypes which I copied, as the implementation was clear to me.

Important to mention, this modifies a crucial piece of code that could be tricky to debug (e.g. canonical int names are aliased incorrectly on some architecture (I tested it locally only on macbook with Intel)).
I renamed the old implementation to _expired_type_aliases.py and added a test test_new_vs_expired_type_aliases that checks if dictionaries created by the new implementation are exactly the same as done by the previous code. This should flag in the CI if the new implementation fails to mimic the previous one on a specific architecture. (After some time, the original impl could be completely removed)

Please share your feedback!

@seberg
Copy link
Member

seberg commented Sep 7, 2023

We talked a bit about this and I tried to remove some C-side stuff and change the exposure a bit via: https://github.com/seberg/numpy/tree/typeinfo-remove (it could probably be simplified more, but at least 300 lines of weird C-code gone 🎉). That also removes the redefinition of the "integer priority" list in Python.

I am a bit unsure about float128. On the one hand, I hate it and want everyone to use longdouble. OTOH, I think it may be used (unnecessarily) quite a lot.
So, I am wondering if we can replace float128/float96 (when they exist) with a FutureWarning("This is a terrible alias for longdouble and may go away or be redefined to proper 128bit precision (which this does not have normally). We may also decide to remove longdouble support from NumPy at some future point.")?

@mtsokol
Copy link
Member Author

mtsokol commented Sep 9, 2023

We talked a bit about this and I tried to remove some C-side stuff and change the exposure a bit via: https://github.com/seberg/numpy/tree/typeinfo-remove (it could probably be simplified more, but at least 300 lines of weird C-code gone 🎉). That also removes the redefinition of the "integer priority" list in Python.

@seberg Thank you! I built typeinfo-remove branch locally and one thing that I noticed is that the typeinfo after refactor returns np.dtypes.Int64DType or numpy.dtypes.BoolDType as dict values instead of np.int64 and np.bool_ (it was type attribute in typeinfo). Should I transform them?

@seberg
Copy link
Member

seberg commented Sep 9, 2023

It should return the dtype, you are fetching the type(dtype) from it. Either way should work in practice via .type although no need for type(dtype) here. You can change it to return the scalar type also.

@eric-wieser
Copy link
Member

In that table, I'd argue intp and uintp should be intptr_t and uintptr_t, not long and unsigned long. As written that table is wrong on windows.

@eric-wieser
Copy link
Member

eric-wieser commented Sep 23, 2023

In fact, the table is also wrong on Linux where int64 is the C long type, not the C long long type. You can see that at https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.int_, which says that on 64-bit linux, int_ is the implementation of int64.

@mtsokol
Copy link
Member Author

mtsokol commented Sep 25, 2023

@eric-wieser thank you for your feedback!
I updated the table - would you have a spare minute to say if it's more accurate?

@mtsokol mtsokol mentioned this pull request Oct 5, 2023
@ngoldbaum
Copy link
Member

@mtsokol can you clear the merge conflict?

I believe all comments have been addressed so once this is mergeable I'd like to get this merged this week along with #24867. Please comment if you still have concerns.

@mtsokol mtsokol changed the title MAINT: Refactor of numpy/core/_type_aliases.py ~~MAINT: Refactor of numpy/core/_type_aliases.py~~ Oct 5, 2023
@mtsokol mtsokol changed the title ~~MAINT: Refactor of numpy/core/_type_aliases.py~~ MAINT: Refactor of numpy/core/_type_aliases.py Oct 5, 2023
@mtsokol mtsokol closed this Oct 5, 2023
@mtsokol mtsokol deleted the type-aliases-refactor-new-branch branch October 5, 2023 14:49
@mtsokol
Copy link
Member Author

mtsokol commented Oct 5, 2023

@mtsokol can you clear the merge conflict?

I believe all comments have been addressed so once this is mergeable I'd like to get this merged this week along with #24867. Please comment if you still have concerns.

Sorry for confusion!
This was the old one - I created new PR that has Sebastian's changes: #24679

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants