Skip to content

ENH/API: Change flexible types to indicate resizability with elsize == -1 #8970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

eric-wieser
Copy link
Member

@eric-wieser eric-wieser commented Apr 21, 2017

This means that zero-size strings/unicode/void are now properly supported - in that S0, V0, and U0 always mean zero-size, not resizable. Follows from #6430, cc @embray

Some changes resulting from this:

  • np.dtype('S') and np.dtype('S0') now mean different things - "resizable" and "empty"

  • np.dtype(x).itemsize is None when x is: str, bytes, unicode, np.void, 'S', 'U', 'V'
    Previously, it was 0

  • np.dtype([('a', float), ('b', str)]) is now a ValueError
    Previously, it implied np.dtype([('a', float), ('b', str, 0)]), with the last field as size 0, likely masking bugs. This explicit form works without errors

  • np.empty(10, str) now has dtype S0, not S1. - removed since this change gains nothing
    We could be a little stricter here, and make it a ValueError as above

For the last two, we could take a more gradual deprecation path if necessary, keeping the existing behaviour but with a FutureWarning

It's probably worth adding a #define PyDescr_Resizable -1 somewhere, and using that everywhere - but I'm sure of the best name, nor the best file to put it in

@embray
Copy link
Contributor

embray commented Apr 21, 2017

I think this all makes sense to me. I never paid as close attention to the cases of using str and bytes in constructing dtypes since it's a bit too vague and inexplicit. But as long as that's to be supported I think these semantics make sense.

return NULL;
}
PyTuple_SET_ITEM(tuple, 0, type);
PyTuple_SET_ITEM(tuple, 1, one);
Copy link
Contributor

@juliantaylor juliantaylor Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be just Py_BuildValue('(Oi)', type, 1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did it this way for speed - perhaps premature optimization

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type needs casting to PyObject then.
The error checking above, while it won#t happen in practice may throw errors with static reference checker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for speed PyTuple_Pack is usually good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Think I might go for Py_BuildValue anyway here, since creating zeros of string type doesn't sound like something that should be a bottleneck

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if what I have here is slower than PyTuple_Pack, since that at best does a loop with these lines in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTuple_Pack is (marginally) slower than directly setting a few entries but faster than Py_BuildValue if you already have objects.
Here it would be useful to save some lines of code, this doesn't seem to be a performance relevant path to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, changed to use Py_BuildValue("Ni", (PyObject*)type, 1)

``PyArray_Descr.elsize`` is now ``-1`` for unsized flexible dtypes
------------------------------------------------------------------
Previously it was ``0`` - but that made it impossible to distinguish unsized
from sized-to-0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding and example and dtypes is better, e.g. unsized dtypes (``S``) from sized-to-zero dtypes (``S0``)

it should also mention that the python-side returns None for unsized dtypes.

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this too breaking a change? Any C code that constructs flexible dtypes manually is in for a bad time - but hopefully nothing ever should, since PyArray_DescrFromTypeObject(Py_String) etc already do the right things

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it previously returned a number now it doesn't. Likely not a big deal in practice but it should be mentioned.

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I mean from the C api perspective. But yes, I agree it at the very least needs to be in the release notes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it has the potential to break others code that assumes itemsize is always positive.
But at least it should be an easy to detect error as it will probably crash by trying to allocate 2GiB of memory.

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking code that assumed setting elsize to 0 would allow it to resize would be surprised when it no longer does

PyTuple_SET_ITEM(tuple, 0, type);
PyTuple_SET_ITEM(tuple, 1, one);

status = PyArray_DescrConverter(tuple, &type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work with overlapping input and output?

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These don't overlap - tuple contains a copy of the value in type, not a pointer to the variable &type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really? I though setitem will just increase the reference count of type by 1 not actually make a copy of it.

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SET_ITEM actually doesn't touch the reference count at all - it steals the reference.

But SetItem copies type but increases the reference count of *type. We would be free to set type = NULL after adding it to the tuple, and the tuple would still contain the right thing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but PyArray_DescrConverter would still be working inplace, its output type is the same object as its input type. It probably works fine but that should be checked.

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it isn't, because PyArray_DescrConverter does essentially *(&type) = CreateNewDescr. It never touches the original object - because normally that original object is NULL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it does this it doesn't work because creating a new type into type overwrites the input before it is read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh it does work as its &type not type never mind then

Copy link
Member Author

@eric-wieser eric-wieser Apr 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it doesn't, because the input is no longer the type variable - it's the copy of the pointer the type variable held, which is stored inside tuple. So for example

type = (PyObject *) 0x00001230
Tuple_SETITEM(tuple, 0, type);
Tuple_GETITEM(tuple, 0);  // returns (PyObject *) 0x00001230
type = NULL
Tuple_GETITEM(tuple, 0);  // still returns (PyObject *) 0x00001230 - we didn't store a pointer to type

juliantaylor
juliantaylor previously approved these changes Apr 21, 2017
Copy link
Contributor

@juliantaylor juliantaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good with some minor comments

@eric-wieser
Copy link
Member Author

Any suggestions for what to call PyDescr_Resizable, and where to put it?

@juliantaylor
Copy link
Contributor

hm it seems to crash astropys testsuite which would make it a too invasive change for now.

@eric-wieser
Copy link
Member Author

Which part causes the crash? It might be that we can just warn and keep the old behaviour in those places

@juliantaylor
Copy link
Contributor

no clue, astropys test system is not exactly easy to use

@eric-wieser
Copy link
Member Author

Comments addressed, hopefully.

@mhvk: Mind taking a look at the astropy failure, and seeing whether it can be worked around here?

@juliantaylor
Copy link
Contributor

seems to be division by 0 in numpy/core/src/multiarray/methods.c:1672

@eric-wieser
Copy link
Member Author

eric-wieser commented Apr 21, 2017

Nice catch! There are more of those elsewhere too...

@MSeifert04
Copy link
Contributor

no clue, astropys test system is not exactly easy to use

If you feel it's too complicated please open an issue on the astropy tracker or on the astropy-dev mailing list. It would be definetly helpful to get this kind of feedback - at least I would like to know what makes it "not easy to use". :)

Sorry for being off-topic here.

@juliantaylor
Copy link
Contributor

juliantaylor commented Apr 21, 2017

@MSeifert04
pytests just eats up crashes so you can't debug (probably via forking so you could via inferiors but that is annoying)
python astropy/io/ascii/tests/test_c_reader.py and submodule tests python -c "import astropy.io; astropy.io.test()" don't work

@mhvk
Copy link
Contributor

mhvk commented Apr 21, 2017

@eric-wieser - I cannot seem to install this branch:

numpy/core/src/multiarray/ctors.c: In function ‘PyArray_Zeros’:
numpy/core/src/multiarray/ctors.c:2852:41: warning: multi-character character constant [-Wmultichar]
         PyObject *tuple = Py_BuildValue('Ni', (PyObject *)type, 1);
                                         ^~~~
numpy/core/src/multiarray/ctors.c:2852:41: warning: passing argument 1 of ‘_Py_BuildValue_SizeT’ makes pointer from integer without a cast [-Wint-conversion]
In file included from /usr/include/python3.5m/Python.h:114:0,
                 from numpy/core/src/multiarray/ctors.c:2:
/usr/include/python3.5m/modsupport.h:35:24: note: expected ‘const char *’ but argument is of type ‘int’
 PyAPI_FUNC(PyObject *) _Py_BuildValue_SizeT(const char *, ...);
                        ^~~~~~~~~~~~~~~~~~~~
numpy/core/src/multiarray/ctors.c:2856:9: error: ‘status’ undeclared (first use in this function)
         status = PyArray_DescrConverter(tuple, &type);
         ^~~~~~
numpy/core/src/multiarray/ctors.c:2856:9: note: each undeclared identifier is reported only once for each function it appears in

@mhvk
Copy link
Contributor

mhvk commented Apr 21, 2017

@juliantaylor - astropy uses pytest, calling parts of tests is done differently:

python3 setup.py test -P io
python3 setup.py test -t astropy/io/ascii/tests/test_c_reader.py

@eric-wieser
Copy link
Member Author

eric-wieser commented Apr 21, 2017

@mhvk: fixed. Note that this last commit is fixing bugs that were already present, just really hard to come across before this changeset

/*
Grow PyArray_DATA(ret):
this is similar for the strategy for PyListObject, but we use
50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ...
*/
elcount = (i >> 1) + (i < 4 ? 4 : 2) + i;
if (elcount <= NPY_MAX_INTP/elsize) {
new_data = PyDataMem_RENEW(PyArray_DATA(ret), elcount * elsize);
if (!npy_mul_with_overflow_intp(elcount, elsize, &nbytes)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your mul_with_overflow are wrong, its *result, a, b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alarming that runtest.py -b on windows does not warn me about that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our travis warning check catches it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but that's pretty late in the pipeline, and it runs a bunch of other tests first

@eric-wieser eric-wieser force-pushed the dtype-S0 branch 3 times, most recently from 14e5c4e to 9f746e6 Compare April 21, 2017 15:28
@eric-wieser
Copy link
Member Author

(rebased on top of #8977)

@homu
Copy link
Contributor

homu commented Apr 23, 2017

☔ The latest upstream changes (presumably #8971) made this pull request unmergeable. Please resolve the merge conflicts.

eric-wieser added a commit to eric-wieser/numpy that referenced this pull request Nov 3, 2017
This allows us to change how flexible types with no length are represented in future, to allow zero-size dtypes (numpy#8970).
@eric-wieser
Copy link
Member Author

(Rebased on top of #9953)

@mattip
Copy link
Member

mattip commented Aug 7, 2018

additional rebase needed

This allows empty strings to be unambigiously specified.

Unsized strings continue to promote to single-character strings.
@eric-wieser eric-wieser force-pushed the dtype-S0 branch 2 times, most recently from d9b2d23 to bc393a8 Compare October 10, 2020 14:09
@eric-wieser eric-wieser added 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. 63 - C API Changes or additions to the C API. Mailing list should usually be notified. component: numpy.dtype and removed 55 - Needs work labels Oct 10, 2020
Base automatically changed from master to main March 4, 2021 02:03
@charris charris added the 52 - Inactive Pending author response label Apr 6, 2022
@charris charris closed this Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 52 - Inactive Pending author response 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. 63 - C API Changes or additions to the C API. Mailing list should usually be notified. component: numpy._core component: numpy.dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants