Skip to content

ENH: structured datatype safety checks #5548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 5, 2015

Conversation

ahaldane
Copy link
Member

@ahaldane ahaldane commented Feb 9, 2015

This PR tries to solve some issues related structured datatypes, especially view/getfield/setfield. Previously views of structured arrays containing objects were completely disabled. This commit adds more lenient check for whether an object-array view is allowed, and adds similar checks to getfield/setfield

(This PR has been edited around 3/4/15. I originally planned other changes which I will submit as separate PRs)

I believe this PR solves issues #2599, #3256, #3286, #2346, #3253. Issue #5081 is also now partially fixed, but requires further thought (please take a look at the example there). It also fixes the new scipy issue scipy/scipy#4589.

Only one test needed modification to pass: test_objectarray_setfield checks that setfield with objects is disallowed, and expects a RuntimeError but I give it TypeError.

Summary

Previously views of structured arrays containing objects were completely disabled, which was a problem since views are needed in a few important cases such as when indexing multiple fields of a structured array, or in __getitem__ in record arrays.

With help from @jaimefrio I now have functions that carefully check whether a view into array would interpret non-Object memory as an object, or vice versa, and allow the view if there is no problem. Big thanks to Jaime for help, code, and explanations on this. I think the checks are compatible with his PR #5508. I check that 1) No memory that is not an object will be interpreted as a object, 2) No memory containing an object will be interpreted as an arbitrary type. Both cases can cause segfaults.

I also added similar checks to PyArray_GetField and PyArray_SetField. Previsouly, the user could view arbitrary data as a python Object, and it was fairly easy to make cases where a segfault would happen.

Note, however, that there are still "unsafe" operations (wrt Objects) in numpy which I hope to examine in future PRs, when I have time.

How to keep track of objects

There is a design choice here worth discussing. As part of the check for objects when taking a view, we need to go though the new dtypes's fields and make sure none of them overlap with memory containing an object in the old dtype (unless the field is also an object). Consider this potential view:

>>> basearr = np.array([(1,2,3),(4,5,6)], 
...                    dtype=[('f0', 'i8'), ('f1', 'O'), ('f2', 'i8')])
>>> varr = basearr.view({'names': ['f0','f2'], 
...                      'formats': ['i8','i8'], 
...                      'offsets': [0,16]})
>>> v2arr = varr.view({'names': ['f0'], 
...                    'formats': ['i8'], 
...                    'offsets': [8]})  

The problem would be that writing to v2arr will overwrite the Object in basearr. We need to prevent this, but the challenge is that varr doesn't "know" that there is an object at position 8. This raises the question: How do we figure out where the objects are in memory to prevent them from being viewed?

The strategy I have taken in this PR is to only allow views into fields "known" to the parent array. In the example above, the v2arr view would be disallowed, and only views of varr into offsets 0 and 16 would be allowed. (specifically, views into bytes 0-8 and 16-24). The downside is that many views are no longer "reversible", since once you take a view that "forgets" what was at an offset, you can never view the memory at that offset again in further views. While this constraint only really has to apply to object arrays (as tested with NPY_ITEM_HASOBJECT, which I think is inherited in all views) I think for consistency it should apply to all structured arrays.. That is, all 'partial' views (views that have less fields than the original array) would be irreversible.

An alternate strategy is to too look up all the object positions in the "base" object of the array, varr.base in this case. I assume that the "base" array (if it is an array) contains the list of all object offsets, which I use to check for view overlaps. I implemented this in a separate branch (partial version here), but it was difficult because arr.base does not always lead to an array object. For example it may be a buffer, in which case I disallow views containing object types, but that's not a big problem because users aren't allowed to create object arrays from buffers anyway. Another problem is (void) scalar structured types, as they do not list their "base" object even after being converted to a 0-d array. To fix this I made void scalar types' base attribute point to the original array. However, there are still some cases where I cannot get to the original array - for example nditer buffers the source array, which causes about 5 unit tests to fail, which I have not solved yet (but I could try). The advantage would be that views are reversible.

The irreversible view strategy seems better because it touches less parts, has fewer edge cases, and seems to me less likely to go wrong. It also works right now. I also have not imagined any scenarios where inability to reverse a view would be a big problem, and no related unit tests fail.

Note that even if the 'base array' strategy is not good, perhaps it's still a good idea to make void scalars "base" attribute point to the original array, as another PR? I admit I am not totally clear on the true purposes of "base" objects.

In this PR I also made some small tweaks to recarrays: I don't think they need to implement their own view function as a result of my last PRs.

@ahaldane ahaldane force-pushed the view_objects_owned branch 6 times, most recently from dd549f5 to 2322a79 Compare February 9, 2015 17:10
@ahaldane ahaldane force-pushed the view_objects_owned branch from 2322a79 to ab604c4 Compare March 4, 2015 05:29
@ahaldane ahaldane changed the title WIP: structured datatype safety checks ENH: structured datatype safety checks Mar 4, 2015
@ahaldane ahaldane force-pushed the view_objects_owned branch 3 times, most recently from 8c5736f to 73e4afe Compare March 5, 2015 19:38
@ahaldane
Copy link
Member Author

ahaldane commented Mar 5, 2015

All right, I think this is done. I'd appreciate any comments or suggestions!

count = 1
for dim in dtype.shape:
count *= dim
fields.extend((typ,off + dtype.base.itemsize*j)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after comma

@jaimefrio
Copy link
Member

Thanks for putting this together, @ahaldane. There has obviously been a lot of hard thinking going on!

I will go on reviewing later, just two things for now, aside from all the style nitpicks:

  1. It would be nice to add tests for the new functionality in _internal.py, without relying on other stuff. For instance, we will not be able to check the tiling of views unless we merge something like WIP: make views of different type checks less strict #5508, but it would be nice to know that it works properly beforehand.
  2. I also think that the overlap checks could be made smarter, probably by sorting the fields by offset and fusing adjacent non-objects together, then checking them in order. That can be left for another PR, especially if we add proper tests for those functions now.
  3. I'm not sure if you already had, but if you haven't, it would be a good idea to ping the list on the specific topic of "should views into structured arrays be reversible?"

@ahaldane
Copy link
Member Author

Thanks for looking over it @jaimefrio.

I think I've fixed the style issues, and I also made a few other small changes. I made it a second commit for clarity on what's changing.

I'll look into adding some extra tests, and I agree it might be possible to make the overlap checks smarter. I'll also ping the list.

/* check that we are not reinterpreting memory containing Objects */
_numpy_internal = PyImport_ImportModule("numpy.core._internal");
if (_numpy_internal == NULL) {
Py_DECREF(newtype);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only indent more than 4 spaces if the if condition spans more than 1 line.

@ahaldane ahaldane force-pushed the view_objects_owned branch from 00a31fe to 432ddcd Compare June 5, 2015 04:56
@ahaldane
Copy link
Member Author

ahaldane commented Jun 5, 2015

Static allocation is now in, using #5940. It looks like I don't need to set the error string if npy_cache_pyfunc fails, since it will already be set. If that's fine, then I think this is done.

@ahaldane ahaldane force-pushed the view_objects_owned branch 2 times, most recently from f3903e3 to 958bf63 Compare June 5, 2015 05:05
Previously views of structured arrays containing objects were completely
disabled.  This commit adds more lenient check for whether an object-array
view is allowed, and adds similar checks to getfield/setfield

Fixes numpy#2346. Fixes numpy#3256. Fixes numpy#2599. Fixes numpy#3253. Fixes numpy#3286.
Fixes numpy#5762.
@ahaldane ahaldane force-pushed the view_objects_owned branch from 958bf63 to e2cd6fa Compare June 5, 2015 05:14
@charris
Copy link
Member

charris commented Jun 5, 2015

@jaimefrio My concerns have been taken care of.

@jaimefrio
Copy link
Member

It shall then be merged, @charris ;-)

jaimefrio added a commit that referenced this pull request Jun 5, 2015
ENH: structured datatype safety checks
@jaimefrio jaimefrio merged commit a99c775 into numpy:master Jun 5, 2015
@jaimefrio
Copy link
Member

Thanks, @ahaldane!

@ahaldane
Copy link
Member Author

ahaldane commented Jun 5, 2015

Thanks to both of you too!

ahaldane added a commit to ahaldane/numpy that referenced this pull request Oct 27, 2015
Because of slowdowns caused by the view safety checks introduced in
 numpy#5548 they are removed here for 1.10. The plan is to reintroduce
a better version of the checks in 1.11.
charris pushed a commit to charris/numpy that referenced this pull request Oct 27, 2015
Because of slowdowns caused by the view safety checks introduced in
 numpy#5548 they are removed here for 1.10. The plan is to reintroduce
a better version of the checks in 1.11.
jaimefrio pushed a commit to jaimefrio/numpy that referenced this pull request Mar 22, 2016
Because of slowdowns caused by the view safety checks introduced in
 numpy#5548 they are removed here for 1.10. The plan is to reintroduce
a better version of the checks in 1.11.
@ahaldane ahaldane deleted the view_objects_owned branch January 17, 2018 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants