Skip to content

BUG: automatically convert recarray dtype to np.record #5943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 19, 2015

Conversation

ahaldane
Copy link
Member

@ahaldane ahaldane commented Jun 5, 2015

As discussed in #3581, this PR automatically converts the dtype of np.recarrays to np.record.

To get unit tests to pass I also had to add a setattr to the MaskedArray class: MaskedArrays did not support assignment to dtype attribute, as demonstrated in the following examples:

>>> a = np.zeros(4, dtype='f4,i4')
>>> m = np.ma.array(a)
>>> m.dtype = np.dtype('f8')
>>> m   # Exception
>>> a.view(dtype='f8', type=np.ma.MaskedArray) #Exception

MaskedArray.setattr now catches assignments to dtype and updates the mask accordingly.

Possible issues with this PR, if anyone has any ideas:

  1. Viewing as a.view(np.recarray) now changes the dtype. It's not totally clear to me this won't break anything. (Example: It broke maskedarrays above, but due to a maskedaray bug).

  2. Another example of the last point: Views are not reversible.

>>> a = np.zeros(4, 'f4,i4')
>>> b = a.view(np.recarray).view(np.ndarray)
>>> b.dtype
dtype((numpy.record, [('f0', '<f4'), ('f1', '<i4')]))
  1. Note (not really a problem): Attempt to create record arrays of non-structured type does not set the dtype to np.record (since it is not possible to do so). This is not new to this PR.
>>> a = np.zeros(5, 'f8')
>>> np.rec.array(a)
array([ 0.,  0.,  0.,  0.], 
      dtype=float64).view(numpy.recarray)

Benefits:

I still need to finish writing unit tests

@ahaldane ahaldane force-pushed the record_finalize branch 4 times, most recently from bf2597f to 742801f Compare June 5, 2015 18:10
@ahaldane
Copy link
Member Author

ahaldane commented Jun 5, 2015

ugh there are still two test failures, deep in MaskedArray code. I'll work on this later.

@charris
Copy link
Member

charris commented Jun 5, 2015

My sympathies ;)

@mhvk
Copy link
Contributor

mhvk commented Jun 5, 2015

@ahaldane - I hadn't thought about the reversibility. Since that cannot possibly be guaranteed anyway, I think it is not a major problem, but perhaps it is good in the documentation to mention how one would get a regular structured array out (i.e., reverse the dtype change). Is it just rec.view(dtype=np.dtype(np.void, rec.dtype), np.ndarray)?

@ahaldane
Copy link
Member Author

ahaldane commented Jun 8, 2015

Update, here's what's new

  1. Updated docs for how to reverse the view. @mhvk That makes sense. And I think the reverse view needs to be
rec.view(rec.dtype.fields or rec.dtype, np.ndarray)

to take into account non-structured (non-void) record arrays. IE, both of the following record arrays

rec = np.rec.array(ones(4, dtype='f4,i4')) # structured, dtype is np.void
rec = np.rec.array(ones(4, dtype='f8'))    # unstructured, dtype is np.float64
  1. While working out the view above, I noticed that non-void record arrays can be created/represented using rec.array despite not having np.record dtype, so I modified recarray.__repr__ accordingly.

  2. I removed the recarray.view method. The only case it differed from ndarray.view was for non-structured dtype, in which case it returned an ndarray instead of a (non-structured) recarray. But this breaks chained views:

>>> rec.view('f8').view('f4,i4')

The user probably wants the final result to be a record array, but the middle view currently converts to ndarray. There are already a fair number of places non-structured recarrays turn up (see above), so no reason to hide them.

  1. Hopefully clarified the docs for the dtype((base_dtype, new_dtype)) form of dtype specification.

As for the MaskedArray ssues, they pass tests now but I think there may still be something wrong. I still need to think more about it.

@mhvk
Copy link
Contributor

mhvk commented Jun 8, 2015

@ahaldane - I had a quick look through the latest PR and it seems OK (but as I don't use recarray much myself, someone else should definitely look as well).

@ahaldane
Copy link
Member Author

Updated with extra tweaks to the MaskedArray part. (I think the recarray part is done).

I feel better about the MaskedArray changes now. As a reminder, this needed to happen to pass unit tests related to masked record arrays.

A problem was that the masked singleton's shape was being overwritten if I wasn't careful (another example of #5806). I think my changes are safe wrt this now. As an extra precaution I also tweaked MaskedArray.anom by quitting early if the result is masked, to avoid any extra manipulation of masked.

I also removed code that updates the mask's shape after the dtype is updated. I am not sure what its purpose was, and removing it doesn't affect any unit tests. Maskedarray doesn't support dtype changes that change the shape anyway. Maybe I have to think more about it, but right now it seems OK to me, and is easily changed back.

def __array_finalize__(self, obj):
if type(obj) is not type(self):
# invoke __setattr__
self.dtype = self.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaldane - In principle, at this point self.dtype is obj.dtype -- I think it would read better to just write self.dtype = obj.dtype, and expand your comment. e.g.,

    # invoke __setattr__, which for void dtypes will do
    # self.dtype = sb.dtype((record, obj.dtype))
    self.dtype = obj.dtype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expanded the comment, but I still have to use self.dtype since the condition is "self.dtype is not obj.dtype" ;)

@ahaldane
Copy link
Member Author

Updated with better comment + expanded tests, plus added the shape checks mentioned above back in. When adding the new tests I finally figured out why the shape manipultion was there - it raises a ValueError if the dtype change cannot apply to the mask.

Eg, when changing from 'f4,f4' to 'f8', the mask would change from [('f0', '?'), ('f1', '?')] to bool, but it is unclear how to construct the new mask values from the old mask - should you OR the old mask? Currently MA just disallows that dtype change. 'f4,f4' to 'f4' on the other hand has no such problem.

if type(obj) is not type(self):
# invoke __setattr__, which for void dtypes will do
# self.dtype = sb.dtype((record, self.dtype))
self.dtype = self.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaldane - it should be:

class myarray(np.ndarray):
    def __array_finalize__(self, obj):
        print(self, obj)
        print(type(self), type(obj), type(self) is type(obj))
        if obj is not None:
            print(self.dtype, obj.dtype, self.dtype is obj.dtype)

# Now the three ways we can get to __array_finalize__
myarray((4,), buffer=np.arange(4.))
# [ 0.  1.  2.  3.] None
# <class '__main__.myarray'> <class 'NoneType'>
# -->  myarray([ 0.,  1.,  2.,  3.])
mya[:2]
# [ 0.  1.] [ 0.  1.  2.  3.]
# <class '__main__.myarray'> <class '__main__.myarray'>
# float64 float64 True
# --> myarray([ 0.,  1.])
np.arange(4.).view(myarray)
# [ 0.  1.  2.  3.] [ 0.  1.  2.  3.]
# <class '__main__.myarray'> <class 'numpy.ndarray'> False
# float64 float64 True
# --> Out[15]: myarray([ 0.,  1.,  2.,  3.])

I think this is just a problem with your if statement; it should be

if obj is not None and type(self) is not type(obj):
    self.dtype = obj.dtype

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm being too clever, but I think we want the dtype to be set to np.record in the case of an explicit constructor. I still think the clause I have is the right one.

class TestClass(np.ndarray):
    def __array_finalize__(self, obj):
        print(obj is not None and type(self) is not type(obj), 
              type(obj) is not type(self))

a = np.arange(10)

t1 = a.view(TestClass)  # view fron ndarray
t2 = TestClass(4)       # explicit constructor
t3 = t1[:2]             # slice
t4 = t1.view(TestClass) # view from same class

This prints out (left is your if, right is my if)

(True, True)
(False, True)
(False, False)
(False, False)

I think the right is what I want. Is that correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if you need it to be set also for explicit construction then, yes, your case is the right one. Since this was not obvious to me, and therefore may not be obvious to a future person looking at the code, maybe write the comment as:

# Both when constructing a new instance (obj=None) and when viewing
# from another class -- obj.view(type(self)) -- we invoke __setattr__,
# which for void dtypes will do self.dtype = sb.dtype((record, self.dtype)).

@mhvk
Copy link
Contributor

mhvk commented Jun 19, 2015

@ahaldane - nice catch on the mask change with incompatible dtype. While oring the mask seems OK for the example you give, it does seem safest to just disallow it, at least for now!

@ahaldane
Copy link
Member Author

@mhvk the detailed comment is a good idea. Done.

@mhvk
Copy link
Contributor

mhvk commented Jun 19, 2015

Great, looks all OK to me!

@ahaldane
Copy link
Member Author

Don't merge, I found a problem:

>>> r = np.rec.array(np.ones(4, dtype='f4,f4'))
>>> r.view('f8').dtype
dtype('float64')

will fix soon.

@ahaldane
Copy link
Member Author

Oh never mind, that is a scalar recarray. It is working as expected, I just confused myself.

But it did make me notice, perhaps a better if clause in array_finalize (as we were discussing above) is

if self.dtype.type is not record

I think that's clearer, right?

@charris
Copy link
Member

charris commented Jun 19, 2015

Agree.

Viewing an ndarray as a np.recarray now automatically converts
the dtype to np.record.

This commit also fixes assignment to MaskedArray's dtype attribute,
fixes the repr of recarrays with non-structured dtype, and removes
recarray.view so that viewing a recarray as a non-structured dtype
no longer converts to ndarray type.

Fixes numpy#3581
@mhvk
Copy link
Contributor

mhvk commented Jun 19, 2015

Yes, better, now the test is much more self-explanatory.

charris added a commit that referenced this pull request Jun 19, 2015
BUG: automatically convert recarray dtype to np.record
@charris charris merged commit e42bea5 into numpy:master Jun 19, 2015
@charris
Copy link
Member

charris commented Jun 19, 2015

@ahaldane @mhvk Thanks, merged. I hope it is correct ;)

@ahaldane
Copy link
Member Author

Thanks, and thanks @mhvk for helpful comments & feedback.

ahaldane added a commit to ahaldane/numpy that referenced this pull request Oct 18, 2015
Record array views were updated in numpy#5943 to return np.record dtype
where possible, but forgot about the case of sub-arrays.

That's fixed here, so accessing subarray fields by attribute or index
works sensibly, as well as viewing a record array as a subarray dtype,
and printing subarrays.

This also happens to fix numpy#6459, since it affects the same lines.

Fixes numpy#6497 numpy#6459
charris pushed a commit to charris/numpy that referenced this pull request Oct 18, 2015
Record array views were updated in numpy#5943 to return np.record dtype
where possible, but forgot about the case of sub-arrays.

That's fixed here, so accessing subarray fields by attribute or index
works sensibly, as well as viewing a record array as a subarray dtype,
and printing subarrays.

This also happens to fix numpy#6459, since it affects the same lines.

Fixes numpy#6497 numpy#6459
jaimefrio pushed a commit to jaimefrio/numpy that referenced this pull request Mar 22, 2016
Record array views were updated in numpy#5943 to return np.record dtype
where possible, but forgot about the case of sub-arrays.

That's fixed here, so accessing subarray fields by attribute or index
works sensibly, as well as viewing a record array as a subarray dtype,
and printing subarrays.

This also happens to fix numpy#6459, since it affects the same lines.

Fixes numpy#6497 numpy#6459
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants