
ENH: Various improvements to MaskedArray repr #9792


Merged
merged 3 commits on Nov 20, 2017

Conversation

eric-wieser
Member

@eric-wieser eric-wieser commented Sep 29, 2017

While we're playing the "break the doctests" game...

  • Remove trailing newline (probably safe in doctests)
  • Fix alignment of wrapped lines (maybe safe in doctests)
  • Add commas (almost makes it valid python, breaks all doctests)

Before:

>>> np.ma.array(np.arange(30), mask=np.arange(30) % 3 == 0)
masked_array(data = [-- 1 2 -- 4 5 -- 7 8 -- 10 11 -- 13 14 -- 16 17 -- 19 20 -- 22 23 -- 25 26
 -- 28 29],
             mask = [ True False False  True False False  True False False  True False False
  True False False  True False False  True False False  True False False
  True False False  True False False],
       fill_value = 999999)

>>> np.ma.array(np.eye(2).astype(int), mask=[[1, 0], [0, 0]], dtype=np.int8)
masked_array(data =
 [[-- 0]
 [0 1]],
             mask =
 [[ True False]
 [False False]],
       fill_value = 999999)

After:

>>> np.ma.array(np.arange(30), mask=np.arange(30) % 3 == 0)
masked_array(data = [--, 1, 2, --, 4, 5, --, 7, 8, --, 10, 11, --, 13, 14,
                     --, 16, 17, --, 19, 20, --, 22, 23, --, 25, 26, --,
                     28, 29],
             mask = [ True, False, False,  True, False, False,  True,
                     False, False,  True, False, False,  True, False,
                     False,  True, False, False,  True, False, False,
                      True, False, False,  True, False, False,  True,
                     False, False],
       fill_value = 999999)
>>> np.ma.array(np.eye(2).astype(int), mask=[[1, 0], [0, 0]], dtype=np.int8)
masked_array(data =
 [[--, 0],
  [0, 1]],
             mask =
 [[ True, False],
  [False, False]],
       fill_value = 999999)

After:

>>> np.ma.array(np.arange(30), mask=np.arange(30) % 3 == 0)
masked_array(data=[--, 1, 2, --, 4, 5, --, 7, 8, --, 10, 11, --, 13, 14,
                   --, 16, 17, --, 19, 20, --, 22, 23, --, 25, 26, --, 28,
                   29],
             mask=[ True, False, False,  True, False, False,  True, False,
                   False,  True, False, False,  True, False, False,  True,
                   False, False,  True, False, False,  True, False, False,
                    True, False, False,  True, False, False],
       fill_value=999999)

>>> np.ma.array(np.eye(2).astype(int), mask=[[1, 0], [0, 0]], dtype=np.int8)
masked_array(
  data=[[--, 0],
        [0, 1]],
  mask=[[ True, False],
        [False, False]],
  fill_value=999999,
  dtype=int8)

@eric-wieser eric-wieser added the component: numpy.ma masked arrays label Sep 29, 2017
@eric-wieser
Member Author

eric-wieser commented Sep 29, 2017

That last case is still pretty ugly, and I'd be tempted to change it to

masked_array(
 data = [[--, 0],
         [0, 1]],
 mask = [[ True, False],
         [False, False]],
 fill_value = 999999)

That is, move the kwargs to the start of the line at the expense of not aligning them.

@eric-wieser
Member Author

eric-wieser commented Sep 29, 2017

almost makes it valid python

If we replaced the -- with __, then it would be fully evaluable with eval(repr(arr), vars(np.ma)). It'd then be reasonable to make np.ma.__ an alias for np.ma.masked.
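A rough sketch of this round-trip idea (the `np.ma.__` alias is a proposal here and does not exist in numpy; the `__` name below is bound by hand for illustration):

```python
import numpy as np

# With commas in the repr, an array with no masked entries is already
# evaluable back into an equivalent array:
a = np.ma.masked_array([1, 2, 3], mask=[False, False, False])
b = eval(repr(a), dict(vars(np.ma)))
assert b.tolist() == [1, 2, 3]

# The proposed (hypothetical) np.ma.__ alias would extend this to masked
# entries, since `__` is a valid Python identifier while `--` is not:
namespace = dict(vars(np.ma))
namespace['__'] = np.ma.masked  # hypothetical alias, bound manually here
c = eval("masked_array(data=[__, 1, 2], mask=[True, False, False])", namespace)
assert c.mask.tolist() == [True, False, False]
```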

numpy/ma/core.py Outdated
@@ -25,6 +25,7 @@
import sys
import operator
import warnings
import textwrap
Member Author

This is already imported elsewhere in numpy based on looking at sys.modules, so shouldn't impact startup time

@ahaldane
Member

ahaldane commented Oct 1, 2017

I'm thinking about it :)

This seems like the right time for any changes, but we have to balance it somehow to avoid making arbitrary changes that just annoy people.

Also on my mind is the fact that we might start a new maskedarray implementation. If some of these changes are too big to put here, we might put them in a new implementation.

On specifics: I like the __ idea a lot, but it's probably too different from the current output. It seems like a great idea for a reimplementation. Adding commas seems like a very desirable change, so much so that I might almost be in favor of breaking doctests for them, not sure.

The newline changes might be OK.

Another idea: the mask = part of the repr seems redundant since masked values already have '--' in the data part, so maybe one day we could remove that entirely. But maybe it is there to avoid confusion involving nomask.

@ahaldane
Member

ahaldane commented Oct 1, 2017

Re: A reimplementation of MaskedArray:

I think better missing-value support could be a big boost for numpy. It is quite poor at it compared to other languages like R. Wes McKinney even mentioned missing value support in his recent blog post justifying his desire to move pandas to use apache arrow instead of numpy (though I think pandas does its own thing with nans right now).

I am aware there are a bunch of different ways we could implement missing value support, but a well thought-out MaskedArray implementation seems like an easy one, given we can work with the current MaskedArray code. There are ways in which I think a good MaskedArray implementation could be better than the alternatives, such as the sentinel values used in R.

I think __array_ufunc__ is in now, so we could try using it. I just spent an hour going through the maskedarray code and removed a ton of code related to nomask, hardmask, sharemask, removing functions that seem redundant. Lots of fun! But it's in heavy disarray and not compiling now; the only code I have actually written is a little in __new__, __array_finalize__, __getitem__. (Probably not worth looking at at this point, but I put the files here.. feel free to try from scratch too)

@eric-wieser
Member Author

I like the __ idea a lot, but it's probably too different from the current output.

Note that this is already configurable with np.ma.masked_print_option.set_display('__')
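This configurability can be seen directly with the existing `np.ma.masked_print_option` API:

```python
import numpy as np

# The placeholder used for masked elements is already configurable:
np.ma.masked_print_option.set_display('__')
a = np.ma.array([1, 2, 3], mask=[1, 0, 0])
shown = repr(a)
np.ma.masked_print_option.set_display('--')  # restore the default
assert '__' in shown and '--' not in shown
```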

@charris
Member

charris commented Oct 3, 2017

justifying his desire to move pandas to use apache arrow

And he is one of the designers of apache arrow. My guess is that pandas will make the move at some point in the future as apache arrow was in part motivated by a desire to have something better adapted to big, really big, tabular data.

I'm thinking about what the role of NumPy will be in a future where pandas has moved on and multi-terabyte data is commonly worked with. I also recall @mwiebe's comment that NumPy is quite slow and inefficient. Some things can be improved, but to some extent we will be locked in by current functionality, striding and such. I do think we need better strings, dtypes, and missing value handling.

@ahaldane
Member

ahaldane commented Oct 4, 2017

@charris, What comment was that? I don't recall. And what is the stride problem, again?

Numpy's speed and low abstraction level are definitely some of my main motivations for using it. I basically use it as a way to get near-C speeds, but with all of python's convenience. I've been vaguely keeping track of cases in my programs where I felt more numpy optimizations were possible, mainly cases where we might avoid temporary arrays or add short-circuiting... but I'm interested to hear more.

Also, numpy has always seemed to me like the lowlevel base for other projects like scipy, dask and numba, and I don't see it ever really going away. The thing that isn't totally clear to me is numpy's relation to projects like dynd, blaze, plures. I sort of thought they were attempts at numpy 2.0 which would solve all our legacy problems, and we would all transition to them, but they seem slow in coming. Or maybe the way to view them is as experiments, large parts of which we might paste into numpy, or build numpy on them.

As for pandas, I think it is pretty natural for it to move to arrow. Pandas always seemed to me much more like a relational database interface for python. (I actually don't use pandas much since I don't work with very relational data).

@charris
Member

charris commented Oct 4, 2017

What comment was that? I don't recall. And what is the stride problem, again?

Mark made the comment some years ago when Julian started making some of his speed improvements using vectorized instructions. The problem with strides is that they can break locality so that cache is not used efficiently. There is really no easy way around that as strides and subarrays are one of the most useful features of NumPy.

I'm thinking about sustainability. Technical obsolescence is certainly one way to lose a niche, OTOH, there are limits to what can be done with NumPy and still be NumPy. So I'm trying to clarify for myself exactly where NumPy fits in.

I think DyND was effectively an attempt at NumPy 2.0, but NumPy was too established at that point for DyND to really gain a foothold. Unfortunate timing, really. However, it has explored some good ideas that we should take a look at.

@ahaldane
Member

ahaldane commented Oct 4, 2017

Ah, perhaps you are referring to his NEP for Optimizing Iterator/UFunc Performance (link). I'll read it through.

By the way, I notice again that there is an NEP for a groupby function. We had a discussion about this a year ago on the mailing list, and the conclusion was "don't try it, we don't want to step on pandas' toes". We should think about it again if pandas goes its own way.

@charris charris added this to the 1.14.0 release milestone Oct 21, 2017
@charris
Member

charris commented Oct 21, 2017

Any decision on this?

@charris
Member

charris commented Oct 23, 2017

Another comment on masked array. There was an attempt to make masks part of ndarray that ended up getting shot down on account of the basic C structure gaining two new members. I don't think that is considered a relevant criterion these days, and we might want to revisit the idea.

As to this PR, a decision one way or the other would be helpful as I would like to branch 1.14 in the not too distant future.

@eric-wieser
Member Author

Involved in the decision, thoughts on this comment would be appreciated too.

@ahaldane
Member

It's very tough to decide these things because the costs and benefits are hard to weigh.

In the "for" camp, I personally think the proposed modifications are nicer. I like the version which is pushed to the left. But should we go further? For instance, I might prefer not to print out the mask= at all, since that info is already visible through the location of the '--' in the data, and then we could get rid of the data= prefix and just print the data array.

In the "against" camp, the old output isn't that bad. Also, I think there is less motivation for this than for the ndarray printing changes we have made so far. For the ndarray changes we managed to make a "legacy" mode which preserved the old behavior, but we don't have that here. Judging from comments we got, some people do care about having a legacy mode. Also, the ndarray printing changes had a clearer goal to fix inconsistencies in array printing which were impeding other bug fixes and improvements. In contrast the changes here don't really "fix" anything, they are just prettier.

This is a change that, if we were working on a "numpy 2.0" branch, I would be all for. But after writing the comments above, I think the lack of a legacy mode might push me over to the "no" side. I might be convinced otherwise.

@eric-wieser what do you think?

@mhvk
Contributor

mhvk commented Oct 23, 2017

Although I think it is an improvement, I'd also hesitate to put this in... especially since I agree there is no real need to display the mask. But it does depend a bit on what we think the likelihood of a ma 2.0 is.

Apropos ma 2.0, I just happened to think about it yesterday: it might be easiest to split it into two cases: one that is strictly a container class that propagates a mask with the same shape as whatever the data is (i.e., OR the masks for any binary operation; propagate it for mean, etc.; here, the mask behaves like a nan), and a subclass specific to numerical data that has the logic of masking inf, nan etc. (and doing the equivalent of nanmean etc.). With __array_ufunc__ the first should be nearly trivial, and the second a modest addition.
There might be some case to be made for making the numerical one work with float only (so one can use things like nanmean instead of rewriting averaging routines).
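A minimal, hypothetical sketch of the strict-container case (the class and helper names here are invented for illustration, not numpy code): a wrapper whose `__array_ufunc__` ORs the masks of all inputs for elementwise ufuncs.

```python
import numpy as np

class SimpleMasked:
    """Toy mask-propagating container (illustration only)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.broadcast_to(np.asarray(mask, bool), self.data.shape)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain elementwise calls are handled in this sketch
        # (no out=, no reductions).
        if method != '__call__' or kwargs:
            return NotImplemented
        datas, masks = [], []
        for x in inputs:
            if isinstance(x, SimpleMasked):
                datas.append(x.data)
                masks.append(x.mask)
            else:
                datas.append(np.asarray(x))
                masks.append(np.zeros(np.shape(x), bool))
        result = ufunc(*datas)
        # The output mask is the OR of all input masks, broadcast to
        # the result shape - the "mask behaves like a nan" rule.
        mask = np.logical_or.reduce(
            [np.broadcast_to(m, result.shape) for m in masks])
        return SimpleMasked(result, mask)

a = SimpleMasked([1, 2, 3], [True, False, False])
b = SimpleMasked([10, 20, 30], [False, False, True])
c = np.add(a, b)  # dispatches through __array_ufunc__
```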


a = np.ma.arange(2000)
a[1:50] = np.ma.masked
assert_equal(
repr(a),
'masked_array(data = [0 -- -- ..., 1997 1998 1999],\n'
' mask = [False True True ..., False False False],\n'
Member Author

@eric-wieser eric-wieser Oct 24, 2017

Note this repr already changed once after 1.13 in #9815

@eric-wieser
Member Author

eric-wieser commented Oct 24, 2017

That's fun - github is showing those in date order, not DAG order... The two MAINT commits are extracted to #9913, anyway.

@eric-wieser
Member Author

eric-wieser commented Oct 24, 2017

For instance, I might prefer not to print out the mask= at all, since that info is already visible through the location of the '--' in the data, and then we could get rid of the data=prefix and just print the data array.

It's surprisingly hard to reconstruct a masked array from just the repr otherwise - np.ma.array([1, 2, np.ma.masked]) errors, for instance - so I think we need to keep the mask= until that's fixed.

In the "against" camp, the old output isn't that bad.

These are pretty irritating:

>>> np.ma.array(np.eye(4))
masked_array(data =
 [[ 1.  0.  0.  0.]    #really?
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]],
             mask =
 False,
       fill_value = 1e+20)

>>> np.ma.array([datetime.now()]*2, dtype=object)
masked_array(data = [datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)
 datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)],  # why?
             mask = False,
       fill_value = ?)  # the heck is this?

when they could be

>>> np.ma.array(np.eye(4))
masked_array(data =
 [[ 1.  0.  0.  0.]
  [ 0.  1.  0.  0.]
  [ 0.  0.  1.  0.]
  [ 0.  0.  0.  1.]],
             mask =
 False,
       fill_value = 1e+20)

>>> np.ma.array([datetime.now()]*2, dtype=object)
masked_array(data = [datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)
                     datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)],
             mask = False,
       fill_value = '?')  # oh, it's a string

@mhvk
Contributor

mhvk commented Oct 24, 2017

Hmm, I'm moving into the let's-do-it camp. But if we do it, I'd go all the way to putting data and mask on new lines (when needed), as you suggested (#9792 (comment))

@eric-wieser eric-wieser changed the base branch from master to maintenance/1.13.x October 27, 2017 04:22
@eric-wieser eric-wieser changed the base branch from maintenance/1.13.x to master October 27, 2017 04:22
@eric-wieser
Member Author

(that was because I was too lazy to rebase)

go all the way to putting data and mask on new lines

I'm confused - the comment you link to is suggesting combining lines that are currently wrapped, not inserting more new lines

@ahaldane
Member

One thing that makes me more in favor is that many of the people who care about doctests seem to test continuously against numpy master. So if this PR is going to cause problems, we will know before the 1.14 release and will have a chance to reconsider. Also, assuming we don't make tons of MaskedArray changes before the 1.14 release, this should be easy to revert.

@eric-wieser
Member Author

It probably wouldn't be too hard to hijack the arrayprint legacy mode here to allow it to be disabled along with the other new behaviour - the diff is now not very big

@mhvk
Contributor

mhvk commented Oct 27, 2017

@eric-wieser - I meant your suggestion to change multi-line reprs to

masked_array(
 data = [[--, 0],
         [0, 1]],
 mask = [[ True, False],
         [False, False]],
 fill_value = 999999)

(although I'd remove the spaces around = while you are at it...)

@charris
Member

charris commented Nov 11, 2017

Where are we with this?

@eric-wieser
Member Author

eric-wieser commented Nov 14, 2017

How do we feel about MaskedArray.__repr__ peeking at np.get_printoptions()['legacy'] in order to switch this behavior?

@eric-wieser
Member Author

although I'd remove the spaces around = while you are at it...)

@mhvk: Should I do that for the "short" repr as well?

@mhvk
Contributor

mhvk commented Nov 14, 2017

Yes, makes sense to look at get_printoptions -- supposedly that is how the user can know, so fine for us to use it internally too!

And yes, I would remove them in the short repr too; it just looks more like an actual function call then, where spaces around equals signs are not PEP8y.
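The legacy check being discussed could be sketched like this (the helper name is illustrative, not numpy's actual internals; only the `np.get_printoptions()['legacy']` lookup itself is the real public API):

```python
import numpy as np

def choose_repr_style():
    """Illustrative helper: pick a repr style from the public
    legacy print option, as proposed in this thread."""
    legacy = np.get_printoptions().get('legacy')
    # In 1.14 the option is either False or the string '1.13'.
    return 'old' if legacy == '1.13' else 'new'

np.set_printoptions(legacy='1.13')
assert choose_repr_style() == 'old'
np.set_printoptions(legacy=False)  # restore the default
assert choose_repr_style() == 'new'
```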

@charris
Member

charris commented Nov 18, 2017

If this is made to depend on the legacy mode, I'll put it in. But be warned that I'm about to start pushing off the remaining PRs marked for 1.14 to 1.15.

@ahaldane
Member

Yeah, checking np.get_printoptions()['legacy'] seems totally fair.

@eric-wieser
Member Author

eric-wieser commented Nov 19, 2017

Ok, overhauled with:

  • A check for `legacy == '1.13'` that uses the old code path
  • Addition of dtype=, since that was previously missing in important cases
  • @mhvk's requested removal of spaces around equals (PEP8-inspired, I guess)
  • The proposed new >1D format, which leaves data= at the start of the line.
  • Short release note

I've updated the top comment with the new output, and added some more tests.

@@ -1175,6 +1175,28 @@ def dtype_is_implied(dtype):
return dtype.type in _typelessdata


def dtype_short_repr(dtype):
Member Author

@eric-wieser eric-wieser Nov 19, 2017
This feels like it might belong at dtype.__format__ in future, but for now I just wanted to avoid code duplication

Member

This code is also duplicated from arrayprint.py; I was just modifying it in #10032. Maybe I can rebase that on this PR and we can move the function.

Member Author

I'm confused - this file is arrayprint - where is the other duplicate?

Member Author

Do you want me to split off a PR with just the first two commits, which we can merge first?

Member

Oh sorry I thought this was ma/core.py. This change is fine.

numpy/ma/core.py Outdated
keys.append('dtype')

# choose what to indent each keyword with
min_indent = 2
Member Author

0, 1, or 4 might also be good choices.

dtype_needed = (
not np.core.arrayprint.dtype_is_implied(self.dtype) or
np.all(self.mask) or
self.size == 0
Member

Plain ndarrays also have a special case for size-0, more than 1 dimension:

>>> np.empty((0,1))
array([], shape=(0, 1), dtype=float64)

But I'm a bit ambivalent about that behavior since shape isn't a valid array keyword. But I don't have a better way of signalling that the shape is >1d.

Compare to np.ma.empty((0,1)).

Member

I just checked, that behavior goes back to before numpy existed!

Member Author

@eric-wieser eric-wieser Nov 19, 2017

An idea: Maybe print np.empty((0, 1)) as empty((0, 1), dtype=float64)?

Member

That's pretty neat. I'm on board.

I'll try it out as an extra commit in the newline PR.

Member Author

@eric-wieser eric-wieser Nov 20, 2017

Please don't - I want to get the newline PR in as is so that I can rebase this and dtype_short_repr

Member

Or actually, we can try in a new PR to avoid cluttering things.

Member

@ahaldane ahaldane Nov 20, 2017

Oh ok I didn't see your reply. Agreed.

Member Author

You can always create a new PR with common commits with the old PR - if you do that, and the old PR is merged, then you can avoid having to rebase by switching the "base branch" twice in the github UI.

Member

I suspect that some rebasing will be needed anyway since you'll touch the typename code... we'll see.

@ahaldane
Member

Besides the nd size=0 thing, looks good.

I think the current behavior is pretty good, and the 1d vs 2d difference works for me. The fact that this now line-wraps properly for long arrays is a big improvement! (I might even add a test for that).

One possible change: use long mode for 1d arrays if they don't fit on a single line.

@eric-wieser
Member Author

One possible change: use long mode for 1d arrays if they don't fit on a single line.

This requires us to stringify the array twice, once for each indentation level, which will be twice as slow on large arrays. If you think this is acceptable, then I'll go for it

@eric-wieser
Member Author

eric-wieser commented Nov 19, 2017

The fact that this now line-wraps properly for long arrays is a big improvement! (I might even add a test for that).

I think there already is at least one such test. (Edit: good catch, that path was not tested - added now.)

@ahaldane
Member

ahaldane commented Nov 19, 2017

One possible change: use long mode for 1d arrays if they don't fit on a single line.

This requires us to stringify the array twice, once for each indentation level, which will be twice as slow on large arrays. If you think this is acceptable, then I'll go for it

That's true. I'll leave it to your discretion.

There might be a way to avoid most of the cost with a simple length check. E.g., if there are more than max_line_width/2 elements, use the long form, since each element comes with at least a comma and a space. That might be a bit too hacky though. (I mean, only do the double-stringify check after the length check tells you the short form might be possible.)
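The proposed pre-check could look like this (hypothetical helper, not numpy code; the threshold follows the estimate above that each element costs at least two characters of separator on a single line):

```python
def short_form_possible(n_elements, max_line_width=75):
    """Cheap pre-check: if there are more than max_line_width // 2
    elements, the single-line short form cannot possibly fit, so the
    second, more expensive stringification can be skipped."""
    return n_elements <= max_line_width // 2
```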

* Commas are now used within data and mask
* dtypes are shown for float32, int8 etc, where they previously weren't
* Wrapped fields are now correctly indented
* Spaces removed around = to match PEP8
@eric-wieser
Member Author

Rebased to use the cleaned-up typename calculations from #10032.

Have slightly improved the wrapping logic to allow any single-row array to appear on one line, but haven't gone as far as doing the double-stringification version.

masked_array(data=[[1, 2, 3]],
             mask=[[False, False, False]],
       fill_value=999999,
            dtype=int8)''')
Member Author

Latest revision changed this case.

Contributor

@mhvk mhvk left a comment

This looks great!

@ahaldane
Member

The single-row change looks good, LGTM too.

I'll merge then, thanks Eric!
