ENH: Various improvements to Maskedarray repr #9792
Conversation
That last case is still pretty ugly, and I'd be tempted to change it to

    masked_array(
      data = [[--, 0],
              [0, 1]],
      mask = [[ True, False],
              [False, False]],
      fill_value = 999999)

That is, move the kwargs to the start of the line at the expense of not aligning them.
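For illustration only (not the PR's implementation), one way to get that left-pushed layout is textwrap.indent, which is presumably the kind of thing the textwrap import in the diff below is for; the helper here is purely hypothetical:

    import textwrap

    # A toy sketch (not the PR's code) of the left-pushed layout: format each
    # keyword block first, then indent the whole lot by a small fixed amount
    # instead of aligning it under 'masked_array('.
    def sketch_ma_repr(data, mask, fill_value):
        body = 'data = %s,\nmask = %s,\nfill_value = %s)' % (data, mask, fill_value)
        return 'masked_array(\n' + textwrap.indent(body, '  ')

    print(sketch_ma_repr('[[--, 0], [0, 1]]',
                         '[[True, False], [False, False]]',
                         999999))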
If we replaced the
numpy/ma/core.py
Outdated
@@ -25,6 +25,7 @@
 import sys
 import operator
 import warnings
+import textwrap
This is already imported elsewhere in numpy (based on looking at sys.modules), so it shouldn't impact startup time.
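A quick way to check that claim at a Python prompt (illustration only, not part of the PR):

    import sys
    import numpy  # importing numpy pulls in its own dependencies

    # If textwrap is already among them, adding `import textwrap` to
    # ma/core.py costs nothing extra at startup.
    print('textwrap' in sys.modules)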
I'm thinking about it :) This seems like the right time for any changes, but we have to balance it somehow to avoid making arbitrary changes that just annoy people. Also on my mind is the fact that we might start a new maskedarray implementation. If some of these changes are too big to put here, we might put them in a new implementation. In specifics: I like the. The newline changes might be OK. Another idea: the
Re: A reimplementation of MaskedArray: I think better missing-value support could be a big boost for numpy. It is quite poor at it compared to other languages like R. Wes McKinney even mentioned missing-value support in his recent blog post justifying his desire to move pandas to use Apache Arrow instead of numpy (though I think pandas does its own thing with nans right now). I am aware there are a bunch of different ways we could implement missing-value support, but a well thought-out MaskedArray implementation seems like an easy one, given we can work with the current MaskedArray code. There are ways in which I think a good MaskedArray implementation could be better than the alternatives such as the sentinel values used in R. I think
Note that this is already configurable with
And he is one of the designers of Apache Arrow. My guess is that pandas will make the move at some point in the future, as Apache Arrow was in part motivated by a desire to have something better adapted to big, really big, tabular data. I'm thinking about what the role of NumPy will be in a future where pandas has moved on and multi-terabyte data is commonly worked with. I also recall @mwiebe's comment that NumPy is quite slow and inefficient. Some things can be improved, but to some extent we will be locked in by current functionality, striding and such. I do think we need better strings, dtypes, and missing value handling.
@charris, What comment was that? I don't recall. And what is the stride problem, again? Numpy's speed and low abstraction level are definitely some of my main motivations for using it. I basically use it as a way to get near-C speeds, but with all of Python's convenience. I've been vaguely keeping track of cases in my programs where I felt more numpy optimizations were possible, mainly cases where we might avoid temporary arrays or add short-circuiting... but I'm interested to hear more. Also, numpy has always seemed to me like the low-level base for other projects like scipy, dask and numba, and I don't see it ever really going away. The thing that isn't totally clear to me is numpy's relation to projects like dynd, blaze, plures. I sort of thought they were attempts at numpy 2.0 which would solve all our legacy problems, and we would all transition to them, but they seem slow in coming. Or maybe the way to view them is as experiments, large parts of which we might paste into numpy, or build numpy on. As for pandas, I think it is pretty natural for it to move to arrow. Pandas always seemed to me much more like a relational database interface for Python. (I actually don't use pandas much since I don't work with very relational data.)
Mark made the comment some years ago when Julian started making some of his speed improvements using vectorized instructions. The problem with strides is that they can break locality so that cache is not used efficiently. There is really no easy way around that as strides and subarrays are one of the most useful features of NumPy. I'm thinking about sustainability. Technical obsolescence is certainly one way to lose a niche, OTOH, there are limits to what can be done with NumPy and still be NumPy. So I'm trying to clarify for myself exactly where NumPy fits in. I think DyND was effectively an attempt at NumPy 2.0, but NumPy was too established at that point for DyND to really gain a foothold. Unfortunate timing, really. However, it has explored some good ideas that we should take a look at.
Ah, perhaps you are referring to his NEP on Optimizing Iterator/UFunc Performance (link). I'll read it through. By the way, I notice again that there is an NEP for a groupby function. We had a discussion about this a year ago on the mailing list, and the conclusion was "don't try it, we don't want to step on pandas' toes". We should think about it again if pandas goes its own way.
Any decision on this?
Another comment on masked array. There was an attempt to make masks part of. As to this PR, a decision one way or the other would be helpful, as I would like to branch 1.14 in the not-too-distant future.
Involved in the decision, thoughts on this comment would be appreciated too.
It's very tough to decide these things because the costs and benefits are hard to weigh. In the "for" camp, I personally think the proposed modifications are nicer. I like the version which is pushed to the left. But should we go further? For instance, I might prefer not to print out the

In the "against" camp, the old output isn't that bad. Also, I think there is less motivation for this than for the ndarray printing changes we have made so far. For the ndarray changes we managed to make a "legacy" mode which preserved the old behavior, but we don't have that here. Judging from comments we got, some people do care about having a legacy mode. Also, the ndarray printing changes had a clearer goal: to fix inconsistencies in array printing which were impeding other bug fixes and improvements. In contrast, the changes here don't really "fix" anything; they are just prettier. This is a change that, if we were working on a "numpy 2.0" branch, I would be all for.

But after writing the comments above, I think the lack of a legacy mode might push me over to the "no" side. I might be convinced otherwise. @eric-wieser what do you think?
Although I think it is an improvement, I'd also hesitate to put this in... Especially since I agree there is no real need to display the mask. But it does depend a bit on what we think the likelihood of a ma 2.0 is... Apropos of ma 2.0, I just happened to think about it yesterday: it might be easiest to split it in two cases, one that is strictly a container class that propagates a mask with the same shape as whatever the data is (i.e.,
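Purely for illustration, a hypothetical sketch of that "strict container that propagates a same-shaped mask" idea; the class name and behaviour are invented here, not anything designed in this thread:

    import numpy as np

    class MaskedContainer:
        """Hypothetical: wraps an array-like and carries a same-shaped mask."""

        def __init__(self, data, mask=None):
            self.data = np.asarray(data)
            self.mask = (np.zeros(self.data.shape, dtype=bool)
                         if mask is None else np.asarray(mask, dtype=bool))

        def __add__(self, other):
            # Propagate the mask: an output element is masked if it was
            # masked in either operand.
            if isinstance(other, MaskedContainer):
                return MaskedContainer(self.data + other.data,
                                       self.mask | other.mask)
            return MaskedContainer(self.data + other, self.mask)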
numpy/ma/tests/test_core.py
Outdated
        a = np.ma.arange(2000)
        a[1:50] = np.ma.masked
        assert_equal(
            repr(a),
            'masked_array(data = [0 -- -- ..., 1997 1998 1999],\n'
            ' mask = [False True True ..., False False False],\n'
Note this repr already changed once after 1.13, in #9815.
Force-pushed from 31fce10 to 083730b
That's fun - GitHub is showing those in date order, not DAG order... The two MAINT commits are extracted to #9913, anyway.
It's surprisingly hard to reconstruct a masked array from just the repr otherwise -
These are pretty irritating:

    >>> np.ma.array(np.eye(4))
    masked_array(data =
     [[ 1.  0.  0.  0.]   # really?
     [ 0.  1.  0.  0.]
     [ 0.  0.  1.  0.]
     [ 0.  0.  0.  1.]],
                 mask =
     False,
           fill_value = 1e+20)

    >>> np.ma.array([datetime.now()]*2, dtype=object)
    masked_array(data = [datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)
     datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)],   # why?
                 mask = False,
          fill_value = ?)   # the heck is this?

when they could be

    >>> np.ma.array(np.eye(4))
    masked_array(data =
     [[ 1.  0.  0.  0.]
     [ 0.  1.  0.  0.]
     [ 0.  0.  1.  0.]
     [ 0.  0.  0.  1.]],
                 mask =
     False,
           fill_value = 1e+20)

    >>> np.ma.array([datetime.now()]*2, dtype=object)
    masked_array(data = [datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)
     datetime.datetime(2017, 10, 23, 21, 25, 58, 273000)],
                 mask = False,
          fill_value = '?')   # oh, it's a string
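Incidentally, the quoting difference in that last line boils down to formatting the fill value with str() versus repr(); a tiny illustration (not the PR's actual code):

    fill_value = '?'
    print('fill_value = %s' % (fill_value,))   # fill_value = ?    -- ambiguous
    print('fill_value = %r' % (fill_value,))   # fill_value = '?'  -- clearly a string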
Force-pushed from 083730b to 65e9610
Hmm, I'm getting into the let's-do-it camp. But if we do it, I'd go all the way to putting data and mask on new lines (when needed), as you suggested (#9792 (comment)).
(that was because I was too lazy to rebase)
I'm confused - the comment you link to is suggesting combining lines that are currently wrapped, not inserting more new lines
One thing that makes me more in favor is that it seems many of the people who care about doctests test continuously against numpy master. So if this PR is going to cause problems, we will know before the 1.14 release and will have a chance to reconsider. Also, assuming we don't make tons of MaskedArray changes before the 1.14 release, this should be easy to revert.
It probably wouldn't be too hard to hijack the arrayprint legacy mode here to allow it to be disabled along with the other new behaviour - the diff is now not very big
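A minimal sketch of what gating on that mode could look like, using only the public printoptions API; the helper name is invented and this isn't the code the PR ended up with:

    import numpy as np

    def new_ma_repr_enabled():
        # np.set_printoptions(legacy='1.13') restores the pre-1.14 ndarray
        # formatting; the masked-array repr could simply follow the same switch.
        return np.get_printoptions().get('legacy') != '1.13'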
@eric-wieser - I meant your suggestion to change multi-line reprs to

    masked_array(
      data = [[--, 0],
              [0, 1]],
      mask = [[ True, False],
              [False, False]],
      fill_value = 999999)

(although I'd remove the spaces around =)
Where are we with this?
How do we feel about
@mhvk: Should I do that for the "short" repr as well?
Yes, makes sense to look at. And, yes, I would remove them also in the short repr; it just looks more like an actual function call then, where spaces around equal signs are not PEP8y.
If this is made to depend on the legacy mode, I'll put it in. But be warned that I'm about to start pushing off the remaining PRs marked for 1.14 to 1.15.
Yeah, checking
Force-pushed from 65e9610 to d6fa4f6
Ok, overhauled with:
I've updated the top comment with the new output, and added some more tests.
@@ -1175,6 +1175,28 @@ def dtype_is_implied(dtype):
    return dtype.type in _typelessdata


def dtype_short_repr(dtype):
This feels like it might belong at dtype.__format__ in future, but for now I just wanted to avoid code duplication.
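For context, roughly what a "short" dtype spelling is after: a string that can be passed straight back as a dtype= argument, dropping the dtype('...') wrapper where possible. The helper below is an illustrative sketch, not numpy's actual dtype_short_repr:

    import numpy as np

    def dtype_short_repr_sketch(dtype):
        # Prefer the bare type name when it round-trips (e.g. 'int8',
        # 'float64'); otherwise fall back to the quoted form taken from repr,
        # which covers things like sized string dtypes.
        try:
            if np.dtype(dtype.name) == dtype:
                return dtype.name
        except TypeError:
            pass
        return repr(dtype)[len('dtype('):-1]

    print(dtype_short_repr_sketch(np.dtype(np.int8)))     # int8
    print(dtype_short_repr_sketch(np.dtype(np.float64)))  # float64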
This code is also duplicated from arrayprint.py; I was just modifying it in #10032. Maybe I can rebase that on this PR and we can move the function.
I'm confused - this file is arrayprint - where is the other duplicate?
Do you want me to split off a PR with just the first two commits, which we can merge first?
Oh sorry, I thought this was ma/core.py. This change is fine.
numpy/ma/core.py
Outdated
            keys.append('dtype')

        # choose what to indent each keyword with
        min_indent = 2
0, 1, or 4 might also be good choices.
        dtype_needed = (
            not np.core.arrayprint.dtype_is_implied(self.dtype) or
            np.all(self.mask) or
            self.size == 0
Plain ndarrays also have a special case for size 0 with more than one dimension:

    >>> np.empty((0,1))
    array([], shape=(0, 1), dtype=float64)

But I'm a bit ambivalent about that behavior since shape isn't a valid array keyword. But I don't have a better way of signalling that the shape is >1d. Compare to np.ma.empty((0,1)).
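To spell out why the shape keyword is awkward (illustration only): np.array() has no shape argument, so that repr does not round-trip directly; a reconstruction has to restore the shape separately, e.g.:

    import numpy as np

    # The repr mentions shape=(0, 1), but np.array() does not accept `shape`,
    # so the printed form cannot be pasted back in as-is.
    a = np.empty((0, 1))
    b = np.array([], dtype=np.float64).reshape(0, 1)
    assert a.shape == b.shape and a.dtype == b.dtype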
I just checked, that behavior goes back to before numpy existed!
An idea: maybe print np.empty((0, 1)) as empty((0, 1), dtype=float64)?
That's pretty neat. I'm on board.
I'll try it out as an extra commit in the newline PR.
Please don't - I want to get the newline PR in as is so that I can rebase this and dtype_short_repr
Or actually, we can try in a new PR to avoid cluttering things.
Oh ok I didn't see your reply. Agreed.
You can always create a new PR sharing commits with the old PR - if you do that, and the old PR is merged, then you can avoid having to rebase by switching the "base branch" twice in the GitHub UI.
I suspect that some rebasing will be needed anyway since you'll touch the typename code... we'll see.
Besides the nd size=0 thing, looks good. I think the current behavior is pretty good, and the 1d vs 2d difference works for me. The fact that this now line-wraps properly for long arrays is a big improvement! (I might even add a test for that.) One possible change: use long mode for 1d arrays if they don't fit on a single line.
This requires us to stringify the array twice, once for each indentation level, which will be twice as slow on large arrays. If you think this is acceptable, then I'll go for it.
Force-pushed from d6fa4f6 to 9b1ac84
That's true. I'll leave it to your discretion. There might be a way to avoid most of the cost with some simple length check. E.g., if there are more than
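A sketch of the kind of cheap guard being suggested here; the helper and the per-element width guess are hypothetical, not the PR's code:

    import numpy as np

    def try_one_line(a, linewidth=75, chars_per_element=6):
        # Rough guard: skip the one-line attempt entirely for arrays that are
        # obviously too long, so large arrays are never stringified twice.
        if a.ndim <= 1 and a.size * chars_per_element <= linewidth:
            candidate = np.array2string(a, separator=', ')
            if '\n' not in candidate:
                return candidate
        return None  # caller falls back to the wrapped, multi-line layout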
Force-pushed from 9b1ac84 to 0d09caa
* Commas are now used within data and mask
* dtypes are shown for float32, int8 etc, where they previously weren't
* Wrapped fields are now correctly indented
* Spaces removed around = to match PEP8
Force-pushed from 0d09caa to 8bac6ee
Rebased to use the cleaned-up. Have slightly improved the wrapping logic to allow any single-row array to appear on one line, but haven't gone as far as doing the double-stringification version.
            masked_array(data=[[1, 2, 3]],
                         mask=[[False, False, False]],
                         fill_value=999999,
                         dtype=int8)''')
Latest revision changed this case.
This looks great!
The single-row change looks good, LGTM too. I'll merge then, thanks Eric!
While we're playing the "break the doctests" game...
Before:
After: