Skip to content

CLN: In DecimalArray implement aliases as properties #26900

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

CLN: In DecimalArray implement aliases as properties #26900

wants to merge 2 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 17, 2019

Much safer, since _data may be assigned to. I got bitten by this.

  • tests passed (as long as it doesn't break existing)
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Much safer, since _data may be assigned to
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a tests to validate this assertion

def data(self):
return self._data

# those aliases are currently not working due to assumptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't added commented code unless its germane (its not here)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which assertion do you mean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were 2 commented out attributes. I tuned them into properties with the rest, then commented them out to not delete what the author saw fit to leave in. Not sure what to do here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is fine to keep this in comments.

But for briefness, I think it is enough to have only one property (I would take data) and then you can simply do items = data as alias, and keep the values and _values in a similar way as alias in comments.

@codecov
Copy link

codecov bot commented Jun 17, 2019

Codecov Report

Merging #26900 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26900      +/-   ##
==========================================
- Coverage   91.87%   91.86%   -0.01%     
==========================================
  Files         180      180              
  Lines       50712    50712              
==========================================
- Hits        46590    46586       -4     
- Misses       4122     4126       +4
Flag Coverage Δ
#multiple 90.45% <ø> (ø) ⬆️
#single 41.12% <ø> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa77c3...93972dc. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 17, 2019

Codecov Report

Merging #26900 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26900      +/-   ##
==========================================
- Coverage   91.87%   91.86%   -0.02%     
==========================================
  Files         180      180              
  Lines       50743    50712      -31     
==========================================
- Hits        46620    46586      -34     
- Misses       4123     4126       +3
Flag Coverage Δ
#multiple 90.45% <ø> (-0.01%) ⬇️
#single 41.12% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/io/formats/printing.py 85.56% <0%> (-1.17%) ⬇️
pandas/util/_test_decorators.py 93.22% <0%> (-0.44%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.13%) ⬇️
pandas/core/indexes/multi.py 95.67% <0%> (-0.07%) ⬇️
pandas/core/series.py 93.64% <0%> (-0.02%) ⬇️
pandas/core/generic.py 94.2% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.93% <0%> (ø) ⬆️
pandas/core/indexes/interval.py 96.44% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.71% <0%> (ø) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47947a...914840a. Read the comment docs.

@ghost ghost mentioned this pull request Jun 17, 2019
5 tasks
@jorisvandenbossche
Copy link
Member

Much safer, since _data may be assigned to. I got bitten by this.

What do you mean with "got bitten by this" ?

This is only a test array, so it might be that it is actually good that those are writable, to make sure we don't mess with such attributes.

@ghost
Copy link
Author

ghost commented Jun 19, 2019

What do you mean with "got bitten by this" ?

I added a round method and test to DecimalArray, and this made a bug possible.

This is only a test array, so it might be that it is actually good that those are writable,
to make sure we don't mess with such attributes.

I don't understand. if they are writable of course you can mess with them. But if they were properties (this PR), you can't do anything to them, and if you tried, there would be an exception).

Can't see how this helps testing, on the contrary. It's a bad pattern, and EA authors who look for examples or a related EA from which to build on may pick this up and gain bad habits.

@jorisvandenbossche
Copy link
Member

What do you mean with "got bitten by this" ?

I added a round method and test to DecimalArray, and this made a bug possible.

Sorry, that's still too cryptic. How did that made a bug possible by having those writable attributes?

It's a bad pattern, and EA authors who look for examples or a related EA from which to build on may pick this up and gain bad habits.

Then let's make the comment around it more clear about that?

We mainly care about having those names on the array, less so whether they are attributes or properties.
But by making them properties, we given even more attention to it IMO, as they take more space:)

@ghost
Copy link
Author

ghost commented Jun 19, 2019

Sorry, that's still too cryptic. How did that made a bug possible by having those writable attributes?

My implementation of round return a result by calling self.copy(), and then assigning the calculated result to the _data attribute of the copy, then returning it. The data attribute then still pointed at the old values, and when data was used elsewhere it returned stale data.

I should have used a constructor (and changed it), but if data had been a true alias for _data, and not just a different attribute initialized to the same value, that bug could not have happened. That's why I claim it's a bad pattern.

Does that clarify things?

@ghost
Copy link
Author

ghost commented Jun 19, 2019

Then let's make the comment around it more clear about that?

Sure. But from the comments I've been getting here it seems like in my attempts I either fail to make myself clear, or that I come across as too verbose. I'd love to be both succinct and clear, but it turns out to be quite a challenge for me personally :).

@jorisvandenbossche
Copy link
Member

My implementation of round return a result by calling self.copy(), and then assigning the calculated result to the _data attribute of the copy, then returning it. The data attribute then still pointed at the old values, and when data was used elsewhere it returned stale data.

Ah, now I understand! I overloooked that behaviour of python (I didn't think about the fact that if you overwrite .data, it would not update automatically .data and ._items). Yes, that's a good reason to make those properties.

The data attribute then still pointed at the old values, and when data was used elsewhere it returned stale data.

That's actually a "bug" (or at least wrong usage) you discovered, as it was the idea that it is _data that should be used consistently throughout the DecimalArray implementation for the internal data. So you can change that occurence of .data to ._data in _reduce.

I'd love to be both succinct and clear, but it turns out to be quite a challenge for me personally :).

That's not for you personally. That's a challenge for all of us (and my replies also often result in longer responses than I initially wanted. But I personally try to err on the clear but verbose side than brief but not clear enough).

@jorisvandenbossche
Copy link
Member

Reminder: we prefer you do not force push :)

self._dtype = DecimalDtype(context)

# Certain pandas function currently refer to these attributes
# expecting them to return a numpy array of values. here we
# just treat them as synonyms for `self._data`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, so some historical note (not that you should have known, so no blame!): pandas relied in several places on doing things like getattr(object, 'values') to ensure to get the underlying ndarray values of an object if it was not yet an ndarray (eg where the object could have been an Index or Series). That's something we don't really like and are trying to get more structure in (but also for us, the pandas code is huge and complex). But so those attributes we added here are rather to ensure that we are not using those wrongly. As from pandas code, we should never directly access the "underlying data" of an EA; all interaction with the EA should go through the official documented interface.

So those attributes here were added in order to have an EA exercise the tests that has some additional attributes that are not part of the official interface, to make sure they don't accidentally mess up some functionality covered in the tests.

And since values and _values are still commented out, it seems we were (at the time) not yet sure that it would work, or it probably did not yet work. It would be good (but not here in this PR) to at some point try to fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See eg #20735 as one example of this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I see. Not brief, but clear!

Copy link
Author

@ghost ghost Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this PR is trying to correct something which was put there as a canary, to catch these bugs? that still doesn't make sense. If you want to catch accesses to these, this is a bad way to do it, since you're relying on the access to have an effect you test for. It won't catch reads using the wrong attribute name for example.

Why not make them properties which throw when accessed?

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to to ensure pandas does not try to "access" the value (which this tests are a way to do), but I agree that "setting" the value (which we also don't want pandas to do) can throw an error. So converting to the property instead of attribute is indeed fine.

Copy link
Author

@ghost ghost Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could the original approach catch reads? OTOH, converting to properties will completely hide the problem even of writes, unless the properties are boobytrapped.

Maybe I still don't understand what you were trying to accomplish.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: uncomment the values and _values aliasing, run the decimals tests, and see if any test fail.

Tests will be failing, because pandas is assuming it can do getattr(obj, 'values') on some object and expect that it extracts the data of eg a Series, but not the private data of an EA. See #20735 for an example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. and what about data and _items which weren't commented out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the tests are working, pandas does not call those (anymore). Or at least not in a way that gives incorrect results.
But we added them here to have that tested.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the failing test due to _values, and why it's difficult to figure out the cases. This PR is kind of mired in that purgatory, so I think I'll just close. Ok?

@ghost ghost closed this Jun 21, 2019
@ghost ghost deleted the clean_decimal branch July 31, 2019 16:19
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants