Skip to content

WIP, MAINT: scalar float str match array #28696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

tylerjereddy
Copy link
Contributor

@tylerjereddy tylerjereddy commented Apr 12, 2025

  • Fixes BUG: Floating point printing / string conversion is incorrect #28679, but mostly an initial demo for now...

  • This patch sends scalar floating point string representations through the exact dragon representation path rather than the unique path for the reasons outlined in the matching ticket. It does not concern itself with whether the final integer.0 is printed or excluded via integer.. The main argument in favor is that it is pretty weird to follow a different path for 0-D arrays vs. multi-element arrays. The main argument against is likely the potential shenanigans this kind of large-scale reprsentation change could entail downstream if folks already have shims around it and if the rounded representations are not truly bugs but just different design decisions. Also, the impact on round-tripping behaviors in some cases looks like it may cause quite some degree of annoyance for us based on remaining local test failures for this branch.

  • A hypothesis test was added that fails before and passes after the switch to the exact scalar float representations. hypothesis was chosen because the idiosyncrasies of asymmetric IEEE floating point number lines are well known and trying to test them with a few hard-coded manual cases seems like a recipe for missing something.

  • I think the NumPy team tends not to like partial fixes, so if we do this we may also want to apply the same decision to scientific notation representations, which I have not touched here yet, but the changes here should suffice to provide enough of a preview for decision making on the matter.

  • The stringdtype testing changes are rather egregious, so that's probably enough of a sample of potential drawbacks to weigh the design considerations for feedback. More gruesome changes would likely be needed for other tests based on local experiments--I see some other failures related to array/scalar/string round-tripping guarantees that would need careful thought and that are broken by this set of adjustments.

[ci skip] [skip ci]

* Fixes numpygh-28679, but mostly an initial demo for now...

* This patch sends scalar floating point string representations
through the exact dragon representation path rather than the unique
path for the reasons outlined in the matching ticket. It does not
concern itself with whether the final `integer.0` is printed or excluded
via `integer.`. The main argument in favor is that it is pretty weird
to follow a different path for 0-D arrays vs. multi-element arrays.
The main argument against is likely the potential shenanigans this
kind of large-scale reprsentation change could entail downstream
if folks already have shims around it and if the rounded representations
are not truly bugs but just different design decisions.

* A `hypothesis` test was added that fails before and passes after
the switch to the exact scalar flot representations. `hypothesis`
was chosen because the idiosyncrasies of asymmetric IEEE floating
point number lines are well known and trying to test them with
a few hard-coded manual cases seems like a recipe for missing something.

* I think the NumPy team tends not to like partial fixes, so if we do
this we may also want to apply the same decision to scientific
notation representations, which I have not touched here yet, but the
changes here should suffice to provide enough of a preview for decision
making on the matter.

* The `stringdtype` testing changes are rather egregious, so that's
probably enough of a sample of potential drawbacks to weigh the
design considerations for feedback. More gruesome changes would
likely be needed for other tests based on local experiments--I see
some other failures related to array/scalar/string round-tripping
guarantees that would need careful thought and that are broken
by this set of adjustments.

[ci skip] [skip ci]
("csingle",
"(0.100000001490116119384765625+0.100000001490116119384765625j)"),
("cdouble",
"(0.1000000000000000055511151231257827021181583404541015625+0.1000000000000000055511151231257827021181583404541015625j)"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@f2013519

Here is an example of the kinds of shenanigans the NumPy team would have to deal with if we just naively switch to DigitMode_Exact as I've done in the demo here.

I believe the shims needed for certain round tripping behaviors that were previously solid may also be rather messy.

Not saying the team won't make changes, but at first glance.. yikes!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make it a more surgical fix to use the exact representation only when,

abs(x) > flintmax(precision)

This means we will be dealing only with integers. And so the exact representation will contain the same number of digits as the rounded one. That would fix the cases I mentioned without affecting these test cases.

I would agree the larger hammer probably does not make sense. Most people do not want to see endless digits on their screen :). The current algorithm does well for these scenarios.

@@ -705,7 +705,7 @@ format_@name@(@type@ val, npy_bool scientific,
}
else {
return Dragon4_Positional_@Name@(&val,
DigitMode_Unique, CutoffMode_TotalLength, precision,
DigitMode_Exact, CutoffMode_TotalLength, precision,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scientific notation is just above this, so would be equally easy to change.

But... the testsuite shims currently added in this PR are just a small sample of the tradeoffs we'd have to consider, so I won't invest more time here until that's all considered and weighed/balanced.

* Guard exact float printing for scalars to
trigger only in the "integer" case.

* Revert some testing shims that are no longer needed;
a single shim in `test_stringdtype.py` is still needed
for half precision.
if typename == "float16":
assert sres[0] == "0.0999755859375"
else:
assert sres[0] == "0.1"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@f2013519 interesting idea--I went with if (val == (long long)val) to guard the "exact" prints to "exact" integers, and it almost worked--this was the only shim I needed in the entire testsuite for things to pass locally. That probably means there is something I'm not quite getting--the StringDtype business is probably something Nathan can comment on though.

I'll go ahead and flush the CI test matrix here to see if there are any other issues reviewers would want to consider (32-bit, other platforms/architectures, etc.). Something tells me CI will find some other failures, but we'll have to wait and see.

@tylerjereddy
Copy link
Contributor Author

Yeah, looks like CI is finding just 1 failure so far, for the new hypothesis test I added, for this case:

Falsifying example: test_gh_28679(
    dt_val=(dtype('<f4'), 99999996),
)

So, the code obviously isn't perfect yet, but the extent of the testsuite failures is actually kind of minor with the latest shim. I should move on for now--this at least provides a preview with a small diff and an example of the remaining challenges to deal with.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented in #28679 (comment), I'm not sure it really is better to typeset exactly, as it gives a false sense of precision. But maybe best to discuss further in #28679 - in the meantime, it is very helpful to see what a possible solution would entail.

@tylerjereddy
Copy link
Contributor Author

I think this PR has served its purpose of prototyping something and catalyzing a discussion.

gh-28703 appears to now be the preferred solution, and is also from a first-time contributor, which is even better, so let's close this and focus on that. I've repurposed the property based test drafted here to aid in the review process over there re: scientific notation transition points for arrays and scalars.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Floating point printing / string conversion is incorrect
3 participants