Skip to content

BUG: revise string_from_pyobj/try_pyarr_from_string with respect to malloc and copy. #18759

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 28, 2021

Conversation

pearu
Copy link
Contributor

@pearu pearu commented Apr 12, 2021

Fixes #18431.

@charris
Copy link
Member

charris commented Apr 12, 2021

Note that you can simplify by using raw strings to dispense with most of the escaped characters.

In [14]: r""" "a"\t \ """                                                       
Out[14]: ' "a"\\t \\ '

Just avoid having more than three " in a row.

@charris
Copy link
Member

charris commented Apr 12, 2021

We also prefer the style

if (something) {
    foo;
}
else {
    bar;
}

Although it is hard to enforce with outside code coming in.

@pearu
Copy link
Contributor Author

pearu commented Apr 13, 2021

We also prefer the style ...

Applied the style to the changeset of this PR.

@pearu pearu requested review from charris and eric-wieser April 13, 2021 15:17
@pearu
Copy link
Contributor Author

pearu commented Apr 14, 2021

I have marked this PR as a draft because the handling of nulls is still inconsistent. For instance, the second item in the output of #15311 (comment) is unexpected: all the nulls are replaced except the last one resulting bytes object with length 4 while it should be fixed to 5.

I am starting to think that the current approach of replacing nulls with spaces is wrong. IIRC, the approach tries to fix a Fortran I/O issue where nulls are not included in the output. But this issue appears to be a non-issue: the fixed-size character arrays have the correct width, just when outputting it, null values take zero width.
Please, correct me if the reason for replacing nulls with spaces is something else.

So my plan is to use the same interpretation of nulls that numpy uses. For instance:

>>> np.array(b"\00a\00\00")
array(b'\x00a', dtype='|S4')
>>> np.array(b"\00a\00\00").tobytes()
b'\x00a\x00\x00'

The trailing nulls are not included in the repr output of the ndarray object while these exist as seen in the bytes buffer output.

This may be a BC sensitive change and requires updating f2py docs but the docs are broken anyway: #15311

@eric-wieser
Copy link
Member

I've very little experience with fortran, and have no idea why replacing the nulls with spaces is a good idea. I'll defer to your judgement here.

@seberg
Copy link
Member

seberg commented Apr 14, 2021

In general, stripping all trailing nulls seems like the safest bet to me as well (including not replacing NULLs, although I have no clue about possible backward incompatibility). Using NULL bytes inside the string is fairly brittle to begin with, so I am willing to make the bet that the combination of NULL bytes and string wrapping just isn't a thing (and if it is will fail loudly).

@pearu pearu force-pushed the gh-18431-string_from_pyobj branch from e2a3f6e to 7f5523f Compare May 10, 2021 14:46
@pearu pearu requested a review from eric-wieser May 23, 2021 20:04
@pearu
Copy link
Contributor Author

pearu commented May 26, 2021

If nobody will object, I'll merge this PR in two days in case there is no further reviewing activity.

Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks - just a few minor style comments, and a possible follow-up

@pearu
Copy link
Contributor Author

pearu commented May 26, 2021

@pearu pearu merged commit 4c93c93 into numpy:main May 28, 2021
@pearu pearu deleted the gh-18431-string_from_pyobj branch May 28, 2021 04:29
@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Jun 8, 2021

The scipy tests with the "nightly" numpy wheel are experiencing Python interpreter crashes in the code that uses the Fortran library L-BFGS-B (see scipy/scipy#14203). f2py is used to wrap that code. Running a git bisect on numpy shows that the crashes started with numpy commit 8992459, which is part of this PR. I haven't yet looked at that commit very closely, so I don't know if the problem is a numpy bug, or if the changes exposed a scipy bug. I'll look into it, but another set of eyes on the PR might find a problem that was missed in the initial review.

I created a separate issue for this: #19201

@rgommers
Copy link
Member

@pearu @melissawm I'd like to revert this PR until gh-19201. The bug is most likely in this PR; even if it's not and the bug is in SciPy then we still have a problem because leaving this in breaks the most recent SciPy release; and this has broken SciPy CI which prevents working on a release blocker.

It would also be good to use the SciPy test suite as the f2py test suite. That will prevent this kind of thing showing up in CI afterwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: F2Py Invalid read in string_from_pyobj
8 participants