Skip to content

FEA add ReadonlyArrayWrapper #20903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 6, 2021
Merged

Conversation

lorentzenchr
Copy link
Member

@lorentzenchr lorentzenchr commented Aug 31, 2021

Reference Issues/PRs

This contributes to #10624 and also #20567.

What does this implement/fix? Explain your changes.

This adds the class ReadonlyArrayWrapper (in utils). This class wraps an array/buffer/memoryview x, e.g. x_new = ReadonlyArrayWrapper(x). Even if x is readonly, x_new can now be passed as argument to function(dype[:] x) which should only be done, if this function does not modify its argument.
This is meant as workaround for const fused type memoryviews, e.g. function(const floating[:] x).

Any other comments?

Should be used CAREFULLY!

Source

Originally proposed by @da-woods in #20567 (comment).

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for setting a dedicated PR, @lorentzenchr.

Here are some remarks.

@lorentzenchr
Copy link
Member Author

FYI @jeremiedbb @lesteve @thomasjpfan @ogrisel (but first the big big long anticipated release ...:smirk:)

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an.. interesting hack. I experimented with writing using:

def _write_one(NUM_TYPES[:] x):
    x[0] = 100

The following test segfaults:

def test_readonly_array_wrapper_write():
    x = np.arange(10).astype(np.int32)
    x_readonly = create_memmap_backed_data(x)

    x_readonly_warp = ReadonlyArrayWrapper(x_readonly)
    _write_one(x_readonly_warp)

While using a flag works and writes to the underlying array:

def test_readonly_array_wrapper_write_flag():

    x = np.arange(10).astype(np.int32)
    x_readonly = x.copy()
    x_readonly.flags["WRITEABLE"] = False

    x_readonly_warp = ReadonlyArrayWrapper(x_readonly)
    _write_one(x_readonly_warp)
    assert x_readonly[0] == 100

So if we want to use this hack, we need to make sure to test with memmaped data.

@lorentzenchr
Copy link
Member Author

So if we want to use this hack, we need to make sure to test with memmaped data.

It is tested for both (see tests):

if readonly == "flag":
    x_readonly = x.copy()
    x_readonly.flags["WRITEABLE"] = False
else:
    x_readonly = create_memmap_backed_data(x)

@jeremiedbb jeremiedbb self-requested a review September 1, 2021 10:05
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with it a bit. Looks like an acceptable temporary workaround for me. I'm however a bit uncomfortable to add a new unused utility class to the codebase. There's a risk that we never follow up with PRs using it and it stays forever as unused code. I'd rather have the PR include at least one use of the new class.

I tried to use it in KMeans and it does not make the code more complex, just wrapping X at 2 places. I realized that KMeans (and MiniBatchKMeans) almost already handled the readonly case because of necessary copies before accessing cython code (the only issue was for predict / score). I ran some benchmarks and did not notice any speed change.

I directly pushed the commit to use it in KMeans here but feel free to revert if you think it hurts the PR.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is tested for both (see tests):

The test checks for a Cython function that does not write to the buffer, which is good to have.

I was thinking of ways to prevents us from accidentally defining Cython functions that writes to ReadonlyWrapper. Passing in memmapped data is the easy choice since it segfaults. In practice when reviewing PRs, we need to make sure the Cython function with ReadonlyWrapper is tested with memmapped data.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of thoughts:

  1. This workaround makes for a nicer developer experience since memoryviews can be passed all the way down. This enables nicer indexing and we no longer need to do stride math.
  2. We have to be sure that the Cython function using plain memoryviews do not write to the buffer. In the case of KMeans, we need to depend on the common tests with memmapped data. This adds some maintenance overhead.
  3. The caller needs to know that the Cython function does not write to the buffer in order to safely create ReadonlyArrayWrapper.

Overall, I am slightly in favor (+0.25). (I like seeing memoryviews everywhere)

# expecting C alinged 2D array. XXX: Can be
# replaced by const memoryview when cython min
# version is >= 0.3
floating[:, ::1] X, # IN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that np.ndarray[...] can not passed here because of the gil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real advantage of this PR 😏

@lorentzenchr lorentzenchr changed the title FEA add ReadonlyWrapper FEA add ReadonlyArrayWrapper Sep 1, 2021
@lorentzenchr
Copy link
Member Author

@thomasjpfan Regarding your concerns, I agree with you. Note, however, that this is meant as a temporary fix that can be removed with the arrival of Cython 3.0. We should use this fix CAREFULLY and only where needed.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KMeans use case is convincing, the source code is more natural to follow.

The common tests should already cover the majority of the test cases that @thomasjpfan want.

+1 for merge on my side.

@lorentzenchr
Copy link
Member Author

Interim status: +1.25.
Do we get another >=+0.75?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am convinced.

@ogrisel ogrisel merged commit c0e5d1b into scikit-learn:main Sep 6, 2021
@ogrisel
Copy link
Member

ogrisel commented Sep 6, 2021

Merged! Thanks @lorentzenchr (and everybody else contributing to this discussion ;)

@lorentzenchr lorentzenchr deleted the readonly_wrapper branch September 6, 2021 16:20
adrinjalali pushed a commit that referenced this pull request Sep 7, 2021
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants