-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FEA add ReadonlyArrayWrapper #20903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA add ReadonlyArrayWrapper #20903
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for setting a dedicated PR, @lorentzenchr.
Here are some remarks.
FYI @jeremiedbb @lesteve @thomasjpfan @ogrisel (but first the big big long anticipated release ...:smirk:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an.. interesting hack. I experimented with writing using:
def _write_one(NUM_TYPES[:] x):
x[0] = 100
The following test segfaults:
def test_readonly_array_wrapper_write():
x = np.arange(10).astype(np.int32)
x_readonly = create_memmap_backed_data(x)
x_readonly_warp = ReadonlyArrayWrapper(x_readonly)
_write_one(x_readonly_warp)
While using a flag works and writes to the underlying array:
def test_readonly_array_wrapper_write_flag():
x = np.arange(10).astype(np.int32)
x_readonly = x.copy()
x_readonly.flags["WRITEABLE"] = False
x_readonly_warp = ReadonlyArrayWrapper(x_readonly)
_write_one(x_readonly_warp)
assert x_readonly[0] == 100
So if we want to use this hack, we need to make sure to test with memmaped data.
It is tested for both (see tests): if readonly == "flag":
x_readonly = x.copy()
x_readonly.flags["WRITEABLE"] = False
else:
x_readonly = create_memmap_backed_data(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played with it a bit. Looks like an acceptable temporary workaround for me. I'm however a bit uncomfortable to add a new unused utility class to the codebase. There's a risk that we never follow up with PRs using it and it stays forever as unused code. I'd rather have the PR include at least one use of the new class.
I tried to use it in KMeans and it does not make the code more complex, just wrapping X at 2 places. I realized that KMeans (and MiniBatchKMeans) almost already handled the readonly case because of necessary copies before accessing cython code (the only issue was for predict / score). I ran some benchmarks and did not notice any speed change.
I directly pushed the commit to use it in KMeans here but feel free to revert if you think it hurts the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested for both (see tests):
The test checks for a Cython function that does not write to the buffer, which is good to have.
I was thinking of ways to prevents us from accidentally defining Cython functions that writes to ReadonlyWrapper
. Passing in memmapped data is the easy choice since it segfaults. In practice when reviewing PRs, we need to make sure the Cython function with ReadonlyWrapper
is tested with memmapped data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of thoughts:
- This workaround makes for a nicer developer experience since memoryviews can be passed all the way down. This enables nicer indexing and we no longer need to do stride math.
- We have to be sure that the Cython function using plain memoryviews do not write to the buffer. In the case of KMeans, we need to depend on the common tests with memmapped data. This adds some maintenance overhead.
- The caller needs to know that the Cython function does not write to the buffer in order to safely create
ReadonlyArrayWrapper
.
Overall, I am slightly in favor (+0.25). (I like seeing memoryviews everywhere)
sklearn/cluster/_k_means_elkan.pyx
Outdated
# expecting C alinged 2D array. XXX: Can be | ||
# replaced by const memoryview when cython min | ||
# version is >= 0.3 | ||
floating[:, ::1] X, # IN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that np.ndarray[...]
can not passed here because of the gil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real advantage of this PR 😏
@thomasjpfan Regarding your concerns, I agree with you. Note, however, that this is meant as a temporary fix that can be removed with the arrival of Cython 3.0. We should use this fix CAREFULLY and only where needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The KMeans use case is convincing, the source code is more natural to follow.
The common tests should already cover the majority of the test cases that @thomasjpfan want.
+1 for merge on my side.
Interim status: +1.25. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am convinced.
Merged! Thanks @lorentzenchr (and everybody else contributing to this discussion ;) |
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Reference Issues/PRs
This contributes to #10624 and also #20567.
What does this implement/fix? Explain your changes.
This adds the class
ReadonlyArrayWrapper
(in utils). This class wraps an array/buffer/memoryviewx
, e.g.x_new = ReadonlyArrayWrapper(x)
. Even ifx
is readonly,x_new
can now be passed as argument tofunction(dype[:] x)
which should only be done, if this function does not modify its argument.This is meant as workaround for const fused type memoryviews, e.g.
function(const floating[:] x)
.Any other comments?
Should be used CAREFULLY!
Source
Originally proposed by @da-woods in #20567 (comment).