ENH: support float and longdouble in FFT using C++ pocketfft version #25711
Conversation
Ah, OK, this switches to the C++ version, but uses only its 1D transforms!
That is actually a (small or not?) issue with using the gufuncs: it may be tricky to expand to arbitrary dimensions in a nice way for those (which was mentioned to me, I admit, although I didn't realize it might be an issue any time soon). And yes, so long as this all looks fine, it seems fine to not do the N-D versions.
Indeed, I went the easy route, based on the gufunc I had just introduced (mostly to be able to allow ...).
No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only in ...
Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.
OK, that seems reasonable for follow-up, then. More generally, it may well make sense to use your iterator too, and/or copy what scipy does... I definitely went for what seemed the fastest route...
The various FFT routines in `numpy.fft` have gained an ``out``
`numpy.fft` support for different precisions and in-place calculations
----------------------------------------------------------------------
The various FFT routines in `numpy.fft` now support calculations in float,
They have always supported single precision but have upcast the result. We should be clear that this is a breaking change.
Does this sound good?
The various FFT routines in `numpy.fft` now do their calculations natively in
float, double, or long double precision, depending on the input precision,
instead of always calculating in double precision. Hence, the calculation will
now be less precise for single and more precise for long double precision.
The data type of the output array will now be adjusted accordingly.
Furthermore, all FFT routines have gained an ``out`` argument that can be used
for in-place calculations.
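To make the behavior change concrete, here is a small usage sketch (assuming a NumPy build that includes this PR; the dtypes shown are the ones the note above describes):

```python
import numpy as np

x = np.arange(8, dtype=np.float32)

y = np.fft.fft(x)
print(y.dtype)  # complex64: single-precision input is no longer upcast to double

# All routines also gained an ``out`` argument for writing the result in place:
out = np.empty(8, dtype=np.complex64)
np.fft.fft(x, out=out)
```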
Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.
What's the numpy policy on using `versionchanged`? Should that be added to the fft docstrings?
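For reference, this is roughly what such a note would look like in a docstring (an illustrative sketch only, not the actual `numpy.fft.fft` docstring or final wording):

```python
def fft(a, n=None, axis=-1, norm=None, out=None):
    """
    Compute the one-dimensional discrete Fourier Transform.

    .. versionchanged:: 2.0
        The calculation is now done in the precision of the input
        (float, double, or long double), and an ``out`` argument was added.
    """
    ...
```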
> Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.
I don't know how the numpy release notes are amalgamated, but for 2.0 it would be good to highlight all the compatibility breaks somewhere in the release notes.
Yes, it works for transforming just a single axis in a multi-D array too. The idea is to "bundle" a few of the 1D transforms into a single one of SIMD type.
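(For readers following along, a tiny sketch of the case being described here: many independent 1-D transforms along one axis of a multi-D array. The shape and dtype are only illustrative.)

```python
import numpy as np

# 64 independent length-1024 transforms: one 1-D FFT per column.
# These are the 1-D transforms that can be bundled into SIMD-width batches.
a = np.random.default_rng(0).random((1024, 64), dtype=np.float32)
A = np.fft.fft(a, axis=0)   # shape (1024, 64), dtype complex64
```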
@mreineck - OK, I'll look into using ...
Nice!
The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?). I'd leave the threading alone, we can't do much with that. In SciPy it's controlled by a ...
Probably an oversight indeed, since the SciPy licenses list has it (here). So +1 for adding it.
Adding ...

Build time wise this PR in its current state looks fine - the impact of C++ is okay overall, see the quick profiling result below. Binary size is fine too, it's ~240 kb for a dev (debugoptimized) build. A note though that the SciPy pocketfft extension is more than 4x larger, so enabling all the functionality that SciPy uses in the NumPy build is probably not the way to go (it'd be ~25% of the size of ...).

[profiling screenshot]
It will work in single-threaded mode. However, my feeling is that the plan caching isn't really worth it: it may cause large memory consumption (see, e.g., scipy/scipy#12297) for very little performance gain (unless you call very short 1D FFTs repeatedly).
(force-pushed from 30eeaad to 102ac99)
OK, I now have a version that calls the vectorized pocketfft version if possible. It still has the direct loops for the case that ...

Some remaining questions: ...
@rgommers - I would love to have a proper ...
I'm not really opposed, but it is a change, with technically only increased binary size as a penalty and no upsides for Windows and macOS arm64 users. Doing nothing and continuing to convert to ...

I'm not sure how serious the plan was to include all the pocketfft functionality that SciPy needs into NumPy, but binary size does seem to become relevant then. If it's <0.1 MB I'm not too worried, but I wouldn't really want to pay 0.25 MB or more for long double loops that no one has asked for.
For comparison, ...
Thanks for the comparison, that is helpful. Note that those are ...

I did a quick test building a wheel from this branch; the pocketfft extension is 508 kb uncompressed, 208 kb compressed in it (total wheel size 7.8 MB; ...).
OK, great! The other transforms may not be as bad as one would think, since they mostly use the same code. But anyway, not sure those are needed, and definitely not relevant here.
It's limited to core only; I thought it was a global flag.
Are we sure those flags are seen in ...? (See lines 398 to 401 in 2634f80.)
OK, I just empirically confirmed that that flag does not get passed on, and, with that, that I can catch C++ exceptions. I'll push that up in a bit.
If it's about memory allocations, I think we should use our custom allocation functions over malloc or new.
@seiko2plus - the problem is that the allocations (and possible other errors) happen inside the pocketfft code itself. My current solution is just to put everything in a `try`/`catch` block.
I'd really recommend not doing that. Most likely these custom operations are C-style and therefore will silently return NULL on error, which goes against fundamental assumptions that C++ code makes. You'd have to patch every place in pocketfft where memory is allocated and add error handling code.
p.s. Note that I now on purpose use ...
To me this looks exactly like the right solution. It's located directly at the interface, does not require any big patching, and can be extended in case more detailed error handling is required.
Not a deep review (although I doubt that is needed at this point!).
Just commenting on two things so they don't get lost (one bug, one suggestion to try to have a more C++ pattern for catching the exceptions).
Submodule + license still seem like two things that would be nice to do as well.
numpy/fft/_pocketfft_umath.cpp (Outdated)
void *buff = NULL;
if (step_out != sizeof(std::complex<T>)) {
    buff = pocketfft::detail::aligned_alloc(sizeof(std::complex<T>),
                                            nout * sizeof(std::complex<T>));
There is a small problem here: `buff` is a `void *` and not a `pocketfft::detail::arr<std::complex<T>>`, which I think means that if the `plan->exec` fails below, no cleanup will occur.
My C++ is a bit trial and error, but I think:
/* Create a temporary array to hold a buffer if op is not contiguous */
bool contiguous_op = (step_out == sizeof(std::complex<T>));
pocketfft::detail::arr<std::complex<T>> tmp_arr(contiguous_op ? 0 : nout);
should be an option (an array of length 0 doesn't do a malloc so that is fine).
And below:
std::complex<T> *op_or_buff = (contiguous_op ? (std::complex<T> *)op : tmp_arr.data());
(Clearly, the dealloc below then becomes unnecessary, which is the whole point.)
OK, that makes a lot of sense! I implemented it (though with an opposite bool, `buffered`).
Note that the C++ version is header only. We just link to upstream via a git submodule.
@seberg - thanks! Both suggestions were really nice, and make the code a lot cleaner. I pushed a revised version that uses them and that also replaces the addition of the C++ pocketfft header with a new submodule (which includes the license).
p.s. In the checklist above, this just leaves the question of whether this is tested thoroughly enough. I could adjust more tests to try different ...
p.p.s. Well, the above about tests was not actually true, but now it is, so I ticked my "more tests" item...
@@ -381,15 +389,6 @@ def test_all_1d_norm_preserving(self):
                assert_allclose(x_norm,
                                np.linalg.norm(tmp), atol=1e-6)

    @pytest.mark.parametrize("dtype", [np.half, np.single, np.double,
This one is now done much more thoroughly in `test_identity_long_short` and `test_identity_long_short_reversed`.
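(For reference, a minimal sketch of this kind of round-trip check; the test name, sizes, and tolerances here are illustrative, not the actual test code.)

```python
import numpy as np
import pytest
from numpy.testing import assert_allclose

@pytest.mark.parametrize("dtype", [np.single, np.double, np.longdouble])
def test_fft_roundtrip_keeps_precision(dtype):
    x = np.random.rand(32).astype(dtype)
    y = np.fft.ifft(np.fft.fft(x))
    assert y.real.dtype == dtype              # no upcast to double any more
    assert_allclose(y.real, x, atol=100 * np.finfo(dtype).eps)
```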
Also adjust python side to allow multiple double types.
Also make sure every part uses copy_input and copy_output consistently, and avoid compiling multithreading.
Also use internal pocketfft arr class to allocate the possible temporary buffers, so we do not have to worry about deallocating them any more.
Is there anything more to do for this? (C++ pocketfft in ...)
Not managing to concentrate very deeply on it, but those two C++ things were basically what I noticed and the rest looks fine.
Since nobody voiced other concerns, let's get this in. (It isn't vastly changed after all, since it is porting the C code.)
Thanks for this great new code!
Using the C++ version of pocketfft turned out to be not that much of a hassle. With it, we can support all the `dtype`s that it supports: float, double, and long double.
@asmeurer, @mreineck - comments are most welcome! (My C++ is non-existent, so this really is just a C file with templates thrown in, mostly using trial & error...)
fixes #17801
EDIT: some things to decide:
- Add `LICENSES.md` in the `pocketfft` directory. (The licence wasn't there for the C version, but that seems to have been an oversight.)
- ~~If that is something we should catch, please advise how to intercept~~ dealt with using `throw`/`try`/`catch`.