Use an aligned allocator for NumPy? #5312
Comments
Here is one example of an allocator that should work on all platforms. It is shamelessly based on this:
There are not many ways to do this, and similar code is floating around on the net, so extending it in this way is probably OK. (And besides, it does not implement realloc.)
Dropping this code into
This platform-independent code could possibly be replaced with

#define NPY_MEMALIGN 32 /* 16 for SSE2, 32 for AVX, 64 for Xeon Phi */

static NPY_INLINE
void *PyArray_realloc(void *p, size_t n)
{
    void *p1, **p2, *base;
    /* worst-case padding: alignment slack plus room for the base pointer */
    size_t old_offs, offs = NPY_MEMALIGN - 1 + sizeof(void*);
    if (NPY_UNLIKELY(p != NULL)) {
        /* the pointer returned by PyMem_* is stored just below p */
        base = *(((void**)p)-1);
        if (NPY_UNLIKELY((p1 = PyMem_Realloc(base, n + offs)) == NULL)) return NULL;
        /* block was not moved: p is still valid and still aligned */
        if (NPY_LIKELY(p1 == base)) return p;
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
        old_offs = (size_t)((Py_uintptr_t)p - (Py_uintptr_t)base);
        /* the data kept its old offset in the new block; shift it to the new aligned address */
        memmove(p2, (char*)p1 + old_offs, n);
    } else {
        if (NPY_UNLIKELY((p1 = PyMem_Malloc(n + offs)) == NULL)) return NULL;
        p2 = (void**)(((Py_uintptr_t)(p1) + offs) & ~(NPY_MEMALIGN - 1));
    }
    /* remember the base pointer so that realloc and free can find it */
    *(p2-1) = p1;
    return (void*)p2;
}

static NPY_INLINE
void *PyArray_malloc(size_t n)
{
    return PyArray_realloc(NULL, n);
}

static NPY_INLINE
void *PyArray_calloc(size_t n, size_t s)
{
    void *p;
    /* refuse requests where n*s would overflow size_t */
    if (NPY_UNLIKELY(s != 0 && n > ((size_t)-1) / s)) return NULL;
    if (NPY_UNLIKELY((p = PyArray_realloc(NULL, n*s)) == NULL)) return NULL;
    memset(p, 0, n*s);
    return p;
}

static NPY_INLINE
void PyArray_free(void *p)
{
    void *base;
    /* like PyMem_Free, accept NULL instead of dereferencing it */
    if (p == NULL) return;
    base = *(((void**)p)-1);
    PyMem_Free(base);
} |
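For readers who want to try the pointer arithmetic outside of NumPy, here is a minimal sketch of the same over-allocate-and-stash technique built on plain malloc/free rather than the PyMem_* functions. The names demo_aligned_malloc, demo_aligned_free and DEMO_ALIGN are hypothetical and are not part of NumPy or CPython.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ALIGN 32  /* hypothetical value; must be a power of two */

/* Over-allocate, round down from (base + offs) to a DEMO_ALIGN boundary,
 * and stash the pointer returned by malloc() just below the aligned block. */
static void *demo_aligned_malloc(size_t n)
{
    const size_t offs = DEMO_ALIGN - 1 + sizeof(void *);
    void *base, **aligned;

    if (n > SIZE_MAX - offs) return NULL;   /* n + offs would overflow */
    base = malloc(n + offs);
    if (base == NULL) return NULL;
    aligned = (void **)(((uintptr_t)base + offs) & ~(uintptr_t)(DEMO_ALIGN - 1));
    aligned[-1] = base;                     /* read back by demo_aligned_free */
    return aligned;
}

static void demo_aligned_free(void *p)
{
    if (p != NULL) free(((void **)p)[-1]);
}
```

Because the aligned address is at least sizeof(void *) bytes above the base, the stashed pointer always lies inside the allocation, and the free routine never needs to know which alignment was used.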
I already have a branch that adds an aligned allocator; I'll dig it out. By using something like this we throw away the option to use Python's tracemalloc framework and sparse memory (there is no aligned_calloc). |
Presumably one could pass in alignment in the context data of |
I do think engaging with the python devs on this before 3.5 is a good idea,
|
The question with regard to f2py was what alignment Fortran would need, not the minimum requirement of C. Speed is also an issue. Both indexing and SIMD work better if the data is properly aligned. |
A reason for using an aligned allocator could indeed be speed, and ensuring SSE/AVX |
Note that a 32-byte alignment is recommended for AVX by Intel: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors |
@pitrou And 64 byte alignment is recommended for Xeon Phi. Take a look at the comment behind the definition of |
The main complication on providing aligned allocation is that ATM we can
|
So the CPython issue is at http://bugs.python.org/issue18835. |
Given the complications with realloc(), it might not be realistic to expect CPython to solve this in the 3.5 timeframe. Numpy should perhaps use its own aligned allocated wrapper instead (which should be able to defer to the PyMem API, and take advantage of tracemalloc, anyway). |
Code for such an allocator is included above. I don't understand @juliantaylor's argument, but he probably understands this better than I do. I can understand what he meant about calloc, though. A calloc is not simply a malloc and a memset to zero. A memset will require the OS to fetch the pages before they are needed. AFAIK there is no PyMem_Calloc. |
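To illustrate the calloc point, the sketch below builds a zeroed, aligned allocation on top of calloc() instead of malloc() followed by memset(). The names demo_aligned_calloc, demo_aligned_free and DEMO_ALIGN are hypothetical. Whether the pages of a large block really stay untouched until first use is an implementation detail of the C library and the operating system, so treat this as an assumption rather than a guarantee.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ALIGN 64  /* hypothetical value; must be a power of two */

/* Zeroed, aligned allocation built on calloc(). Only the single stashed
 * pointer is written here, so the allocator is not forced to touch every
 * page of the block the way an explicit memset() would. */
static void *demo_aligned_calloc(size_t count, size_t size)
{
    const size_t offs = DEMO_ALIGN - 1 + sizeof(void *);
    void *base, **aligned;

    /* reject requests where count * size + offs would overflow */
    if (size != 0 && count > (SIZE_MAX - offs) / size) return NULL;
    base = calloc(1, count * size + offs);
    if (base == NULL) return NULL;
    aligned = (void **)(((uintptr_t)base + offs) & ~(uintptr_t)(DEMO_ALIGN - 1));
    aligned[-1] = base;
    return aligned;
}

static void demo_aligned_free(void *p)
{
    if (p != NULL) free(((void **)p)[-1]);
}
```

The stashed pointer sits below the address handed to the caller, so the caller still sees an all-zero buffer.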
Actually CPython 3.5 has
By the way @sturlamolden, your snippet redefines PyArray_Malloc and friends, but array allocation seems to use PyDataMem_NEW. Am I misunderstanding something? |
Another thought is that aligned allocation may be wasteful for small arrays. Perhaps there should be a threshold below which standard allocation is used? |
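The threshold idea can be sketched as a small policy function. The cut-off of 256 bytes and the two alignment values below are hypothetical placeholders, not numbers proposed in this thread, and demo_choose_alignment and demo_padding are illustrative names only.

```c
#include <stddef.h>

#define DEMO_SMALL_THRESHOLD 256  /* bytes; hypothetical cut-off */
#define DEMO_SMALL_ALIGN      16  /* assumed baseline alignment */
#define DEMO_LARGE_ALIGN      64  /* assumed SIMD-friendly alignment */

/* Pick the alignment to request for a buffer of nbytes. */
static size_t demo_choose_alignment(size_t nbytes)
{
    return nbytes < DEMO_SMALL_THRESHOLD ? DEMO_SMALL_ALIGN : DEMO_LARGE_ALIGN;
}

/* Worst-case padding of the over-allocation scheme: align - 1 bytes of
 * slack plus one stashed base pointer. */
static size_t demo_padding(size_t align)
{
    return align - 1 + sizeof(void *);
}
```

On a platform with 8-byte pointers, a 64-byte alignment costs up to 63 + 8 = 71 bytes of padding, which more than triples the footprint of a 32-byte array. That overhead is what a threshold would avoid for small arrays.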
The allocators are called PyArray_malloc and PyArray_free in NumPy 1.9. A lot has changed in NumPy 1.10. |
Are you sure? PyArray_NewFromDescr_int() calls npy_alloc_cache() and npy_alloc_cache() calls PyDataMem_NEW(). |
Numpy has multiple allocation interfaces, and they don't have very obvious
Nathaniel J. Smith |
It seems PyDataMem_NEW calls |
@njsmith Yeah, we should rationalize the allocation macros some day... I'd start with the one used to allocate dimensions for ndarray (IIRC). |
Introduce two new functions, get_data_alignment() and set_data_alignment() which allow setting the guaranteed alignment at runtime.
I created PR #5457 with a patch. Feedback on the approach would be nice. |
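A runtime-configurable alignment could look roughly like the sketch below. This is not the code from the PR; the demo_ names, the default of 16 and the validation rule are assumptions made for illustration.

```c
#include <stddef.h>

static size_t demo_data_alignment = 16;  /* hypothetical default */

static size_t demo_get_data_alignment(void)
{
    return demo_data_alignment;
}

/* Returns 0 on success, -1 if align is not a power of two or is smaller
 * than a pointer. The rounding mask needs a power of two, and the stashed
 * base pointer needs the aligned address to be pointer-aligned. */
static int demo_set_data_alignment(size_t align)
{
    if (align < sizeof(void *) || (align & (align - 1)) != 0) return -1;
    demo_data_alignment = align;
    return 0;
}
```

One design consequence: if the alignment can change at runtime, the free routine must not derive the base address from the current setting. Storing the base pointer next to each block keeps buffers allocated under an earlier setting safe to release.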
As far as I know there is currently no benefit to using an aligned
Nathaniel J. Smith |
With Numba we determined that AVX vector instructions required a 32-byte alignment for optimal performance. If you compile Numpy with AVX enabled (requires specific compiler options, I guess), alignment should make a difference too. |
Out of curiosity, do you have any real-world measurements? I ask b/c there
Nathaniel J. Smith |
FWIW, on my i5-4210U I see no significant difference between 16-byte and 32-byte aligned data in a simple load-add-store test. The minimum cycle count seems about 5% lower, but the median and 10th percentile are identical to within 1%. |
Is that with AVX? |
The "Python aligned allocator" solution I suggested is a hack. I think offering alignment in the Python interfaces would be nice, but the right way to do that would be to handle alignment at the C level. |
This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) whose DMA controller requires 64-byte aligned data. Using it speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using, but this is a bit of a hack. |
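For C code that talks to such a device directly, a 64-byte aligned buffer can be obtained from the standard library, as in the sketch below. It assumes a C11 library that provides aligned_alloc() (MSVC, for example, does not and offers _aligned_malloc instead), and demo_dma_alloc and DEMO_DMA_ALIGN are hypothetical names. It does not change how NumPy allocates array data.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_DMA_ALIGN 64  /* alignment required by the hypothetical device */

/* Allocate nbytes with 64-byte alignment using C11 aligned_alloc(). The size
 * is rounded up to a multiple of the alignment because C11 requires that.
 * Release the buffer with plain free(). */
static void *demo_dma_alloc(size_t nbytes)
{
    size_t rounded;

    if (nbytes > SIZE_MAX - (DEMO_DMA_ALIGN - 1)) return NULL;  /* overflow */
    rounded = (nbytes + DEMO_DMA_ALIGN - 1) & ~(size_t)(DEMO_DMA_ALIGN - 1);
    return aligned_alloc(DEMO_DMA_ALIGN, rounded);
}
```

Unlike the stash-a-pointer scheme earlier in the thread, this relies on the C library to track the block, so there is no per-allocation bookkeeping in user code.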
I definitely encourage 64 byte alignment:
|
Here we are almost 5 years later. Any thoughts on making this (64-byte alignment in particular) a standard feature? |
This cython code is now in NumPy. Of course, this doesn't change the default. |
My 2 cents: an aligned allocator would help when interfacing with hardware devices and kernel-level calls. These interfaces might benefit from aligning the buffers to pages. |
In merging randomgen, we gained |
That would certainly be useful, @mattip. Would it be possible to access this functionality from Python as well? |
Ripped off my code, did ya? 😂 |
It looks like I am no longer a contributor for the code I have written for NumPy 🧐: https://github.com/numpy/numpy/blob/v1.17.2/numpy/_build_utils/src/apple_sgemv_fix.c https://github.com/numpy/numpy/blob/v1.17.2/numpy/random/src/aligned_malloc/aligned_malloc.h |
Probably got lost at some point in the move from randomgen to numpy. I think I had a record that this was yours.
|
what was the original source of the code? |
The link is dead, but it was adapted from an aligned malloc that looked like this: https://tianrunhe.wordpress.com/2012/04/23/aligned-malloc-in-c/ |
It was a github post from Sturla. There was no original code file.
|
This is for the aligned malloc.
|
Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any applies :-) |
You are and will always be:)
Nope. We try to avoid encoding such things inside the source code, since that will always be wildly incomplete and hard to maintain. We do ask people to list themselves in |
I am going to close the issue. Happy about a new one though! (It seems most of the discussion is simply outdated and should be re-evaluated based on Matti's work in gh-17582) Note that it is now – with the next release – possible to write a context manager outside NumPy to swap in an aligned allocator. Which alleviates the need to push it directly into NumPy and gives a chance for much clearer testing of the benefits. |
Also there is a tracking issue with follow up items ( #20193 ). One of them is allocators with specific alignment ( #20193 (comment) ) |
Regarding the f2py regression in NumPy 1.9 with failures on 32-bit Windows, the question is whether NumPy should start to use an allocator which gives guaranteed alignment.
scipy/scipy#4168