Use an aligned allocator for NumPy? #5312

sturlamolden · 2014-11-26T08:17:36Z

Regarding the f2py regression in NumPy 1.9 with failures on 32-bit Windows, the question is whether NumPy should start to use an allocator which gives guaranteed alignment.

scipy/scipy#4168

sturlamolden · 2014-11-26T08:25:52Z

Here is one example of an allocator that should work on all platforms. It is shamelessly based on this:

https://sites.google.com/site/ruslancray/lab/bookshelf/interview/ci/low-level/write-an-aligned-malloc-free-function

There are not many ways to do this and similar code is floating around on the net, so extending it in this way is probably OK. (And besides, it does not implement realloc.)

Dropping this code into numpy/core/include/numpy/ndarraytypes.h should ensure that freshly allocated ndarrays are properly aligned on all platforms.

This platform-independent code could possibly be replaced with posix_memalign() on POSIX and _aligned_malloc() on Windows. However, combining posix_memalign() with realloc() is not possible, so implementing it ourselves is probably better.

#define NPY_MEMALIGN 32   /* 16 for SSE2, 32 for AVX, 64 for Xeon Phi */ 

static NPY_INLINE
void *PyArray_realloc(void *p, size_t n)
{
    void *p1, **p2, *base;
    size_t old_offs, offs = NPY_MEMALIGN - 1 + sizeof(void*);    
    if (NPY_UNLIKELY(p != NULL)) {
        base = *(((void**)p)-1);
        if (NPY_UNLIKELY((p1 = PyMem_Realloc(base,n+offs)) == NULL)) return NULL;
        if (NPY_LIKELY(p1 == base)) return p;
        p2 = (void**)(((Py_uintptr_t)(p1)+offs) & ~(NPY_MEMALIGN-1));
        old_offs = (size_t)((Py_uintptr_t)p - (Py_uintptr_t)base);
        memmove(p2,(char*)p1+old_offs,n);    
    } else {
        if (NPY_UNLIKELY((p1 = PyMem_Malloc(n + offs)) == NULL)) return NULL;
        p2 = (void**)(((Py_uintptr_t)(p1)+offs) & ~(NPY_MEMALIGN-1));   
    }
    *(p2-1) = p1;
    return (void*)p2;
}    

static NPY_INLINE
void *PyArray_malloc(size_t n)
{
    return PyArray_realloc(NULL, n);
}

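static NPY_INLINE
void *PyArray_calloc(size_t n, size_t s)
{
    void *p;
    /* Refuse requests whose byte count n*s would overflow size_t. */
    if (NPY_UNLIKELY(s != 0 && n > ((size_t)-1) / s)) return NULL;
    if (NPY_UNLIKELY((p = PyArray_realloc(NULL,n*s)) == NULL)) return NULL;
    memset(p, 0, n*s);
    return p;
}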

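static NPY_INLINE
void PyArray_free(void *p)
{
    void *base;
    if (p == NULL) return;    /* like free(NULL): nothing to release */
    base = *(((void**)p)-1);  /* base pointer stored just below p */
    PyMem_Free(base);
}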
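
For readers who want to try the over-allocate-and-store-the-base-pointer technique outside NumPy, here is a minimal standalone sketch. It assumes plain malloc/free in place of PyMem_Malloc/PyMem_Free, takes the alignment as a parameter instead of NPY_MEMALIGN, and leaves out realloc; the function names are hypothetical and are not part of NumPy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns n bytes aligned to `align`, or NULL. */
static void *aligned_malloc_sketch(size_t n, size_t align)
{
    /* The mask trick needs a power of two; the stored pointer needs
       align >= sizeof(void *) to be itself properly aligned. */
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    /* Worst-case padding plus room for one stored base pointer. */
    size_t offs = align - 1 + sizeof(void *);
    if (n > SIZE_MAX - offs) return NULL;   /* n + offs would overflow */

    void *base = malloc(n + offs);
    if (base == NULL) return NULL;

    /* Round (base + offs) down to a multiple of align. The result is
       at least sizeof(void *) bytes past base. */
    uintptr_t addr = ((uintptr_t)base + offs) & ~(uintptr_t)(align - 1);
    void **p = (void **)addr;
    p[-1] = base;                           /* remember what to free */
    return (void *)p;
}

/* Hypothetical helper: releases a block from aligned_malloc_sketch. */
static void aligned_free_sketch(void *p)
{
    if (p != NULL) free(((void **)p)[-1]);
}
```

Calling aligned_malloc_sketch(100, 32) should return a pointer whose address is a multiple of 32, and that pointer must be released with aligned_free_sketch, never with free directly.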

juliantaylor · 2014-11-26T09:10:34Z

I already have a branch that adds an aligned allocator; I'll dig it out.

By using something like this we throw away the option to use Python's tracemalloc framework and sparse memory (there is no aligned_calloc).
@njsmith would you be willing to engage with the Python devs again to add yet another allocator to their slot before 3.5 is released? They already added calloc just for us; it would be a shame if we now couldn't use it.

sturlamolden · 2014-11-26T09:27:25Z

Presumably one could pass in alignment in the context data of PyMemAllocatorEx? But NumPy has to support Python versions from 2.6 and up, so doing this in Python 3.5 might not solve the problem.
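
The idea of carrying the alignment in an allocator's context pointer can be sketched as follows. To keep the example compilable without Python.h, it uses a hypothetical stand-in struct with a ctx field and two callbacks; it is only shaped like an allocator table with a context pointer and is not CPython's actual PyMemAllocatorEx definition. All names here are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator table: ctx is handed back to every callback. */
typedef struct {
    void *ctx;
    void *(*alloc_fn)(void *ctx, size_t size);
    void (*free_fn)(void *ctx, void *ptr);
} allocator_sketch;

/* Hypothetical context payload: the alignment the caller wants. */
typedef struct {
    size_t alignment;
} align_ctx;

static void *ctx_aligned_malloc(void *ctx, size_t size)
{
    size_t align = ((const align_ctx *)ctx)->alignment;
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    /* Same over-allocation scheme as earlier in the thread. */
    size_t offs = align - 1 + sizeof(void *);
    if (size > SIZE_MAX - offs) return NULL;
    void *base = malloc(size + offs);
    if (base == NULL) return NULL;

    void **p = (void **)(((uintptr_t)base + offs) & ~(uintptr_t)(align - 1));
    p[-1] = base;
    return (void *)p;
}

static void ctx_aligned_free(void *ctx, void *ptr)
{
    (void)ctx;  /* the stored base pointer is all that is needed */
    if (ptr != NULL) free(((void **)ptr)[-1]);
}
```

A caller would fill in an align_ctx with, say, 64, point the table's ctx at it, and then allocate through the table; the alignment travels with the table rather than being a compile-time constant.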

njsmith · 2014-11-26T11:13:05Z

I do think engaging with the python devs on this before 3.5 is a good idea,
but I still am not convinced we have a good reason to use an aligned
allocator in the near term. It cannot possibly be the case that struct {
double, double } actually requires better-than-malloc alignment on win32 or
SPARC, because if that were true then nothing would work.
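
The alignment argument above can be checked on a given platform with a small C11 sketch. It compares the alignment the compiler requires for a struct of two doubles with that of max_align_t, the type whose alignment malloc must at least satisfy for fundamental types. The struct and function names are hypothetical, and the code assumes a C11 compiler that provides max_align_t in stddef.h.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical name for the struct { double, double } in the comment. */
struct double_pair {
    double re;
    double im;
};

/* Alignment the compiler requires for the struct. */
static size_t pair_alignment(void)
{
    return alignof(struct double_pair);
}

/* Alignment that malloc must provide for any fundamental type. */
static size_t malloc_min_alignment(void)
{
    size_t a = alignof(max_align_t);
    assert(a != 0 && (a & (a - 1)) == 0);  /* alignments are powers of two */
    return a;
}
```

On a conforming implementation pair_alignment() should never exceed malloc_min_alignment(). Whether a Fortran compiler or a SIMD code path wants more than that is a separate question.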

sturlamolden · 2014-11-26T11:24:33Z

The question with regard to f2py was what alignment Fortran would need, not the minimum requirement of C. Speed is also an issue. Both indexing and SIMD work better if the data is properly aligned.

pv · 2014-11-26T11:43:00Z

A reason for using an aligned allocator could indeed be speed, and ensuring SSE/AVX
compatibility would remove the numerical jitter that comes from taking different
code paths for differently aligned data.

f2py is older than the ISO C binding standard in Fortran, and the way it
works is essentially the de facto standard way of interfacing Fortran
with C, used extensively by everyone. In light of this experience, it's
clear that the alignment provided by the system malloc is sufficient for the
Fortran compilers that matter in practice for us.
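
The "numerical jitter" mentioned above comes from floating-point addition not being associative: a vectorized loop that first handles a few leading elements one by one to reach an aligned address adds the same numbers in a different order depending on where the buffer starts. A minimal sketch, assuming a hypothetical reduction with four interleaved partial sums standing in for SIMD lanes (this is not NumPy's actual summation code):

```c
#include <assert.h>
#include <stddef.h>

/* Sum x[0..n): `peel` leading elements are added one at a time, then
   four interleaved partial sums play the role of SIMD lanes, then any
   leftover tail is added one at a time. */
static float sum_with_peel(const float *x, size_t n, size_t peel)
{
    assert(x != NULL || n == 0);
    if (peel > n) peel = n;

    float head = 0.0f;
    size_t i = 0;
    for (; i < peel; ++i) head += x[i];

    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (; i + 4 <= n; i += 4) {
        lane[0] += x[i];
        lane[1] += x[i + 1];
        lane[2] += x[i + 2];
        lane[3] += x[i + 3];
    }

    float total = head + ((lane[0] + lane[1]) + (lane[2] + lane[3]));
    for (; i < n; ++i) total += x[i];
    return total;
}
```

For inputs whose partial sums are all exactly representable (small integers, for example) every peel length gives the same result. For values such as 0.1f the last bits of the result may differ between peel = 0 and peel = 1, which is the effect a fixed alignment would remove.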

pitrou · 2014-12-05T15:51:06Z

Note that 32-byte alignment is recommended for AVX by Intel: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors

sturlamolden · 2014-12-05T16:44:05Z

@pitrou And 64 byte alignment is recommended for Xeon Phi. Take a look at the comment behind the definition of NPY_MEMALIGN in my code example.

njsmith · 2014-12-05T17:17:16Z

The main complication in providing aligned allocation is that ATM we can
either hook into the tracemalloc infrastructure xor do aligned allocation,
and fixing this will require some coordination with CPython upstream (see
#4663).

pitrou · 2014-12-05T18:27:45Z

So the CPython issue is at http://bugs.python.org/issue18835.

pitrou · 2015-01-15T13:43:23Z

Given the complications with realloc(), it might not be realistic to expect CPython to solve this in the 3.5 timeframe. Numpy should perhaps use its own aligned allocator wrapper instead (which should be able to defer to the PyMem API, and take advantage of tracemalloc, anyway).

sturlamolden · 2015-01-15T16:13:06Z

Code for such an allocator is included above. I don't understand @juliantaylor's argument, but he probably understands this better than me.

I can understand what he meant about calloc though. A calloc is not simply a malloc and a memset to zero. A memset will require the OS to fetch the pages before they are needed. AFAIK there is no PyMem_Calloc.
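
The difference described here can be made concrete with two ways of obtaining zeroed memory. This is a minimal sketch with hypothetical function names, using the C library directly. Whether calloc actually avoids touching pages depends on the C library and the operating system, so the comments describe what an implementation may do, not a guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Zeroed memory straight from calloc. For large requests the allocator
   may return pages the OS already guarantees to be zero, without
   writing to them. */
static void *zeroed_via_calloc(size_t n)
{
    assert(n > 0);
    return calloc(n, 1);
}

/* Zeroed memory via malloc + memset. The memset writes every byte, so
   every page of the block is touched immediately. */
static void *zeroed_via_memset(size_t n)
{
    assert(n > 0);
    void *p = malloc(n);
    if (p != NULL) memset(p, 0, n);
    return p;
}
```

Both functions return buffers with identical contents; the possible difference is only in when physical pages get committed. An aligned calloc built as "aligned malloc, then memset" behaves like the second function.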

pitrou · 2015-01-15T16:47:40Z

Actually CPython 3.5 has PyMem_Calloc and friends.
I think @juliantaylor was considering the implementation case of using OS functions (posix_memalign, etc.). But that doesn't sound necessary.

By the way @sturlamolden, your snippet redefines PyArray_Malloc and friends, but array allocation seems to use PyDataMem_NEW. Am I misunderstanding something?

pitrou · 2015-01-15T16:48:38Z

Another thought is that aligned allocation may be wasteful for small arrays. Perhaps there should be a threshold below which standard allocation is used?
Also, should the alignment be configurable?
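
To put a number on "wasteful for small arrays": with the over-allocation scheme posted earlier in the thread, each block costs up to alignment - 1 bytes of padding plus one stored pointer. A minimal sketch of that arithmetic and of a size threshold follows; the function names and the threshold policy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case extra bytes per block for the over-allocation scheme:
   padding up to the next boundary plus one stored base pointer. */
static size_t aligned_overhead(size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return align - 1 + sizeof(void *);
}

/* Hypothetical policy: only pay for alignment when the request is at
   least `threshold` bytes. */
static int should_align(size_t nbytes, size_t threshold)
{
    return nbytes >= threshold;
}
```

With 32-byte alignment and 8-byte pointers the overhead is 39 bytes, which is several times the payload of an 8-byte array but negligible for a megabyte buffer. A threshold also means the matching free has to know which path a given block came from.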

sturlamolden · 2015-01-15T19:47:59Z

The allocators are called PyArray_malloc and PyArray_free in NumPy 1.9. A lot has changed in NumPy 1.10.

pitrou · 2015-01-15T19:50:21Z

Are you sure? PyArray_NewFromDescr_int() calls npy_alloc_cache() and npy_alloc_cache() calls PyDataMem_NEW().

njsmith · 2015-01-15T19:52:09Z

Numpy has multiple allocation interfaces, and they don't have very obvious
names. PyArray_malloc/free are used for "regular" allocations (e.g. object
structs). Data buffers (ndarray ->data pointers, temporary buffers inside
ufuncs, etc.), however, are allocated via PyDataMem_NEW.

sturlamolden · 2015-01-15T19:59:10Z

It seems PyDataMem_NEW calls malloc in NumPy 1.9.
https://github.com/numpy/numpy/blob/3975e095013119cfdbb9405ca95e6c723eb862d3/numpy/core/src/multiarray/alloc.c

charris · 2015-01-15T20:44:41Z

@njsmith Yeah, we should rationalize the allocation macros some day... I'd start with the one used to allocate dimensions for ndarray (IIRC).

Introduce two new functions, get_data_alignment() and set_data_alignment() which allow setting the guaranteed alignment at runtime.

pitrou · 2015-01-16T19:15:31Z

I created PR #5457 with a patch. Feedback on the approach would be nice.
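
The commit referenced from this thread describes two functions, get_data_alignment() and set_data_alignment(), for setting the guaranteed alignment at runtime. A minimal C sketch of what such a getter/setter pair could look like is below. The names, the default value, the validation rule, and the return convention are all assumptions for illustration, and this is not the code from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process-wide setting; 16 is an assumed default. */
static size_t g_data_alignment = 16;

/* Returns the alignment currently guaranteed for new data buffers. */
static size_t get_data_alignment_sketch(void)
{
    assert((g_data_alignment & (g_data_alignment - 1)) == 0);
    return g_data_alignment;
}

/* Sets the guaranteed alignment. Returns 0 on success, or -1 if
   `align` is not a power of two of at least sizeof(void *). */
static int set_data_alignment_sketch(size_t align)
{
    if (align < sizeof(void *) || (align & (align - 1)) != 0) return -1;
    g_data_alignment = align;
    return 0;
}
```

The sketch ignores thread safety. Changing the value affects only later allocations, so blocks allocated under the old setting must still be freed correctly.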

njsmith · 2015-01-16T20:14:45Z

As far as I know there is currently no benefit to using an aligned
allocator in numpy?

pitrou · 2015-01-16T20:16:26Z

With Numba we determined that AVX vector instructions required a 32-byte alignment for optimal performance. If you compile Numpy with AVX enabled (requires specific compiler options, I guess), alignment should make a difference too.

njsmith · 2015-01-16T20:51:02Z

Out of curiosity, do you have any real-world measurements? I ask b/c there
are so many factors that play into these things (different overhead/speed
trade-offs at different array sizes, details of memory allocators -- which
also act differently at different array sizes -- etc.) that I find it hard
to guess whether one ends up with like a 0.5% end-to-end speedup or a 50%
end-to-end speedup or what.

juliantaylor · 2015-01-16T21:21:04Z

FWIW, on my i5-4210U I see no significant difference between 16- and 32-byte-aligned data in a simple load/add/store test: the minimum cycle count seems lower by 5%, but the median and 10th percentile are identical to within 1%.
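
A load/add/store test of this kind can be sketched as below. This is not the benchmark that was actually run; the function names, the buffer layout, and the use of clock() are assumptions, and it relies on C11 aligned_alloc, which is not available everywhere. It makes no claim about what the timings will show on any particular CPU or compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* One load, one add, one store per element. */
static void load_add_store(float *out, const float *in, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = in[i] + 1.0f;
}

/* Runs `reps` passes over buffers that start `skew` bytes past a
   64-byte boundary and returns the elapsed CPU time in seconds, or
   -1.0 if an allocation fails. skew = 0 gives 64-byte alignment;
   skew = 16 gives 16-byte but not 32-byte alignment. */
static double time_with_skew(size_t n, size_t skew, int reps)
{
    assert(n > 0 && skew < 64 && skew % sizeof(float) == 0);

    /* aligned_alloc wants a size that is a multiple of the alignment;
       this also leaves room for the skew. */
    size_t bytes = (n * sizeof(float) / 64 + 2) * 64;
    unsigned char *a = aligned_alloc(64, bytes);
    unsigned char *b = aligned_alloc(64, bytes);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    float *in = (float *)(a + skew);
    float *out = (float *)(b + skew);
    for (size_t i = 0; i < n; ++i) in[i] = (float)i;

    volatile float sink = 0.0f;  /* keeps the work from being dropped */
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        load_add_store(out, in, n);
        sink += out[0];
    }
    clock_t t1 = clock();
    (void)sink;

    free(a);
    free(b);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Comparing time_with_skew(n, 0, reps) with time_with_skew(n, 16, reps) over many runs, and looking at the minimum and median rather than a single sample, mirrors the comparison described above. A cycle counter would be more precise than clock().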

sturlamolden · 2015-01-17T01:01:04Z

Is that with AVX?

eamartin · 2017-05-06T01:24:02Z

The "Python aligned allocator" solution I suggested is a hack. I think offering alignment in the Python interfaces would be nice, but the right way to do that would be to handle alignment at the C level.

vellamike · 2017-07-14T00:08:50Z

This feature would be very helpful to me. I am using an FPGA device (Altera A10GX) where the DMA controller requires 64-byte-aligned data to be used; this speeds up my code by 40x(!!!). I suspect that @nachiket has the same problem as me. I wrote something similar to what @eamartin is using, but this is a bit of a hack.

mborgerding · 2018-05-30T12:00:01Z

I definitely encourage 64 byte alignment:

- that is the cache line size
- it is suitable for any SIMD alignment up to AVX512
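
The second point follows from alignments being powers of two: an address that is a multiple of 64 is also a multiple of 32 and 16. A minimal sketch of an alignment check, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is aligned to `align`, which must be a nonzero
   power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}
```

A buffer for which is_aligned(buf, 64) holds will also pass the check for 32 (AVX) and 16 (SSE), while the reverse is not true.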

aldanor · 2019-06-05T14:36:36Z

Here we are almost 5 years later.

Any thoughts on making this (64-byte alignment in particular) a standard feature?..

bashtage · 2019-06-05T15:47:49Z

This cython code is now in NumPy. Of course, this doesn't change the default.

hmaarrfk · 2019-06-17T20:59:43Z

My 2 cents: an aligned allocator would help when interfacing with hardware devices and kernel-level calls. These interfaces might benefit from aligning the buffers to pages.
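
Page alignment of the kind suggested here can be sketched on POSIX systems with sysconf and posix_memalign. This is a POSIX-only illustration with a hypothetical function name, not something NumPy provides; Windows would need a different call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocates n bytes starting on a page boundary. Returns NULL on
   failure; release the block with free(). */
static void *page_aligned_alloc(size_t n)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return NULL;

    void *p = NULL;
    if (posix_memalign(&p, (size_t)page, n) != 0) return NULL;

    assert(((uintptr_t)p % (uintptr_t)page) == 0);
    return p;
}
```

Because posix_memalign has no matching realloc, a buffer obtained this way has to be grown by allocating a new block and copying, which is the limitation raised at the start of the thread.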

mattip · 2019-09-27T06:54:21Z

In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

jakirkham · 2019-11-05T06:34:37Z

That would certainly be useful, @mattip. Would it be possible to access this functionality from Python as well?

sturlamolden · 2019-11-17T09:43:18Z

> In merging randomgen, we gained PyArray_realloc_aligned and friends. Should we move these routines into numpy/core/include ?

Ripped off my code, did ya? 😂

sturlamolden · 2019-11-17T10:29:18Z

It looks like I am no longer a contributor for the code I have written for NumPy 🧐:

https://github.com/numpy/numpy/blob/v1.17.2/numpy/_build_utils/src/apple_sgemv_fix.c
https://github.com/numpy/numpy/blob/v1.17.2/numpy/random/src/aligned_malloc/aligned_malloc.h

bashtage · 2019-11-17T10:39:02Z

Probably got lost at some point in the move from randomgen to numpy. I think I had a record that this was yours.

mattip · 2019-11-17T12:04:12Z

what was the original source of the code?

sturlamolden · 2019-11-17T12:14:24Z

The link is dead, but it was adapted from an aligned malloc that looked like this:

https://tianrunhe.wordpress.com/2012/04/23/aligned-malloc-in-c/

bashtage · 2019-11-17T12:14:29Z

It was a github post from Sturla. There was no original code file.

sturlamolden · 2019-11-17T12:15:06Z

#5312 (comment)

bashtage · 2019-11-17T12:15:17Z

This is for the aligned malloc.

pitrou · 2019-11-17T12:17:32Z

Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any applies :-)

rgommers · 2019-11-17T17:52:45Z

> It looks like I am no longer contributor for the code I have written for NumPy 🧐:

You are and will always be:)

> Does everyone who contributes 50 lines of code to Numpy get a dedicated copyright header? I might go through my contributions and see if any applies :-)

Nope. We try to avoid encoding such things inside the source code, since that will always be wildly incomplete and hard to maintain. We do ask people to list themselves in THANKS.txt; I'm looking at a better alternative to that, because that file often gives merge conflicts.

seberg · 2021-11-05T19:44:41Z

I am going to close the issue. Happy about a new one though! (It seems most of the discussion is simply outdated and should be re-evaluated based on Matti's work in gh-17582)

Note that, with the next release, it is now possible to write a context manager outside NumPy to swap in an aligned allocator. This alleviates the need to push it directly into NumPy and gives a chance for much clearer testing of the benefits.

jakirkham · 2021-11-05T19:47:16Z

Also there is a tracking issue with follow up items ( #20193 ). One of them is allocators with specific alignment ( #20193 (comment) )

sturlamolden mentioned this issue Nov 26, 2014

Lots of arpack test failures on windows 32 bits with numpy 1.9.1 scipy/scipy#4168

Closed

sturlamolden changed the title ~~Use an aligned allocator for NumPy~~ Use an aligned allocator for NumPy? Nov 26, 2014

cournape added the 01 - Enhancement label Nov 26, 2014

sturlamolden mentioned this issue Nov 26, 2014

TST: win32 also does not provide 16 byte alignment #4907

Merged

pitrou added a commit to pitrou/numpy that referenced this issue Jan 16, 2015

Fix numpy#5312: use an aligned allocator.

ed49339

Introduce two new functions, get_data_alignment() and set_data_alignment() which allow setting the guaranteed alignment at runtime.

yaroslavvb mentioned this issue Feb 23, 2018

4x slowdown in feed_dict in tf-nightly-gpu tensorflow/tensorflow#17233

Closed

This was referenced Oct 7, 2018

Noise().genAsGrid() Memory Leak robbmcleod/pyfastnoisesimd#18

Closed

Aligned Memory Allocation disabled on Win32 robbmcleod/pyfastnoisesimd#19

Closed

robbmcleod mentioned this issue Dec 6, 2018

Array Alignment LiberTEM/LiberTEM#188

Open

mattip mentioned this issue Jan 13, 2019

Aligned allocator for numpy (Trac #568) #1166

Closed

mattip mentioned this issue Sep 27, 2019

API: restructure and document numpy.random C-API #14604

Merged

jakirkham mentioned this issue Nov 18, 2019

Allocating aligned memory cupy/cupy#2647

Closed

ttsesm mentioned this issue Jun 20, 2020

any example showing how to use the wrapper? sampotter/python-embree#2

Open

mattip mentioned this issue Oct 6, 2020

ENH: allow using aligned memory allocation, or exposing an API for memory management #17467

Closed

seberg closed this as completed Nov 5, 2021
