ENH: Configurable allocator #17582

mattip · 2020-10-19T06:56:59Z

Fixes gh-17467. Adds a public struct to hold memory manipulation routines PyDataMem_Handler and two new API functions PyDataMem_SetHandler to replace the current routines with the new ones, and PyDataMem_GetHandlerName to get the string name of the current routines (either globally or for a specific ndarray object). This also changes the size of the ndarray object to hold the PyDataMem_Handler active when it was created so subsequent actions on its data memory will remain consistent.

Tests and documentation are included. Along the way, I found some places in the code where the current policy is inconsistent (all data memory handling should have gone through npy_*_cache not PyDataMem_*) so even if this is rejected it might improve the cache handling.

The PyDataMem_Handler has fields to override memcpy, these are currently not implemented: memcpy in the code base is untouched. I think this PR is invasive enough as-is, if desired memcpy can be handled in a follow-up PR.

Once the general idea undergoes some review, this should hit the mailing list.

mattip · 2020-10-19T07:04:39Z

numpy/core/src/multiarray/getset.c

@@ -384,7 +384,8 @@ array_data_set(PyArrayObject *self, PyObject *op)
    }
    if (PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA) {
        PyArray_XDECREF(self);
-        PyDataMem_FREE(PyArray_DATA(self));


This should have been npy_free_cache

I think this raises a possible risk in these changes - as I understand it, it didn't matter if the npy_*_cache functions were used for the allocation and not the free or vice versa; everything would behave correctly, it would just affect the cache capacity.

With the changes in this PR, either:

We make the same promise to user hooks, that things you free may be deallocated by someone else or vice versa

We need to audit every free in our code (and likely, accept bug reports for a couple releases for segfaults due to ones we miss)

I suppose as long as our default allocator is happy to accept raw mallocs and frees, this won't actually result in a regression.

numpy/core/src/multiarray/item_selection.c

mattip · 2020-10-19T07:07:30Z

numpy/core/src/multiarray/methods.c

@@ -1991,7 +1991,8 @@ array_setstate(PyArrayObject *self, PyObject *args)
    }

    if ((PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA)) {
-        PyDataMem_FREE(PyArray_DATA(self));


The memory handling in this file ignored npy_*_cache

Eh, github showed my comment twice, but it was only there once...

After lots of debugging (first with valgrind, but that only showed me that size 1 (or 0) allocations were added on the cache as larger ones... And annoying, slow and careful stepping through with gdb:

At this point the size of the array is already modified, so freeing (and adding to cache) just doesn't work unless you calculate the size earlier (and maybe move the free to an earlier point). This function is a bit weird, maybe it makes sense to NULL data right away? Its likely still buggy, but at least we can't run into a double-free then.

mattip · 2020-10-19T07:08:47Z

numpy/core/tests/test_mem_policy.py

+        # so this "if" will be false, preventing infinite recursion
+        #
+        # if needed, debug this by setting extra_argv=['-vvx']
+        np.core.test(verbose=0, extra_argv=[])


This call will run all the np.core tests with the new memory routines. Since the new routines allocs but then returns a pointer shifted by 64 bytes, if the calls are mismatched glibc will segfault when freeing the wrong address.

numpy/testing/_private/extbuild.py

numpy/core/include/numpy/ndarraytypes.h

doc/source/reference/c-api/data_memory.rst

eric-wieser · 2020-10-19T13:33:59Z

doc/source/reference/c-api/data_memory.rst

+
+    .. code-block:: c
+
+        typedef struct {


It would be good to reference the builtin PyMemAllocatorEx somewhere in this section and compare the two (https://docs.python.org/3/c-api/memory.html#customize-memory-allocators).

Will add. The original npy_*_cache set of functions uses PyDataMem_* functions, which it seems preceeded these more sophisticated interfaces and directly used malloc/calloc/free.

Documented. The Python ones do not use our non-documented-but-public PyDataMem_EventHookFunc callbacks. So I could see a path where we deprecate PyDataMem_EventHookFunc and move to the Python memory management strategies, although that would mean

if someone by chance implemented a PyDataMem_EventHookFunc callback it would no longer work.

in order to override data allocations, a user would also override other random memory allocations

mattip · 2020-10-19T15:22:10Z

rebase and force-psuhed to clear merge conflict

seberg

A fairly quick pass, looks nice! And the addition of a test run helper that build test modules from strings is probably overdue :).

Additionally to the inline comments, I think I would like an NPY_USE_DEBUG_ALLOCATOR=1 test run, so that we run the full test suit with an allocator that would blow up if there are incompatibilities/code paths that were forgotten. I guess all that would be needed is allocate always 32 bytes or so extra bytes and store the shifted pointer (probably must remain aligned, or the test suit will go down in flames).

I don't know the needs of those who would like to see such an allocator well, though, so it would be interesting whether we could anticipate the API to grow.

numpy/core/src/multiarray/ctors.c

numpy/core/include/numpy/ndarraytypes.h

mattip · 2020-10-19T19:47:12Z

II guess all that would be needed is allocate always 32 bytes or so extra bytes and store the shifted pointer (probably must remain aligned, or the test suit will go down in flames).

Moving the pointer is what the test I wrote does, either it is not documented well enough or I didn't understand the comment.

seberg · 2020-10-19T19:49:54Z

Sorry @mattip, I did not read the tests carefully yet, my main point was, that it would be nice to run the full test suit with a custom allocator set, and I don't think you are doing that yet?

mattip · 2020-10-19T20:04:27Z

run the full test suit

Correct. I am only running the tests under numpy.core, and even that only under the slow decorator. Running all the tests would double the test suite running time. I did run the complete test suite locally (by changing np.core.test() to np.test()) and it passed.

seberg · 2020-10-19T20:13:01Z

@mattip I was thinking we could run a single CI job with it? That would test it thoroughly also in the future without really adding much time to the whole process.

tdimitri · 2020-10-20T04:29:42Z

Thanks Matti. Just downloaded the version and took a look... here are my thoughts.

There should be a PyDataMem_GetHandler that just returns the current handler, as opposed to trying return it in PyDataMem_SetHandler(NULL) which has the side effect of resetting it. This would be the first function I would call anyway as I might map some allocator functions that I do not support yet to the existing function.
I would change the struct PyDataMem_Handler (this allowed? this the first time it is being exposed as public?)

This is what I currently see.

typedef struct {
    char name[128];  /* multiple of 64 to keep the struct aligned */
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
    PyDataMem_CopyFunc *host2obj;  /* copy from the host python */
    PyDataMem_CopyFunc *obj2host;  /* copy to the host python */
    PyDataMem_CopyFunc *obj2obj;  /* copy between two objects */
} PyDataMem_Handler;

I would try to plumb this structure for the future. I would put a struct version high/low. And I would also include FEATURE bits for the memory allocation. I imagine this structure will change in the future as new features are possible.
The current allocator understands huge pages for Linux and can cache small sizes. It does not allocate with byte alignment. So to me, those are three feature bits right there...
Understand Huge/Large Pages
Can Cache (or Can Cache small sizes)
Guarantees Byte Alignment

For small vs large it would be nice to know what the cutoff sizes for both are.. and maybe they are settable in the future.
For guaranteed byte alignment, what byte alignment is guaranteed?
What about thread safe allocator... the current numpy allocator is not thread safe for small sizes due to its caching of small memory (the cache part is not thread safe). Why care if the allocator is thread safe?... because the current numpy extension we are writing adds threads, and sometimes those threads allocate temporary buffers when doing work. Be nice to use a thread safe allocator that caches so that threads could get quick memory.

Another feature might be GPU aware or NUMA node aware.

Then if each allocator chains to the next (see the next pointer below).. that would be a nice feature.

So the struct might end up looking something like this...

typedef struct {
   PyDataMem_Handler* next;  /* next in chain, null terminated */
   int16 versionhi;   // current version of pydatamem_handler
   int16 versionlow;
   int32 structsize;   // size of this struct
   int64 feature_bits;  // 64 feature bits including has large pages, numa aware, thread safe, cancache, etc.
   char name[128];  /* multiple of 64 to keep the struct aligned */
..
   int64   cachesmallsize;  // up to what size is small memory cached
   int64   cachelargesize;  // at what size is large memory allocated
   int64   bytealignment;  // what byte alignment is guaranteed
..
};

Then I don't know if PyDataMem_SetHandler can fail, but seems possible. So is NULL returned on failure?

mattip · 2020-10-20T08:44:56Z

@tdimitri thanks for the review.

Maybe PyDataMem_SetHandler should not return anything. I am not comfortable with the user overriding only some of the functions with no internal checks. Perhaps another idea would be to force the user to set at least alloc/free/realloc, and if the others are NULL fill them in with defaults.

Yes, this is the first time this struct is being exposed. We use the NumPy-wide C_API_VERSION to specify the version. The idea of versioning public structures has come up before, see this discussion. We should set an across-the-board policy for whether it matters and if so how to deal with it.

I am concerned with "feature creep" when adding bits that describe aspects of the functions. Since the PyDataMem_Handler is a public structure, changing it will requires recompiling the entire scientific python ecosystem. If downstream users think the struct should be more flexible, we should make it private and access it with a different API that only exposes it as a void *handle. Any functionality can be coded into the name. I am curious what you expect users to do with the information on cachesmallsize or bytealignment?

tdimitri · 2020-10-20T12:23:44Z

I agree that any memory handler has to follow some rules like chain to the previous handler, and fill in any NULL function pointers with the previous value or as you point out alloc/free/realloc.

In both Linux and NT kernels, it is common to chain and use the first member of the struct to chain. There are macros in both kernels to do this. I think this is a good practice that I have used myself.

As per feature creep, the flip side is no or few new features? How often is this really going to change and force a stack recompile?

As per cachesmallsize... I might defer to the existing memory allocator for smallsizes, but I would need to compare sizes to know when to call it.

Information like this (and more) can be returned in an info call. Per byte alignment, let's say the current allocator already does 32 byte alignment, and I have AVX-2526 -- then no need to change. But then we detect AVX-512 capability on another computer, now I need to replace handler with 64 byte alignment.

What happens in the future when a version 1.0 memory handler calls SetHandler that is version 2.0 ? That might fail, which is why I thought it could fail.

On another note, not sure why char name[128]; /* multiple of 64 to keep the struct aligned */ is used instead of
const char* name. If the struct holds permanent function pointers, then it can hold a pointer to a string as well.

For riptable, we can also flush the cache or return statistics on the cache. For instance, we return the hit/miss ratio. We also have a timeout for the garbage collector that can be changed.

In numpy, code can be written to flush the cache, but there is no mechanism for it.

Another possible addition to the Struct is a field reserved (void * ownerdefined) for the Struct owner.

Then if memory handlers are chained, might need a way to remove a "name" from the chain (or move to front).

seberg · 2020-10-20T14:39:42Z

Sounds like there is enough reason that we may want to grow the struct/feature list in the future. So it probably does make sense to make the struct opaque and maybe version it. I suppose going with a FromSpec API would work as well, but is maybe more than necessary.
Some things like cache hits/misses, seems like it probably doesn't need to be defined in NumPy, but I wonder if we could extend things for example to expose not just a name, but a Python object?

leofang · 2020-10-20T20:53:36Z

If downstream users think the struct should be more flexible, we should make it private and access it with a different API that only exposes it as a void *handle.

Another possible addition to the Struct is a field reserved (void * ownerdefined) for the Struct owner.

I think having a void* in the struct would be useful for downstream libraries to hack in additional fields. For example, in the GPU world the memory allocation/deallocation could be stream-ordered, so seems useful to me if I could attach a stream pointer to the malloc/free routines. I haven't put much though into this PR so just throwing out a few bits off top of my head 😅

tdimitri · 2020-10-21T13:47:52Z

Any solution that provides 1) PyDataMem_GetHandler , 2) can chain and I can walk the chain, 3) can remove a handler from the chain. and 4) indicates what version of the memory handler (which i suppose too old a version can be rejected) in the future I am ok with and will write code to use. If up to me, I would not make the struct opaque. Thank you.

mattip · 2020-10-21T13:53:56Z

Could you give a concrete use case for walking the chain of methods? What problem does this solve?

tdimitri · 2020-10-21T14:27:19Z

If there are only two possible memory handlers at any point; The new one and the default one... then chaining might not matter. But I believe there can be > 2 as any module can try to Set the memory handler. Especially because arrays can stick around for the life of the process and only the memory handler that allocated can properly delete it.

Per walking the chain -- aside from just general debugging to know who has hooked what... (which i prefer is exposed and not hidden) I could walk the chain, get to the last member of the chain or search for the "default" handler name. Then I would make note of the default alloc handler.

Then when a small memory request came in, I might call the default handler since I know it already has a cache for small memory < 1024 in size.

In general anytime I wanted to "punt" to the default routine I could.
Or, I could see I was already in the chain, but not first. That might help me understand why I am not seeing memory allocations.

Or there could be two useful memory handlers in the chain -- perhaps one that is NUMA aware -- or for Windows it can allocate large pages (numpy does not currently support it). Maybe a module in the future does support it. I might search for and call that memory handler for large objects. windows large page support

From my experience, it is common to be able call the next hook see:CallNextHookEx

Or here is an example of chaining signal handlers in linux chain signal handler -- but what if the previous memory handler is removed? Then what was returned in SetMemoryHandler() (the previous one) is now gone but I still have a pointer to it.

I might just call if (next) next->alloc() since next-> is always up to date if I wanted to punt the alloc.

mattip · 2020-10-22T06:00:28Z

It seems this is complicated enough that a short NEP is needed.

mattip · 2020-10-26T14:25:39Z

Since this changes the PyArrayObject_fields struct, it will hit the same problems that gh-16938 hit in this long discussion:it turns out that we cannot modify C_ABI_VERSION without breaking other projects.

leofang · 2020-11-24T15:50:43Z

Just curious about the status of this PR?

mattip · 2021-10-25T15:56:22Z

Can we get to a point where if we got NULL here, we use the normal free for deallocating and give a warning that mentions the base-object workaround/encourages to open a NumPy issue for support if necessary?

@seberg I added a warning in e81eff4.

jakirkham · 2021-10-25T18:33:40Z

Sorry random question that I've been thinking about. Would it be possible to use this with shared memory?

seberg

Thanks Matti and Elias, great work keeping up with all the API request changes and introducing the context var!
As far as I am aware the public API is de-facto settled and this has gotten some pretty thorough review. The last real concern I had was the OWNDATA "transfer" which is addressed now.

I think at this point, progress on the PR would give diminishing returns and it seems stable enough to ship.

@numpy/numpy-developers If anyone has API concerns or finds any fixup requests, please do not hesitate to note them, I expect we still have some wiggle-room (but personally would be surprised about any serious change). But the next release is also soon ;).

Notes, to move to a new issue:

Release note is missing?
I would like a bit "hand holding" for the arr->flags |= NPY_OWNDATA user. I.e. tell users in some place, (maybe the flag documentation?) linked from the warning: Please use use a PyCapsule as the array base (this is simple, but you have to know about PyCapsule). (I could look into that myself.)
PyCapsule_GetPointer technically can set an error and may need the GIL? I am not worried enough about this to delay, but the functions look like they are meant to be callable without the GIL (I doubt we use it, and I doubt it really can error/need the GIL, but maybe we can make it a bit nicer anyway).
The void functions use (normally) small allocation: I think it would make more sense to just use the Python allocators rather than a (possibly slow) aligned allocator.

doc/neps/nep-0049.rst

seberg · 2021-10-25T18:54:06Z

Lets try this 🎉, thanks again!

seberg · 2021-10-25T19:03:52Z

Sorry random question that I've been thinking about. Would it be possible to use this with shared memory?

Not sure I understand the question, but maybe I don't know the shared-memory use-case well enough. This allows you to customize allocations. I suppose you could do something:

with allocation_policy_designed_for_sharing():
    arr = np.arange(100000)

share_data(arr)  # Knows how `arr`'s data was allocated and may have customized where/how it was allocated.

But does that help?!

leofang · 2021-10-25T19:08:31Z

Great work @mattip and everyone! Looking forward to the evolution of this allocator! @eliaskoromilas's numpy-allocator is great in that we have a Python-only playground, but I hope we can mature this support sooner so that the Python allocator API can be made as part of NumPy.

jakirkham · 2021-10-25T19:09:29Z

But does that help?!

Quite possibly, yes. Am still thinking through how this might work. Though being able to plug this into NumPy would be pretty useful.

Fix astropy#12324 : Fix the warnings emitted by Numpy when transferring ownership by setting ``NPY_ARRAY_OWNDATA``: RuntimeWarning: Trying to dealloc data, but a memory policy is not set. If you take ownership of the data, you must set a base owning the data (e.g. a PyCapsule). The warning was added in [1], and is now visible only when setting the NUMPY_WARN_IF_NO_MEM_POLICY environment variable [2]. For more details and a fix see [3]. [1] numpy/numpy#17582 [2] numpy/numpy#20200 [3] https://github.com/numpy/numpy/blob/main/doc/source/reference/c-api/data_memory.rst#what-happens-when-deallocating-if-there-is-no-policy-set

mattip changed the title ~~Configurable allocator~~ ENH: Configurable allocator Oct 19, 2020

mattip added 01 - Enhancement component: numpy._core labels Oct 19, 2020

mattip mentioned this pull request Oct 19, 2020

would like numpy to add a hook when it allocates arrays Quansight/pnumpy#14

Open

mattip commented Oct 19, 2020

View reviewed changes

eric-wieser reviewed Oct 19, 2020

View reviewed changes

doc/source/reference/c-api/data_memory.rst Outdated Show resolved Hide resolved

eric-wieser reviewed Oct 19, 2020

View reviewed changes

doc/source/reference/c-api/data_memory.rst Outdated Show resolved Hide resolved

eric-wieser reviewed Oct 19, 2020

View reviewed changes

mattip force-pushed the configurable_allocator branch 2 times, most recently from f71fd5b to 811b9e4 Compare October 19, 2020 15:19

seberg reviewed Oct 19, 2020

View reviewed changes

mattip mentioned this pull request Oct 19, 2020

ENH: allow using aligned memory allocation, or exposing an API for memory management #17467

Closed

mattip added the triage review Issue/PR to be discussed at the next triage meeting label Oct 21, 2020

mattip removed the triage review Issue/PR to be discussed at the next triage meeting label Nov 4, 2020

ENH: fix and test for blindly taking ownership of data

442b0e1

mattip force-pushed the configurable_allocator branch from e81eff4 to 442b0e1 Compare October 25, 2021 17:17

seberg approved these changes Oct 25, 2021

View reviewed changes

doc/neps/nep-0049.rst Outdated Show resolved Hide resolved

Update doc/neps/nep-0049.rst

8ca8b54

seberg merged commit 84e0707 into numpy:main Oct 25, 2021

seberg mentioned this pull request Oct 25, 2021

TASK: Potential fixup/followups for the allocator changes #20193

Closed

5 tasks

mattip mentioned this pull request Oct 28, 2021

BUG: Integer divide tests fail on 32 bit Windows wheel builds. #20216

Closed

thomasjpfan mentioned this pull request Nov 1, 2021

Upstream nightly dev build failing scikit-learn/scikit-learn#21524

Closed

WarrenWeckesser mentioned this pull request Nov 3, 2021

BUG(?): Tests print 'uh-oh, unmatched shift_free(ptr, 8) but allocated 720' #20289

Closed

This was referenced Nov 5, 2021

Use an aligned allocator for NumPy? #5312

Closed

np.errstate is not asyncio-safe #9444

Closed

mattip deleted the configurable_allocator branch December 17, 2021 08:44

h-vetinari mentioned this pull request Apr 7, 2022

Rebuild for PyPy3.8 and PyPy3.9 conda-forge/numpy-feedstock#263

Merged

eendebakpt mentioned this pull request May 10, 2022

ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21488

Closed

eendebakpt mentioned this pull request May 18, 2022

ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21531

Closed

saimn mentioned this pull request Dec 11, 2022

Fix memory policy warning in io.fits astropy/astropy#14168

Merged

rtpsw mentioned this pull request May 10, 2023

[C++][Parquet] Parquet write_to_dataset performance regression apache/arrow#35498

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Configurable allocator #17582

ENH: Configurable allocator #17582

mattip commented Oct 19, 2020

mattip Oct 19, 2020

eric-wieser Oct 20, 2020

mattip Oct 19, 2020

seberg Apr 11, 2021

mattip Oct 19, 2020

eric-wieser Oct 19, 2020

mattip Oct 19, 2020

mattip Oct 19, 2020

mattip commented Oct 19, 2020

seberg left a comment

mattip commented Oct 19, 2020

seberg commented Oct 19, 2020

mattip commented Oct 19, 2020

seberg commented Oct 19, 2020

tdimitri commented Oct 20, 2020 •

edited

Loading

mattip commented Oct 20, 2020

tdimitri commented Oct 20, 2020 •

edited

Loading

seberg commented Oct 20, 2020

leofang commented Oct 20, 2020

tdimitri commented Oct 21, 2020

mattip commented Oct 21, 2020

tdimitri commented Oct 21, 2020

mattip commented Oct 22, 2020

mattip commented Oct 26, 2020

leofang commented Nov 24, 2020

mattip commented Oct 25, 2021

jakirkham commented Oct 25, 2021

seberg left a comment

seberg commented Oct 25, 2021

seberg commented Oct 25, 2021

leofang commented Oct 25, 2021

jakirkham commented Oct 25, 2021

ENH: Configurable allocator #17582

ENH: Configurable allocator #17582

Conversation

mattip commented Oct 19, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattip commented Oct 19, 2020

seberg left a comment

Choose a reason for hiding this comment

mattip commented Oct 19, 2020

seberg commented Oct 19, 2020

mattip commented Oct 19, 2020

seberg commented Oct 19, 2020

tdimitri commented Oct 20, 2020 • edited Loading

mattip commented Oct 20, 2020

tdimitri commented Oct 20, 2020 • edited Loading

seberg commented Oct 20, 2020

leofang commented Oct 20, 2020

tdimitri commented Oct 21, 2020

mattip commented Oct 21, 2020

tdimitri commented Oct 21, 2020

mattip commented Oct 22, 2020

mattip commented Oct 26, 2020

leofang commented Nov 24, 2020

mattip commented Oct 25, 2021

jakirkham commented Oct 25, 2021

seberg left a comment

Choose a reason for hiding this comment

seberg commented Oct 25, 2021

seberg commented Oct 25, 2021

leofang commented Oct 25, 2021

jakirkham commented Oct 25, 2021

tdimitri commented Oct 20, 2020 •

edited

Loading

tdimitri commented Oct 20, 2020 •

edited

Loading