Add framework for extensible ArrayFire memory managers #2461

jacobkahn · 2019-03-19T00:03:08Z

Motivation

Many different use cases require performance across many different memory allocation patterns. Even different devices/backends have different costs associated with memory allocations/manipulations. Having the flexibility to implement different memory management schemes can help optimize performance for the use case and backend.

Framework

The basic interface lives in a new header: include/af/memory.h and includes two interfaces:
- A C-style interface defined in the af_memory_manager struct. It holds function pointers that custom memory manager implementations should set, along with device/backend-specific functions (e.g. nativeAlloc) that the implementation can call and that are set dynamically. Typesafe C-style struct inheritance should be used.
- A C++ style interface using MemoryManagerBase, which defines pure-virtual methods for the API along with device/backend-specific functions as above.

A C++ implementation is simple, and requires only:

#include <af/memory.h>
...
class MyCustomMemoryManager : public af::MemoryManagerBase {
...
  void* alloc(const size_t size, bool user_lock) override {
    ...
    void* ptr = this->nativeAlloc(...);
    ...
  }
};

// In some code run at startup:
af::MemoryManagerBase* p = new MyCustomMemoryManager();
af::setMemoryManager(p);

For the C API:

#include <af/memory.h>
...
void* af_memory_manager_impl_alloc(const size_t size, bool user_lock) {
  ...
}

typedef struct af_memory_manager_impl {
  af_memory_manager manager; // inherit base methods
  // define custom implementation
} af_memory_manager_impl;

// In some code run at startup:
...
af_memory_manager* p = (af_memory_manager*)malloc(sizeof(af_memory_manager_impl));
p->af_memory_manager_alloc = &af_memory_manager_impl_alloc;
...
af_set_memory_manager(p);

Details

The F-bound polymorphism pattern present in the existing MemoryManager implementation is removed; removing this was required as it precludes dynamically dispatching to a derived implementation.
New interfaces are defined for C/C++ (see below)
MemoryManagerCWrapper wraps a C struct implementation of a memory manager and facilitates using the same backend and DeviceManager APIs to manipulate a manager implemented in C.
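To make the wrapper idea concrete, here is a minimal sketch of how a C struct of function pointers might be adapted to a C++ base class so that internal code only ever sees the C++ type. All names below (c_manager, ManagerBase, CWrapper, counting_manager) are hypothetical stand-ins for illustration; they are not the types or signatures used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical C-style manager: a struct of function pointers.
extern "C" {
typedef struct c_manager {
    void* (*alloc)(struct c_manager* self, std::size_t bytes);
    void (*unlock)(struct c_manager* self, void* ptr);
} c_manager;
}

// Hypothetical C++ interface that internal code would program against.
class ManagerBase {
public:
    virtual ~ManagerBase() = default;
    virtual void* alloc(std::size_t bytes) = 0;
    virtual void unlock(void* ptr) = 0;
};

// Adapter: forwards every virtual call to the C function pointers.
class CWrapper : public ManagerBase {
public:
    explicit CWrapper(c_manager* impl) : impl_(impl) { assert(impl != nullptr); }
    void* alloc(std::size_t bytes) override { return impl_->alloc(impl_, bytes); }
    void unlock(void* ptr) override { impl_->unlock(impl_, ptr); }

private:
    c_manager* impl_;  // not owned by the wrapper
};

// C-style "struct inheritance": the base struct is the first member,
// so a pointer to the derived struct can be recovered from the base pointer.
typedef struct counting_manager {
    c_manager base;
    int live;  // number of outstanding allocations
} counting_manager;

static void* counting_alloc(c_manager* self, std::size_t bytes) {
    reinterpret_cast<counting_manager*>(self)->live++;
    return std::malloc(bytes);
}

static void counting_unlock(c_manager* self, void* ptr) {
    reinterpret_cast<counting_manager*>(self)->live--;
    std::free(ptr);
}

inline counting_manager make_counting_manager() {
    counting_manager m;
    m.base.alloc  = &counting_alloc;
    m.base.unlock = &counting_unlock;
    m.live        = 0;
    return m;
}
```

With this shape, code that holds a ManagerBase* does not need to know whether the implementation was written against the C struct or derived directly in C++.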

API Design Decisions

If a custom memory manager is not defined or set, the default memory manager will be used. While the default memory manager implements the new interface, its behavior is identical to existing behavior by default (as verified by tests).
Memory managers should be stored on the existing DeviceManager framework so as to preserve the integrity of existing backend APIs; memory managers can exist on a per-backend basis and work with the unified backend.
Existing ArrayFire APIs expect garbage collection and memory step sizing to be implemented in a memory manager. These and a few other slightly opinionated methods are included in the overall API.
- That said, these methods can be noops or throw exceptions (e.g. garbage collection) if the style of custom memory manager implementation doesn't implement those facilities.
Setting a memory manager should use one API in the same C/C++ fashion so as to be compatible with the unified backend via dynamic invocation of symbols in a shared object. The C and C++ APIs should have a polymorphic relationship such that either can be passed to the public API (af::MemoryManagerBase is thus a subtype of af_memory_manager, a C struct).

Adds tests defining custom memory manager implementations in both the C++ and C API and testing end-to-end Array allocations and public AF API calls (e.g. garbage collection, step size).

9prady9 · 2019-03-19T10:39:19Z

@jacobkahn Yes, the API guards should be 37 if the new feature is to be made available in the 3.7 minor release. I have quickly gone over the API headers and tests, and I feel the API is a bit complex. Perhaps it can be simplified further. I will go over it once again and provide more detailed feedback.

Thank you for the contribution! We are excited about this new feature 👍

Regarding the ci failures if you are wondering where to check the logs, here is the link
http://ci.arrayfire.org/index.php?project=ArrayFire

You can filter the entries using the PR number 2461 (click Gear icon on top right corner of ctest dashboard -> Show Filters -> Set Build Name contains as 2461) since all the jobs didn't run today.

jacobkahn · 2019-03-19T18:27:52Z

@9prady9 — thanks. I'd made a mistake rebasing and all tests look green now.

Regarding the API version, I think it would be great if we could have this in the 3.6.3 release — so I'm inclined to leave this as is (inclusion in API version 36) if that's alright.

Regarding the API: how would you make it simpler? The biggest constraint is that we have to keep all existing functions called by each backend. For instance, af::deviceGC/af_device_gc is part of the existing public API and calls memoryManager().garbageCollect() in [cpu,cuda,opencl]/memory.cpp. The JIT also calls deviceMemoryInfo which ends up hitting memoryManager().bufferInfo(...) in the same way.

The list of methods already in the public API/used by the JIT/backend-specific operations includes:

garbageCollect (public API)
bufferInfo (JIT)
getMemStepSize (public API)
setMemStepSize (public API)
printInfo (public API)
getMaxBuffers (JIT)
getMaxBytes (JIT)
checkMemoryLimit (CPU queue)
allocated (JIT)
addMemoryManagement/removeMemoryManagement (OpenCL backend)

The only other relevant/required methods that might be included in a barebones API (and which are also included in the public memory manager API) are:

alloc
unlock
userLock
userUnlock
isUserLocked

pavanky · 2019-03-19T21:35:15Z

I am not sure we need full-fledged support for adding a memory manager via both C and C++. One of them should be a lightweight wrapper around the other.

jacobkahn · 2019-03-20T03:27:27Z

@pavanky — MemoryManagerCWrapper wraps the C API so that internally, only a C++ class is used.

I couldn't come up with a better way of structuring the public APIs in this respect; the C API obviously can't use a C++ class, and having a class for the C++ API makes things much simpler compared to using function pointers/a C struct. What did you have in mind?

9prady9 · 2019-03-20T09:45:08Z

3.6.3 would be a fix release and we can't add new functions to fix release. It has to be in 3.7. If having it in a fix release is important for your use case, we can work with you on giving you a special build via https://arrayfire.com/support/ . Can you please look into the timeout on the Windows CPU job?

Coming back to our discussion about the API, I think the following functionalities don't need custom implementations, as far as I am aware of the existing Memory Manager implementation. These functionalities merely fetch/set attributes or do some sanity checks or print information. In a few cases they do make modifications, but trivial ones.

af_memory_manager_add_memory_management
af_memory_manager_remove_memory_management
af_memory_manager_check_memory_limit
af_memory_manager_buffer_info
af_memory_manager_user_lock
af_memory_manager_user_unlock
af_memory_manager_is_user_locked
af_memory_manager_get_max_bytes
af_memory_manager_get_max_buffers
af_memory_manager_print_info
af_memory_manager_get_mem_step_size
af_memory_manager_set_mem_step_size
af_memory_manager_get_active_device_id
af_memory_manager_get_max_memory_size

Given below is a rough idea (definitely not 100%) of how I envision the customizable memory manager API. I have taken some cues from your existing changes and modified them further based on how we handle resource management in other locations of ArrayFire.

typedef void (*af_memory_manager_initialize_fn)(af_memory_manager);

typedef void (*af_memory_manager_shutdown_fn)(af_memory_manager);

typedef void* (*af_memory_manager_alloc_fn)(af_memory_manager, const size_t, bool);

typedef void (*af_memory_manager_unlock_fn)(af_memory_manager, void*, bool);

typedef void (*af_memory_manager_garbage_collect_fn)(af_memory_manager);

typedef void* (*af_memory_manager_native_alloc_fn)(af_memory_manager, size_t);

typedef void (*af_memory_manager_native_free_fn)(af_memory_manager, void*);

// Creates a handle to an opaque object that internally calls
// user set callbacks if available, otherwise default implementations
// from each respective backends.
af_err af_create_memory_manager(af_memory_manager* out); //(with defaults)

af_err af_release_memory_manager(af_memory_manager); //(restore defaults internally)

af_err af_memory_manager_set_init_callback(
  af_memory_manager handle,
  af_memory_manager_initialize_fn init_fn
);

af_err af_memory_manager_set_shutdown_callback(
  af_memory_manager handle,
  af_memory_manager_shutdown_fn shutdown_fn
);

af_err af_memory_manager_set_alloc_callback(
  af_memory_manager handle,
  af_memory_manager_alloc_fn alloc_fn
);

af_err af_memory_manager_set_unlock_callback(
  af_memory_manager handle,
  af_memory_manager_unlock_fn unlock_fn
);

af_err af_memory_manager_set_garbage_collect_callback(
  af_memory_manager handle,
  af_memory_manager_garbage_collect_fn garbage_collect_fn
);

af_err af_memory_manager_set_native_alloc_callback(
  af_memory_manager handle,
  af_memory_manager_native_alloc_fn native_alloc_fn
);

af_err af_memory_manager_set_native_free_callback(
  af_memory_manager handle,
  af_memory_manager_native_free_fn native_free_fn
);

C++ API will be merely managing the resource handle, af_memory_manager and
it may look like the following.

namespace af {
class MemoryManager {
  public:
    using InitFn           = af_memory_manager_initialize_fn;
    using ShutdownFn       = af_memory_manager_shutdown_fn;
    using AllocFn          = af_memory_manager_alloc_fn;
    using UnlockFn         = af_memory_manager_unlock_fn;
    using GarbageCollectFn = af_memory_manager_garbage_collect_fn;
    using NativeAllocFn    = af_memory_manager_native_alloc_fn;
    using NativeFreeFn     = af_memory_manager_native_free_fn;

    MemoryManager() { /* ... create resource handle (with defaults) ... */ }
    ~MemoryManager() { /* ... destroy resource handle (restore defaults internally) ... */ }
    void registerInit(InitFn fn) {
      AF_CHECK(af_memory_manager_set_init_callback(mHandle, fn));
    }
    // ... similarly register callbacks for other functionality
  private:
    af_memory_manager mHandle;
};
} // namespace af

void customMemManagerInit(af_memory_manager handle) {
  // initialization code
  // To enable access to attributes for modification or reading,
  // additional member access functions should be provided for
  // the resource handle af_memory_manager, other than what I have
  // shown above.
}

int main() {
  af::MemoryManager memMngr;
  memMngr.registerInit(customMemManagerInit);
  // regular code.
}

Obviously whatever I have suggested does require a lot of changes to how the memory manager
has been implemented currently. Feel free to suggest modifications.

jacobkahn · 2019-03-20T16:39:13Z

Thanks for those suggestions, @9prady9!

we can't add new functions to fix release

Ah — I didn't know. We can wait until 3.7 then. Not a problem. Our highest priority is for others to be able to reproduce our results in terms of performance; we'd want others to be able to do so by using arrayfire:master rather than a fork.

I'm working on repro-ing the Windows/CPU issue — should be an easy fix.

These functionalities merely fetch/set attributes or do some sanity checks or print information

I certainly agree that it would be nice to not add these extra methods that fetch/set attributes to the public API (I tried to make this work!), but not doing so seems to create some problems and make things quite opinionated:

If a method is not implemented publicly, what happens when ArrayFire calls it internally? As we know, if we change nothing else, the implementation won't typecheck unless we implement those methods in the base interface. If we exclude them from the base interface but call the default implementation, I see a few issues:
- First, it forces us to maintain two references: one to the custom implementation and one to the default implementation. This seems error-prone.
- It's wasteful — for example, in the case of addMemoryManagement, relying on the default implementation forces use of its data structures, which, depending on the external custom allocator implementation, might be unused/completely irrelevant (with an arena or obstack allocator, for instance).
How would your API work with the JIT if allocated isn't exposed? If it calls the default implementation, it assumes something is keeping track of how I've allocated memory, and that something lives inside the default memory manager. That's pretty opinionated, since I must structure my cache to use these data structures, and accessing them outside of the default allocator might be difficult to manage.
Exposing native_alloc/native_free as it is here seems like it makes working with the unified API difficult — every time the backend switches, a new function pointer needs to be registered to all backend-specific methods, rather than relying on dynamically calling pre-registered ones.

Unless of course you're suggesting that we remove callsites of methods like addMemoryManagement, allocated, or printInfo entirely, I don't see a way around most of these. Let me know what you think.

C++ API will be merely managing the resource handle

This isn't a huge change to make since the MemoryManagerCWrapper I mentioned above already implements this pattern for the most part.

That said, making the API function pointer-based in C++ is also a bit restrictive, since one can never pass any class methods; it forces use of global state/precludes conveniently storing intermediary state in a C++ class without using the handle. Although that implementation can definitely work.

9prady9 · 2019-03-21T07:05:58Z

I will respond to your last comment first: passing class methods is possible using a simple trick similar to glfwSetWindowUserPointer. Using this trick, the C++ class for memory manager would be something like the following. It is definitely not the final version; I am just pointing out that it is possible to pass along a class member function.

MemoryManager() {
    auto initCallback = [](af_memory_manager handle) {
        void* ptr;
        af_mm_get_user_ptr(&ptr, handle)
        MemoryManager* objPtr =  s
        static_cast<MemoryManager*>(ptr)->init();
    };
}

MemoryManager::init() {
}

In my opinion, the eventual goal should be to avoid making the user/developer re-implement the entire interface. The following areas are some locations where custom logic can be done.

Allocation and free calls - in the cases where the user/developer wants to use custom allocators like jemalloc or any other framework's alloc/free API etc.
Caching - should the implementer choose to cache/pool allocations, this logic should be customizable per need.
Garbage Collection - To implement custom heuristics that trigger garbage collection of cached/pooled allocations.

Hence, I chose a specific set of functions and not all from the memory manager API. My responses to your checklist are given inline below.

If a method is not implemented publicly, what happens when ArrayFire calls it internally? As we know, if we change nothing else, the implementation won't typecheck unless we implement those methods in the base interface. If we exclude them from the base interface but call the default implementation, I see a few issues:
First, it forces us to maintain two references: one to the custom implementation and one to the default implementation. This seems error-prone.
It's wasteful: for example, in the case of addMemoryManagement, relying on the default implementation forces use of its data structures, which, depending on the external custom allocator implementation, might be unused/completely irrelevant (with an arena or obstack allocator, for instance).

I should have emphasized what I was going for before suggesting the API. My primary goal in my earlier comment was to have an API that lets the user/developer to write the logic involved with caching, garbage collection etc. Not to implement the entire interface if the user wants to write a custom allocator. For example, some use cases may just need custom alloc/free functions without re-implementing the other functional logic. With the current design, even where only partial customization is required, the user/developer has to deal with the entire interface. Adding/removing memory management for a specific device essentially marks that memory for that device will be managed by the manager from that point forward. This should be a trivial function, I think. If it isn't trivial in the current code base, we should change it to be as simple as possible. That said, I am not sure if this add/remove management needs customization.

How would your API work with the JIT if allocated isn't exposed? If it calls the default implementation, it assumes something is keeping track of how I've allocated memory, and that something lives inside the default memory manager. That's pretty opinionated, since I must structure my cache to use these data structures, and accessing them outside of the default allocator might be difficult to manage.

I am not yet sure what the final API should look like. If what I suggested, only allowing the user/developer to customize the algorithmic logic, is not possible, then perhaps what you have now is one way to do it. Having said that, I think we might be able to expose the data structures you were referring to without causing too many changes. It would take some extra helper functions at the C-API level, but I think it is possible.

Exposing native_alloc/native_free as it is here seems like it makes working with the unified API difficult — every time the backend switches, a new function pointer needs to be registered to all backend-specific methods, rather than relying on dynamically calling pre-registered ones.

That is true. Maybe just having separate callback functions (af_memory_manager_set_<cpu|cuda|opencl>_native_alloc) for each backend in the unified API would handle it. I am not yet sure if that is the optimal route, but it should be manageable.

jacobkahn · 2019-03-22T03:36:26Z

Thanks for elaborating more. I'll try to continue the discussion in order:

C++ class for memory manager would be something like the following

I see and understand the GLFW pattern, but that seems a bit different from your example, and I don't see a C++ class used in GLFW. Is your initCallback what's going to be passed to af_memory_manager_set_init_callback above? I'd tried an API like this when I was starting out, and sadly, C++ std::functions can't be converted to void* to facilitate using the C API. (Also, is the line with MemoryManager* objPtr = s supposed to be deleted?)

I do something roughly identical with the C wrapper, except those functions need to be global so we can set function pointers properly.

Would be great if you could expand your example a bit so I could get a better idea of what that design would look like 👍.

have an API that lets the user/developer to write the logic involved with caching, garbage collection etc. Not to implement the entire interface . . . avoid making the user/developer re-implement the entire interface

Thanks for clarifying this. I certainly agree that we need a way for the user to not need to implement everything, but it's tough to find a middle ground between that and making everything too opinionated.

The cost to having to write a quick override/exception throw for a part of the API that the user doesn't want to re-implement seems pretty low:

class MyCustomMemoryManager : public af::MemoryManagerBase {
  ...
  void garbageCollect() override {
    throw af::exception("Garbage collection not implemented for my custom manager!");
  }
  ...
};

Further, it's hard for me to see a case in which a user would want to implement only a few pieces of functionality without redefining almost everything. If I were to break this down further:

Essentially anyone who wants to create their custom memory manager presumably wants to write functions for:
- alloc
- unlock/free
Some implementation-dependent functions might include:
- The user locking abstraction: userLock/userUnlock/isUserLocked (which can be safely ignored)
- Garbage collection (which can also safely be ignored)

The question becomes: what should these functions do if the user provides no implementation for them? I don't feel it makes any sense for garbage collection to do anything if the user doesn't create a memory manager that does garbage collection, but the function has to be defined somewhere (if not in the user's implementation, where is it defined and what does it do? Is it a noop in the default allocator?). The most straightforward thing in this case seems to be to have the user define it themselves to do nothing or throw.

I definitely agree there is a list of methods which are very specific/are rarely useful to reimplement:
- printInfo
- addMemoryManagement, removeMemoryManagement
- JIT-related functions: allocated, bufferInfo, getMaxBuffers, getMaxBytes, checkMemoryLimit
- Step sizing: setMemStepSize, getMemStepSize

Yet those functions need to stay unless we're willing to modify the underlying implementations of their callers. The question remains, though: if the user doesn't define them, what do they do? Because some methods rely on allocator state, we'd have to arbitrarily expose things like the definition of and a pointer to the base memory manager's memory_info so that the user could manipulate it in their alloc function, just so the base memory manager could still have a default implementation of allocated. Two points about this:

If the user didn't want to use memory_info, then they'd need to define their own version of allocated anyways (since they're using their own data structures), and the default implementation would be useless.
If the user did want to use memory_info, why are they defining a new allocator? Perhaps they want to use jemalloc instead of malloc. Yes, they might have to copy the definition of memory_info in this case, but that seems better than needlessly exposing it so that the user can make a tiny tweak.
- On this line of thought, I could be wrong, but my sense is that the use case for this API is much less about a user wanting to substitute jemalloc in for malloc in nativeAlloc, and much more likely to be about the user wanting to define an entirely new scheme for allocation, which is probably one in which memory_info is irrelevant to them.
  - Note that with the existing implementation, the user doesn't even need to call nativeAlloc; it's available, but they could have their alloc override call jemalloc directly and ignore the predefined nativeAlloc.

Thoughts on the above? Thanks for a great discussion so far.

jacobkahn · 2019-03-28T17:25:31Z

@9prady9 @pavanky @umar456 — wanted to bump this; any thoughts based on the above?

9prady9 · 2019-03-29T02:09:08Z

@jacobkahn Sorry about the delay, I was caught up with preparing things for a fix release and other bug fixes. I will go over your most recent comment and get back to you soon.

9prady9 · 2019-04-03T09:55:36Z

The sample code had a typo, and I think I missed pasting the last line. It should be the following.

MemoryManager() {
    auto initCallback = [](af_memory_manager handle) {
        void* ptr;
        af_mm_get_user_ptr(&ptr, handle);
        static_cast<MemoryManager*>(ptr)->init();
    };
    // mHandle: the af_memory_manager member from the earlier class sketch
    af_memory_manager_set_init_callback(mHandle, initCallback);
}
void MemoryManager::init() {
    /* ... */
}

Yes, you are correct about not being able to pass std::function, but that would be true w.r.t C-API always. Isn't it ?

About adding no-op implementations: Yes, it is not hard to add no-ops for the unused API of a custom memory manager, but it is extra work nonetheless. In my earlier attempts at suggesting an API, I wasn't trying to replace the entire memory management code with a custom implementation but rather to provide a way for the user to customize the logic for caching allocations and garbage collection. I had reentrant functions in mind when I was suggesting it, where the function pointers passed to the C-API don't rely on any state other than the parameters passed to those functions. That way, it doesn't matter whether the code at runtime is executing a user-provided function or ArrayFire's internal/default function.

If the user does want to use memory_info, what I suggested wouldn't have any issue as custom logic is called only where user registered a callback.

If the user doesn't want to use memory_info and wants to maintain their own version of data structure to pool memory allocations, then that is not possible via the method I was suggesting.
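The idea that custom logic runs only where the user registered a callback can be sketched as a table of function pointers in which every slot starts out pointing at a built-in default. This is only an illustration under assumed names (CallbackTable, defaults::alloc, and so on are hypothetical), with malloc/free standing in for the real backend allocation calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical callback signatures; stateless, so they depend only on their arguments.
using AllocFn = void* (*)(std::size_t bytes);
using FreeFn  = void (*)(void* ptr);
using GcFn    = void (*)();

namespace defaults {
inline void* alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void release(void* ptr) { std::free(ptr); }
inline void garbageCollect() { /* the default does nothing in this sketch */ }
}  // namespace defaults

class CallbackTable {
public:
    // Registering nullptr restores the default for that slot.
    void setAlloc(AllocFn fn) { alloc_ = fn ? fn : &defaults::alloc; }
    void setRelease(FreeFn fn) { release_ = fn ? fn : &defaults::release; }
    void setGarbageCollect(GcFn fn) { gc_ = fn ? fn : &defaults::garbageCollect; }

    // Callers never need to know whether a slot holds user code or the default.
    void* allocate(std::size_t bytes) {
        assert(alloc_ != nullptr);
        return alloc_(bytes);
    }
    void release(void* ptr) { release_(ptr); }
    void collect() { gc_(); }

private:
    AllocFn alloc_  = &defaults::alloc;
    FreeFn release_ = &defaults::release;
    GcFn gc_        = &defaults::garbageCollect;
};

// Example user callback: counts how often the custom allocation path is taken.
inline int& customAllocCalls() {
    static int calls = 0;
    return calls;
}
inline void* countingAlloc(std::size_t bytes) {
    ++customAllocCalls();
    return std::malloc(bytes);
}
```

The limitation discussed above still applies to this shape: because the callbacks carry no state of their own, a user who wants entirely different bookkeeping structures has nowhere to keep them.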

On a side note: in line with what you (@jacobkahn) originally suggested, I was contemplating completely abstracting memory management into a separate module (shared/static library) with its own API. ArrayFire would have a default implementation (af/memory_manager.h & libafmem.so) that gets shipped along with the existing backends. If a user wants to customize the memory management, they just need to adhere to the API (af/memory_manager.h) and generate their own libafmem.so.

jacobkahn · 2019-04-04T00:18:19Z

it should be the following

If I'm understanding it correctly, I don't think the above code will typecheck, since af_memory_manager_set_init_callback takes a std::function in the above example, but the definition of af_memory_manager_set_init_callback in your previous comment accepts an af_memory_manager_initialize_fn, which is typedef'ed to a void* fun(af_memory_manager). The callback registered here can't be cast to a void*.

that would be true w.r.t C-API always. Isn't it ?

This is the exact reason that I wrapped the C API with the C++ API with the MemoryManagerCWrapper: the C API must exclusively accept a void* (there aren't any other function pointer types in C). The C++ wrapper itself just calls those void* implementations as passed through the C-API using a C++ class that inherits from the base MemoryManagerBase. This way, AF internals can always assume a C++ class exists for the memory manager, even if the C API was used for the implementation. This fact makes it natural to use the C++ API; you don't need to use void* if you're using C++.

Regarding a re-entrant vs. standalone API: sounds like we're on the same page here with respect to what each one offers. As I mentioned above, I think providing an API where the user has maximum flexibility and control is much more in line with the vast majority of use cases. And if a user only wants to change a small part of the memory manager, copying a bit of code or writing a few no-ops seems a small price to pay for allowing other users to have maximum customization that could make AF even more performant for their use case!

libafmem.so

This is an interesting idea — @9prady9, what do you think it buys the user/what's the use case, and is it a good deal given the additional complexity? I don't think it's mutually exclusive in any way with the API I've provided, so it could definitely work with these changes. It's worth noting that it's easy for the user to switch memory managers at runtime with the proposed API.

9prady9 · 2019-04-04T03:52:13Z

If I'm understanding it correctly, I don't think the above code will typecheck, since af_memory_manager_set_init_callback takes a std::function in the above example, but the definition of af_memory_manager_set_init_callback in your previous comment accepts an af_memory_manager_initialize_fn, which is typedef'ed to a void* fun(af_memory_manager). The callback registered here can't be cast to a void*.

I am not sure why it would be cast to void*, as long as it fits the init_fn function signature, it would be accepted. I have created a snippet that compiles successfully here, please have a look.

I don't have enough information on what kind of memory-manager customizations are desired by our users at the moment. Maybe @pavanky or @umar456 have something different in mind.

About the separate module: I was just thinking out loud about whether it has any benefits. The reason I thought about it is that if memory management is a separate library with its own interface (header), then ArrayFire code wouldn't and can't make assumptions about how memory is being managed, which is good, I think. Maybe it doesn't make any difference.

jacobkahn · 2019-04-04T06:53:06Z

Ah yep — you're right. From 5.1.2 [expr.prim.lambda]:

The closure type for a lambda-expression with no lambda-capture has a public non-virtual non-explicit const conversion function to pointer to function having the same parameter and return types as the closure type’s function call operator

So this pattern can work as long as we have an empty capture list. That example is helpful. A few questions:

what is the function of get_user_ptr, and why does it need to take the handle?
is the proposal that MemoryManager be an interface that users derive from, or does the user need to register callbacks for each function they define? That itself incurs some overhead; I can imagine that ctor getting quite long defining callbacks.

what kind of memory-manager customizations

I can list a few use cases which we want to implement but that I think would also be interesting for others:

Arena-style allocators that allocate most of the device memory up front then choose how to manage it internally to save overhead to device-kernels for memory allocations/frees.
Special allocation strategies such as buddy allocation, obstack allocators, or pool allocators. Overall, the default allocator isn't great when allocation sizes are highly variable/granular, since a large step size must be set to avoid frequent garbage collection (see Degraded performance for variable input size arrayfire-ml#44) which itself is expensive and inefficient.
Changing how device allocation is managed, as you suggested previously — for example, using jemalloc in conjunction with other external signals to determine allocation patterns, or using stream-designated memory in CUDA.

We'd be looking to implement any and all of the above for our use cases.
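As a rough illustration of the first use case, an arena-style allocator reserves one large block up front and hands out slices by advancing an offset, so individual allocations never reach the underlying allocator. The sketch below uses host memory in a std::vector as a stand-in for device memory, and the class name BumpArena is hypothetical; it is not part of ArrayFire or this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BumpArena {
public:
    // Reserve the whole arena once; no further calls to the system allocator.
    explicit BumpArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns an aligned slice of the arena, or nullptr if it does not fit.
    // alignment must be a power of two.
    void* alloc(std::size_t bytes,
                std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        // Round the current position up to the next multiple of alignment.
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || bytes > buffer_.size() - start) {
            return nullptr;  // arena exhausted
        }
        offset_ = start + bytes;
        return buffer_.data() + start;
    }

    // Frees everything at once; individual frees are not supported.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }
    std::size_t capacity() const { return buffer_.size(); }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

Note that nothing in this structure resembles a per-buffer cache, which is why an implementation like this would have little use for the default manager's bookkeeping.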

9prady9 · 2019-04-05T04:29:38Z

what is the function of get_user_ptr, and why does it need to take the handle?

get_user_ptr enables passing member functions of a class as callbacks to our C-API. It basically stores the this pointer of the object whose member functions we want to use as callbacks. Later, when the lambda is invoked as a callback, it fetches the pointer and calls the respective member function.
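A self-contained sketch of this user-pointer pattern follows. The Handle struct and the setUserPtr/getUserPtr/setInitCallback/runInit helpers are hypothetical stand-ins for the opaque af_memory_manager handle and its accessors discussed above; only the shape of the trick is meant to carry over.

```cpp
#include <cassert>

// Hypothetical opaque handle with a slot for a user pointer and one callback.
struct Handle {
    void* user_ptr            = nullptr;
    void (*init_fn)(Handle*)  = nullptr;
};

inline void setUserPtr(Handle* h, void* p) { h->user_ptr = p; }
inline void* getUserPtr(Handle* h) {
    assert(h != nullptr);
    return h->user_ptr;
}
inline void setInitCallback(Handle* h, void (*fn)(Handle*)) { h->init_fn = fn; }

// What the library would do internally when it needs to initialize.
inline void runInit(Handle* h) {
    if (h->init_fn) { h->init_fn(h); }
}

class Manager {
public:
    Manager() {
        // Store this in the handle so the callback can find the object later.
        setUserPtr(&handle_, this);
        // A lambda with an empty capture list converts to a plain function pointer.
        setInitCallback(&handle_, [](Handle* h) {
            static_cast<Manager*>(getUserPtr(h))->init();
        });
    }
    // The handle stores this, so copying would leave a dangling association.
    Manager(const Manager&)            = delete;
    Manager& operator=(const Manager&) = delete;

    void init() { initialized_ = true; }
    bool initialized() const { return initialized_; }
    Handle* handle() { return &handle_; }

private:
    Handle handle_;
    bool initialized_ = false;
};
```

The handle has to be passed to the accessor because it is the only thing the C callback receives; the object pointer is recovered from it rather than captured.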

is the proposal that MemoryManager be an interface that users derive from, or does the user need to register callbacks for each function they define? That itself incurs some overhead; I can imagine that ctor getting quite long defining callbacks.

Deriving from an interface is still the idea. The only difference in my suggestion is that the C++ class doesn't do anything more than manage the registration of callbacks and help define the custom implementations. I am sure there is a way the user can avoid writing any registration calls themselves if they derive from an interface class ArrayFire provides, whose constructor takes care of registering the member functions as callbacks. So, the user only derives from this interface class and overrides/implements all the methods. I don't think callback registration is too much of an overhead since it is a one-time thing and is most certainly done only during application startup.
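One way this could look, as a sketch only: a base class whose constructor registers one trampoline per virtual method, so a derived class just overrides methods and never touches registration. Handle, ManagerInterface, and MallocManager are hypothetical names, and the two-method interface is deliberately smaller than the real one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical C-side handle holding the user pointer and two callbacks.
struct Handle {
    void* user_ptr                           = nullptr;
    void* (*alloc_fn)(Handle*, std::size_t)  = nullptr;
    void (*unlock_fn)(Handle*, void*)        = nullptr;
};

class ManagerInterface {
public:
    ManagerInterface() {
        handle_.user_ptr = this;
        // Trampolines only forward; virtual dispatch picks the derived override
        // when they are invoked later, after construction has finished.
        handle_.alloc_fn = [](Handle* h, std::size_t n) -> void* {
            return self(h)->alloc(n);
        };
        handle_.unlock_fn = [](Handle* h, void* p) { self(h)->unlock(p); };
    }
    virtual ~ManagerInterface() = default;
    ManagerInterface(const ManagerInterface&)            = delete;
    ManagerInterface& operator=(const ManagerInterface&) = delete;

    virtual void* alloc(std::size_t bytes) = 0;
    virtual void unlock(void* ptr)         = 0;

    Handle* handle() { return &handle_; }

private:
    static ManagerInterface* self(Handle* h) {
        assert(h != nullptr && h->user_ptr != nullptr);
        return static_cast<ManagerInterface*>(h->user_ptr);
    }
    Handle handle_;
};

// User code: override the methods, nothing else.
class MallocManager : public ManagerInterface {
public:
    void* alloc(std::size_t bytes) override {
        ++live_;
        return std::malloc(bytes);
    }
    void unlock(void* ptr) override {
        --live_;
        std::free(ptr);
    }
    int live() const { return live_; }

private:
    int live_ = 0;
};
```

Registration happens once per object in the base constructor, which matches the point above that its cost is paid only at startup.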

Thank you for the use-cases.

Arena Allocators & Obstack Allocations

Yes, that is a case where the user definitely needs to change the data structures to fit the allocation strategy I would assume.

Buddy & pool allocators and last case.

Can be handled with some minor modifications to existing data structures.

Having said that, I am not against refactoring existing implementation to allow customizability. I just think allowing the user to only change the areas they want to and handling the rest seamlessly is most desired. If that can be achieved, we should go for it.

That is my take on it. Other developers have been busy and weren't participating actively. I am hoping they will chime in next week for sure. Let's see what they have to say.

umar456 · 2019-04-10T20:43:47Z

This is a great discussion. I am sorry I haven't been more involved with this PR. I will try to address all of the points here:

There are a few things we need to address to allow for a custom memory manager. Because this is going to be a public facing API, we need to design the internals that take into account future changes to the way the memory manager will be used. I am going to begin by talking about separation of concerns between the components in the backend. I will then talk about the API in the next comment.

Separations of concerns

The JIT and the memory manager are currently linked. This is necessary because the JIT is able to hold references to objects that are no longer available to the user. For example:

array c;
{
    array b = randu(10);
    c = b + 1;
    // b is released here
}
// c still holds the reference to the b array's data

Potentially there could be many buffers which will not be released until the JIT tree is evaluated. If the application is not using too much memory then this is acceptable but becomes an issue when we are on a memory constrained system. In order to get around this, JIT will evaluate an expression tree if there is too much pressure.

It does this by asking the memory manager for some information and then basing its decision on that. I think the correct approach would be to ask the memory manager whether the number/size of the buffers referenced by the tree is too large. The suggested API:

// this is an awful name
bool checkReferences(int num_buffers, size_t total_buffer_size);

With this function we can remove many of the memory manager calls in the JIT codebase including getMaxBuffers, getMaxBytes, bufferInfo.
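To make the suggestion concrete, here is a minimal sketch of how a manager might answer that single query. Everything apart from the checkReferences signature is an assumption for illustration: the PressureLimits struct, the DemoMemoryManager class, the shouldEvaluateTree helper, and the convention that true means "too much, evaluate the tree now" are not taken from ArrayFire.

```cpp
#include <cstddef>

// Hypothetical limits a memory manager might track internally.
struct PressureLimits {
    int max_buffers;
    std::size_t max_bytes;
};

class DemoMemoryManager {
   public:
    explicit DemoMemoryManager(PressureLimits limits) : limits_(limits) {}

    // Assumed convention: returns true when the buffers referenced by a JIT
    // tree are too many or too large, so the tree should be evaluated.
    bool checkReferences(int num_buffers, std::size_t total_buffer_size) const {
        return num_buffers >= limits_.max_buffers ||
               total_buffer_size >= limits_.max_bytes;
    }

   private:
    PressureLimits limits_;
};

// Stand-in for the JIT side: it no longer needs getMaxBuffers, getMaxBytes
// or bufferInfo, only the single yes/no answer.
inline bool shouldEvaluateTree(const DemoMemoryManager& mm, int num_buffers,
                               std::size_t total_bytes) {
    return mm.checkReferences(num_buffers, total_bytes);
}
```

The point of the design is that the policy (what counts as too much) stays entirely inside the memory manager, so a custom manager can use any heuristic it likes.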

If you want I can perform this refactor but it should be pretty straightforward if you look at the createNodeArray function in Array.cpp.

I will look at other uses of the memory manager in ArrayFire but I think JIT is the only problem.

WilliamTambellini · 2019-04-15T18:37:19Z

Hi all
Let me check here whether #2490 could be done via this new feature/PR:

- the user implements a new memory manager for afcuda using CUDA unified memory (example: libmyafcudamem.so)
- the user registers this memory manager using the API (C and/or C++). Note: I would vote for "module style" dynamic loading: https://github.com/arrayfire/arrayfire/blob/master/src/backend/common/module_loading.hpp
- afcuda uses that custom memory manager instead of the default one implemented in afcuda

First, is this the expected behavior?
Kind regards,
WT.

jacobkahn · 2019-04-22T23:47:43Z

Responding first to @umar456:

Separation of Concerns

Definitely agree that there's a good middle ground here. It might make sense to create a very succinct API for querying the memory manager that the user is required to implement; the JIT may not be the only consumer of this information. Making everything more concise will obviously reduce the burden of using the interface.

Would be great to hear your feedback on the API as well.

Some general observations regarding dynamically loading memory managers (and @WilliamTambellini's thoughts/#2490):

Implementing AFCUDA to use CUDA unified memory #2490 with this API change would be extremely easy: while the API exposes this->nativeAlloc(...) to someone deriving from af::MemoryManagerInterface (which would, by default, call cudaMalloc, or cudaMallocHost for pinned memory, on the CUDA backend), one can ignore nativeAlloc entirely and implement their own native allocation function that calls cudaMallocManaged. That's it.
It seems like the capability that everyone is most interested in is the ability to switch memory managers at runtime for different backends. This change also makes that pretty simple, and works in conjunction with the unified backend. It would work like this:
- One creates an instance of a custom af::MemoryManagerInterface-derived memory manager for each backend (maybe one has custom derived implementations for each backend; makes no difference).
- When switching to each backend, set the respective memory manager for that backend. The user only needs to keep around references to each of their memory managers, regardless of which backend is active.

I'm not sure I'm clear on the particular advantages of using a separate library given this API: would be great if someone could elaborate a bit more on that.

umar456 · 2019-04-23T04:50:36Z

I am not sure I agree with the idea about a separate library. I think it complicates things and the user could implement it themselves on top of the current approach.

API

My main concern is about future needs of the library and the changes that will be necessary from the user to implement it.

The current implementation allows us to make some assumptions about the operations that are going to be performed by ArrayFire. One assumption we make is that the operations such as allocation and de-allocations are performed on the same queue. This allows us to free and reuse buffers before they are even used by the previous operations. For example:

auto ptr = memAlloc(size);   // ptr == 0x123
queueKernel(ptr, ...);       // async operation
memFree(ptr);                // safe to reuse but not free
auto ptr2 = memAlloc(size);  // ptr2 == 0x123
queueKernel2(ptr2, ...);

Here the free operation will be performed before the queueKernel has started and the pointer will be marked for reuse. The queueKernel2 operation will use the same memory address even though the first kernel hasn't started executing.

This makes efficient use of memory and avoids unnecessary allocations but this model breaks for multiple queues.

As GPUs become larger and more powerful, it will be harder to take full advantage of the hardware unless you have very large tensors or multiple queues. In order to implement multiple queues the memory manager will need to be aware of the order of operations between queues.

Up until now this wasn't an issue because the memory manager was an implementation detail not exposed to the user. This PR changes that and I am trying to figure out how to make it so that the API we implement does not limit the future direction of the library without a major break.

One way to implement this would be to use events (cudaEvent_t and cl_event). At each kernel call we will create an event object and assign that event to each of the operands of the kernel. When a buffer is deallocated, we will pass an event object along with the pointer to the memory manager. If the same buffer is reused in a later memAlloc, the memory manager will return the buffer and the associated event to the caller. Operations will be responsible for waiting on these events before they reuse the buffers. The memory manager will also have to wait on these events before the memory is freed.
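A rough sketch of that bookkeeping, under stated assumptions: MockEvent stands in for a backend event wrapper (cudaEvent_t or cl_event), and BufferEventPair and EventAwarePool are hypothetical names. Passing the size to unlock is a simplification made here; a real manager would track sizes itself and would wait on each event before actually freeing memory.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Mock stand-in for a backend event. An id of 0 means "no event".
struct MockEvent {
    int id;
    explicit operator bool() const { return id != 0; }
};

// What alloc hands back: the buffer plus the event marking its last use.
struct BufferEventPair {
    void* ptr;
    MockEvent event;
};

class EventAwarePool {
   public:
    EventAwarePool()                                 = default;
    EventAwarePool(const EventAwarePool&)            = delete;
    EventAwarePool& operator=(const EventAwarePool&) = delete;

    ~EventAwarePool() {
        // A real manager would wait on each cached event before this free.
        for (auto& entry : free_)
            for (auto& cached : entry.second) std::free(cached.ptr);
    }

    // Reuse a cached buffer of the same size if one exists. The caller must
    // wait on the returned event (if any) before touching the buffer.
    BufferEventPair alloc(std::size_t bytes) {
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            BufferEventPair reused = it->second.back();
            it->second.pop_back();
            return reused;
        }
        return BufferEventPair{std::malloc(bytes), MockEvent{0}};
    }

    // Mark a buffer reusable, remembering the event of its last use.
    void unlock(void* ptr, std::size_t bytes, MockEvent last_use) {
        free_[bytes].push_back(BufferEventPair{ptr, last_use});
    }

   private:
    std::map<std::size_t, std::vector<BufferEventPair>> free_;
};
```

A fresh allocation carries no event, so the caller has nothing to wait on; a reused buffer carries the event recorded when it was released, which is what lets a consumer on a different queue order itself correctly.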

I don't think this will change the current memory manager significantly but will require some work in the library.

Definitely agree that there's a good middle ground here. It might make sense to create a very succinct API for querying the memory manager that the user is required to implement; the JIT may not be the only consumer of this information. Making everything more concise will obviously reduce the burden of using the interface.

I think you are right. The memory manager API is a rather advanced interface, and I think it's acceptable if this is more complex. I think we may be too concerned with simplifying the API for this. I still want to avoid including any of the opinionated functions in the interface. Things like getMaxBuffers/Bytes should be removed and their current use within the library should be refactored. Use in the public API can be ignored because the user will have direct access to the custom memory manager.

The C API looks great. I think we can add a void* pointer to the af_memory_manager struct that can be assigned by the user. It will allow the user to maintain some state in the memory manager.

The C++ API should be built on top of the C API (if at all). We maintain ABI compatibility between minor versions, so we minimize the C++ machinery in the public-facing interface. This means no inheritance, standard library data structures, third-party library headers, etc. This minimizes the issues you would run into if you compile ArrayFire using one compiler and then use it with another.

jacobkahn · 2019-04-25T21:41:29Z

I am not sure I agree with the idea about a separate library

Agree — I think this change accomplishes everything having a separate library does.

Concurrency

One assumption we make is that the operations such as allocation and de-allocations are performed on the same queue.

Didn't know about this - thanks for flagging. This seems like a pretty memory-manager-implementation-specific assumption, but I'm surprised it works in general — what if the user triggered a garbage collection after the memFree call in your example? Or is this assumption only made internally?

As GPUs become larger and more powerful, it will be harder to take full advantage of the hardware unless you have very large tensors or multiple queues

Agree. On our end, we've been casually thinking about how one could adapt ArrayFire with multiple-CUDA streams, so we'd definitely want to make sure those use cases are supported by the manager.

One way to implement this would be using events

This seems like a reasonable overall solution; the memory manager could deal with some abstract af::MemEvent which would be a backend-specific wrapper. One obvious concern is that the memory manager now has to keep around these events (and users have to implement them properly), which is a bit odd.

While this would increase complexity, one option would be to somehow separate the API into async and non-async functions (e.g. asyncAlloc versus alloc) so that one would return an event-style future or something of the like. This could be tricky to make C-compatible.

API

add a void* pointer to the af_memory_manager struct

👍

We maintain ABI compatibility between minor versions

Was unaware of this; thanks. In that case, we should define the C++ API in terms of the C API, possibly using the design @9prady9 mentioned before. @umar456 do you have thoughts on that design?

As a follow-up, do you have thoughts about the design of the internals, in particular storing a C++ class on af::DeviceManager? That is, the current implementation has the external C++ API that wraps the external C API, and then an internal C++ wrapper (MemoryManagerCWrapper) which wraps the C++ API. This is nice because it avoids the need to explicitly store a bunch of function pointers, and lets us keep the existing AF API.

umar456 · 2019-05-01T23:37:01Z

Didn't know about this - thanks for flagging. This seems like a pretty memory-manager-implementation-specific assumption, but I'm surprised it works in general — what if the user triggered a garbage collection after the memFree call in your example? Or is this assumption only made internally?

Indeed. Garbage collection performs a synchronization, and all operations must finish before memory is freed. This is actually the same behaviour as the CUDA API: all cudaMalloc and free calls will perform an implicit cudaDeviceSynchronize. This was the motivation for the memory manager.

This seems like a reasonable overall solution; the memory manager could deal with some abstract af::MemEvent which would be a backend-specific wrapper. One obvious concern is that the memory manager now has to keep around these events (and users have to implement them properly), which is a bit odd.

Agreed. The memory manager only has to concern itself with the last event. My main concern is with the consumers of these events. This may be something we should implement with the default memory manager before exposing it. Would you mind if I try to implement this before we expose the memory manager? This way you will not have to deal with additional details that are not sufficiently documented.

While this would increase complexity, one option would be to somehow separate the API into async and non-async functions (e.g. asyncAlloc versus alloc) so that one would return an event-style future or something of the like. This could be tricky to make C-compatible.

The alloc calls are only going to be consumed internally. I don't know if it would be beneficial to expose these functions through the public API unless you can think of a compelling reason. As far as the user is concerned, they can directly interact with it in whatever interface they want.

Was unaware of this; thanks. In that case, we should define the C++ API in terms of the C API, possibly using the design @9prady9 mentioned before. @umar456 do you have thoughts on that design?

That design sounds reasonable. I don't see the utility of the C++ API but you are welcome to add that if you like.

As a follow-up, do you have thoughts about the design of the internals, in particular storing a C++ class on af::DeviceManager? That is, the current implementation has the external C++ API that wraps the external C API, and then an internal C++ wrapper (MemoryManagerCWrapper) which wraps the C++ API. This is nice because it avoids the need to explicitly store a bunch of function pointers, and lets us keep the existing AF API.

You want to expose the internal class to the user as a handle which is just a pointer to the object. Basically you create an object internally and cast its pointer to a void*. Take a look at the af_features object.

It is defined in include/af

typedef void * af_features;

Internally it is defined as a struct of af_arrays in src/api/c/features.hpp

typedef struct {
    size_t n;
    af_array x;
    af_array y;
    af_array score;
    af_array orientation;
    af_array size;
} af_features_t;

af_features is just a pointer to the af_features_t object that is used internally. See the af_create_features function:

af_err af_create_features(af_features *featHandle, dim_t num) {
    try {
        af_features_t feat;
        feat.n = num;

        if (num > 0) {
            dim_t out_dims[4] = {dim_t(num), 1, 1, 1};
            AF_CHECK(af_create_handle(&feat.x, 4, out_dims, f32));
            AF_CHECK(af_create_handle(&feat.y, 4, out_dims, f32));
            AF_CHECK(af_create_handle(&feat.score, 4, out_dims, f32));
            AF_CHECK(af_create_handle(&feat.orientation, 4, out_dims, f32));
            AF_CHECK(af_create_handle(&feat.size, 4, out_dims, f32));
        }

        *featHandle = getFeaturesHandle(feat);
    }
    CATCHALL;

    return AF_SUCCESS;
}

This allows us to change the internal size of the af_features_t object without worrying about binary compatibility issues for the user. It may not happen with af_features, but we have changed the size of our Array representation several times; because we expose it as a void* object, we don't have to worry about incompatibilities between different versions of arrays. As long as the function contracts are maintained, you should be able to upgrade the library without issues even if the underlying structure is different between versions.
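The same opaque-handle pattern, reduced to a self-contained sketch. All names here (demo_handle, demo_object_t, demo_create, demo_get_count, demo_release) and the integer error codes are hypothetical and chosen for illustration; they are not ArrayFire functions. Only the typedef and the function declarations would be visible to users, so the internal struct can grow without breaking callers.

```cpp
#include <cstddef>

// Public side: all a user-facing header would expose.
typedef void* demo_handle;

// Internal side: free to change between versions.
struct demo_object_t {
    std::size_t n;
    double scale;  // imagine this field was added in a later release
};

// Assumed convention for this sketch: 0 is success, non-zero is an error.
inline int demo_create(demo_handle* out, std::size_t n) {
    if (out == nullptr) return 1;
    *out = new demo_object_t{n, 1.0};
    return 0;
}

inline int demo_get_count(std::size_t* n, demo_handle h) {
    if (n == nullptr || h == nullptr) return 1;
    // Only library code ever casts the handle back to the real type.
    *n = static_cast<demo_object_t*>(h)->n;
    return 0;
}

inline int demo_release(demo_handle h) {
    delete static_cast<demo_object_t*>(h);
    return 0;
}
```

Because callers never see sizeof(demo_object_t) or its layout, adding the scale field changes nothing for code compiled against an older header, which is the property being described for af_array and af_features.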

jacobkahn · 2019-05-28T18:57:44Z

@umar456 — sorry for the big delay. Thanks for all that detail about synchronization and the memory manager — it's very helpful. I think I understand the JIT interoperability much better now and the requirements there.

Would you mind if I try to implement this before we expose the memory manager

That would be great. I may also be able to give it a go, but probably won't have bandwidth for another week or two at least, and since you understand the internals better, I'd learn more from your attempt.

I don't know if it would be beneficial to expose these functions through the public API

Agree that they probably don't need to be exposed; the contract will just be that they may be called asynchronously in the user's implementation by ArrayFire. It seems like we may need an internal class to mediate these things with the JIT, a bit like MemoryManagerCWrapper is doing now, but with some other things on top of it.

I don't see the utility of the C++ API

I'll leave it out and just provide a C API. Users can wrap it themselves in their own C++ class if they like; it'll be less-opinionated that way.

You want to expose the internal class to the user as a handle which is just a pointer to the object

This pattern looks good to me. I'll work on making those changes to the API once I have an idea about how the async components will work. Will you have time to implement that soon?

umar456 · 2019-05-28T22:25:40Z

This pattern looks good to me. I'll work on making those changes to the API once I have an idea about how the async components will work. Will you have time to implement that soon?

I started implementing this the other day. I should have a PR later in the week.

jacobkahn · 2019-06-05T18:20:32Z

@umar456 — any update here? Let me know if there's anything I can do to help.

umar456 · 2019-06-05T18:36:12Z

I have added support for Events in #2526 but I am having issues with Windows. The CUDA platform will segfault if I try to create an event after the main function (in case there is a global array and its destructor is called after exiting main). The segfault occurs in the CUDA libraries responsible for reloading the drivers. I suspect this is a bug in the CUDA driver but I haven't made a standalone example for a bug report. CUDA should return 'cudaErrorCudartUnloading' instead of segfaulting.

umar456 · 2019-06-21T17:08:07Z

Hey @jacobkahn, I have merged the #2526 PR into master. It should contain basic support for events at the memory manager level. The basic interface for alloc and free remains the same, but the memory manager now accepts an event object when freeing and returns an event object when allocating. Here are the new alloc calls.

https://github.com/arrayfire/arrayfire/blob/master/src/backend/cuda/memory.cpp#L61

template<typename T>
uptr<T> memAlloc(const size_t &elements) {
    size_t size         = elements * sizeof(T);
    MemoryEventPair me  = memoryManager().alloc(size, false);
    cudaStream_t stream = getActiveStream();
    if (me.e) me.e.enqueueWait(stream);
    return uptr<T>(static_cast<T *>(me.ptr), memFree<T>);
}

https://github.com/arrayfire/arrayfire/blob/master/src/backend/cuda/memory.cpp#L77

template<typename T>
void memFree(T *ptr) {
    Event e = make_event(getActiveStream());
    memoryManager().unlock((void *)ptr, move(e), false);
}

The rest of the API remains the same. Let me know if you have questions or comments. We can still change the API here if you can think of a way to improve things.

jacobkahn · 2019-07-31T16:31:35Z

@umar456, I didn't have a chance to comment on the Events API before it was merged, but as I'm re-implementing the framework, I'm encountering a somewhat confounding issue. Because MemoryManager::alloc returns Events and MemoryManager::unlock takes them, they can't be included as they are in the C API without further changes, since af::detail::Event is C++-only. I see a few paths forward here, and it would be great to hear everyone's thoughts:

- Make af::detail::Event part of the public API and a generic C struct. Since the C++ class has a move ctor/other specifics attached to it, we might need to wrap it in some af_event_t of which an af::detail::Event handle is a member.
- Abstract away Event components from the default memory manager so that no memory manager has to know about events, and they are handled on the native device/platform level only. This might involve creating a sort of MemoryEventManager singleton that stores maps from pointers/sizes to Events and has a new native alloc/free interface with two new functions, nativeAllocEvent and nativeFreeEvent, which take/return only pointers, but internally use events to make sure free memory isn't re-alloced before the JIT is done with it.

The second option seems difficult to implement considering it opinionates any custom memory manager off the bat by forcing it to be very size-aware. Only an 'exact' reuse of a piece of memory in a particular way would trigger waiting as needed.

Let me know what you think.

umar456 · 2019-07-31T17:04:08Z

I like the first approach. I think there are a few advantages to exposing the event object externally. I also think it would be beneficial for the memory manager to be aware of events as it can be used to sort the list of potential free buffers based on their status. The af_event object should be implemented using a similar approach to the af_array and af_feature classes. You will have to create several functions that process events.

I agree that the second approach is difficult to implement and could cause issues with multi-threaded code. I would like to avoid creating a singleton if possible.

- Rename af_release_event to af_delete_event and make arg non-const
- Make af::event::block() const
- Improvements to memory manager API documentation throughout

WilliamTambellini · 2019-12-03T00:34:37Z

Hello @jacobkahn
I have cloned and built your fork
https://github.com/jacobkahn/arrayfire.git
and run the NN MNIST example and looks like there is a speed drop at least for that one :

[wtambellini@lasdewtambe02 ~/repos/afjacob/release] (master)
$ examples/machine_learning/neural_network_cuda
ArrayFire v3.7.0 (CUDA, 64-bit Linux, build e37cac8)
Platform: CUDA Toolkit 10.0, Driver: 418.56
[0] GeForce GTX 1060, 6079 MB, CUDA Compute 6.1
** ArrayFire ANN Demo **
...
Training time: 6.8353 s

vs af master:

[wtambellini@lasdewtambe02 ~/repos/arrayfire/Release] (master)
$ examples/machine_learning/neural_network_cuda
ArrayFire v3.7.0 (CUDA, 64-bit Linux, build c30d545)
Platform: CUDA Toolkit 10.0, Driver: 418.56
[0] GeForce GTX 1060, 6079 MB, CUDA Compute 6.1
** ArrayFire ANN Demo **
...
Training time: 6.4855 s

Could you run on one of your GPUs to confirm/repro ?

jacobkahn · 2019-12-03T15:29:45Z

@WilliamTambellini — on extremely small networks run for a very short amount of time (e.g. MNIST), there may be a small regression presumably due to vtable cache miss overhead (@umar456 has run that same benchmark without Intel Turbo boost and can say more).

I've run the flashlight Alexnet benchmark (much larger but obviously still tiny by today's standards), and I'm actually seeing better performance with this PR, although the difference is under 0.1%. This is presumably due to better vtable cache performance, where, even over a few seconds, things get amortized away. With CUDA 9.2 on one NVIDIA Quadro GP100:
With PR:
34.197 msec
34.361 msec
34.472 msec
34.435 msec

Without PR:
34.764 msec
34.713 msec
34.643 msec
34.389 msec

Overall, I'm very confident that the upsides of being able to write custom memory managers will more than make up for this several-fold. I've already written memory managers in flashlight that give a very significant performance boost that I'll fully-benchmark and start open-sourcing once this PR is merged.

WilliamTambellini · 2019-12-03T19:50:20Z

Hi @jacobkahn
Tks.
On my side, I've run one of my typical NNs (production-sized, several hundred thousand weight and bias parameters):
afmaster: ArrayFire v3.7.0 (CUDA, 64-bit Linux, build c30d545) : 37.126000 secs
afjacob: ArrayFire v3.7.0 (CUDA, 64-bit Linux, build e37cac8) : 39.194000 secs

As of today, without any example/evidence of the advantage of a custom memory manager, this change is, for regular AF users, just an additional source of speed drop.
That is especially bad since af master is already up to 30% slower than 3.6.4:
#2673

Your benchmark (Alexnet) seems to cover training (running both fwd and bwd). For production, the speed of inference usually has a higher priority than the speed of training. Have you tested the speed impact of this change for inference only ('normal' fwd pass, no grad/delta)?

Kind regards
W.

jacobkahn · 2019-12-03T20:18:04Z

@WilliamTambellini — thanks for that larger benchmark. That's a significant difference. With that large a difference, we should investigate a bit further/understand the slowdown. That said, it's somewhat illogical that this PR alone is causing this considering the only overhead is to the vtable. It's possible that changes in how we handle events could exacerbate the issues you brought up in #2673 — I think you mentioned this before.

I reran my benchmarks and removed the backwards component. There's still not much of a difference on my side. Either way, I'm going to run perf and see what it shows.

jacobkahn · 2019-12-03T23:50:57Z

@WilliamTambellini I forgot to share, but we also have full benchmarks with wav2letter + flashlight with a very large model (a 100 million parameter model) that actually show a speedup with AF master (without this PR) [blue is master, orange is 3.6.4]:

Full logs are here: https://gist.github.com/jacobkahn/ecf18371f52332dc978c5e713f2b677c

It would be nice to learn a bit more about your benchmarks/profiling and see where you might be seeing some of this slowdown.

WilliamTambellini · 2019-12-04T19:25:28Z

Hi @jacobkahn
Tks. I have done more tests, and it looks like the speed drop of master vs this PR branch only happens when using afunified. Are you linking directly with afcuda or with afunified?
W.

- C API functions with the default memory manager were, when used with the unified backend, causing symbol table lookups which slowed things down
- A unified neural network example now benchmarks similarly on master and with custom memory manager integrations after the change when built and linked to the unified backend

jacobkahn · 2019-12-05T16:38:15Z

@WilliamTambellini — was only linking with afcuda. That would explain it.

The commit above removes all C API functions from the default memory manager because those will dispatch a [slow] symbol table lookup according to @umar456 if called with the unified backend. When I test with the example locally, this completely removes the performance gap with master.

WilliamTambellini · 2019-12-05T20:18:26Z

@jacobkahn Tks, I've retried with the new changes on your branch and the perf is indeed better now, basically like afmaster (the perf of afmaster is still sometimes bad compared to 3.6.4, but that's another issue).
Could I do anything to speed up the merge of this PR?

jacobkahn · 2019-12-05T23:26:06Z

@WilliamTambellini — great to hear!

Nothing blocking the merge on my side. cc @umar456


9prady9 · 2019-12-06T04:17:00Z

@jacobkahn @WilliamTambellini @umar456 Great job! Finally this one is in!

Summary: Beginning to check in some memory management framework code.

**NB**: I won't land this until ArrayFire 3.7 is out; for now, this only runs on master. Putting in `contrib` for now since it doesn't build inside FB, but will move this once it's landable into `flashlight/flashlight/memory`.

The framework has several components:

**A C++ wrapper for the ArrayFire C memory manager API** (`fl::MemoryManagerAdapter`) added in arrayfire/arrayfire#2461.
- Contains AF public interface functions that can be overridden to facilitate building a custom memory manager.
- Everything else is totally unopinionated. Besides JIT functions, memory management implementations can theoretically do anything. I'm leaving out interoperability with `af_event` here and will add that in a separate diff.
- `fl::MemoryManagerAdapter` differs slightly from the internal AF API because that needs to support opinionated memory functions that are also in the public AF device API (such as memory step size, `usageInfo`, etc).

**A memory manager adaptor** to facilitate easily using C++ memory manager implementations/wrap AF functions.
- AF memory management APIs expect an `af_memory_manager` for ABI compatibility; the manager adapter creates an `af_memory_manager` which corresponds to the AF handle corresponding to the C++ implementation.
- The `fl::MemoryManagerAdapter` that corresponds to the `af_memory_manager` is added as the payload to the `af_memory_manager`.
- When the manager installer is created and passed an `fl::MemoryManagerAdapter`, it creates function pointers and sets those in the relevant `af_memory_manager`. Each function pointer accomplishes the following:
  - Function pointer callbacks for the AF memory manager API are all passed an `af_memory_manager`. Since the C++ implementation is a `void*` payload on the `af_memory_manager`, it can be retrieved, then the proper C++ function on the implementation called.
  - Calls `log(...)` on the handle payload to log the native AF call.

**A logging framework for memory management** that logs ArrayFire requests for memory and calls to functions inspecting memory manager state used to determine JIT behavior.
- To enable logging with a memory manager, call `fl::MemoryManagerAdapter::setLoggingEnabled(...)` after setting an output stream with `setLogStream`.
- `fl::MemoryManagerAdapter::log(...)` is called inside the manager adapter's lambdas using the function on the impl.
- Logging can also easily be performed from the memory manager directly using `log(...)` in order to log user-defined functions.

Reviewed By: avidov
Differential Revision: D19056964
fbshipit-source-id: b02e0107d9cfab2f09abbb5f55774b89679a6f01

jacobkahn mentioned this pull request Mar 19, 2019

Add framework for extensible ArrayFire memory managers #2460

Closed

9prady9 mentioned this pull request Apr 15, 2019

AFCUDA to use CUDA unified memory #2490

Open

Changes to af::event and memory manager docs

fbe22b4

- Rename af_release_event to af_delete_event and make arg non-const
- Make af::event::block() const
- Improvements to memory manager API documentation throughout

CPU queue memory pressure calls in Release build

1fbe79c

Add final to all derived memory manager classes

a5c8b0c

Add final to MemoryManagerFunctionWrapper and AllocatorPinned

4885c8a

umar456 approved these changes Dec 5, 2019


umar456 merged commit a448544 into arrayfire:master Dec 6, 2019
