Deferred execution NEP #8889


Closed
wants to merge 1 commit

Conversation

mattharrigan
Contributor

This PR is for a NEP with a possible mechanism/API for deferred execution of NumPy operations, allowing for various optimizations. I thought I would post here for initial feedback before starting a discussion topic. I think this might be a nice use case for the new __array_ufunc__. In fact, recent discussions in PR #8247 sparked this concept, particularly some comments by @mhvk.

@njsmith
Member

njsmith commented Apr 4, 2017

This is indeed one of the goals of adding __array_ufunc__, but... There are substantial downsides to putting code into numpy itself. (One of them being the need to write a nep and argue about it instead of just going and doing it :-).) What benefits come from this being in numpy instead of a standalone library? How do you expect we could be competitive with libraries that are dedicated to this use case like dask and theano?

@shoyer
Member

shoyer commented Apr 4, 2017

I'd love to see someone develop a library like this, building up an abstract graph to represent NumPy arrays, but I don't think it needs to live in NumPy.

This also bears some similarity to Dask and Blaze.

CC @mrocklin

@mattharrigan
Contributor Author

mattharrigan commented Apr 4, 2017

I'll give my case for having it part of numpy:

  1. Large libraries already using NumPy that would like to have the performance benefits but are resistant to adding additional dependencies
  2. Many users are familiar with NumPy but not with other libraries. I know of Dask and Theano but have not used them. I have used Numba, but just a little bit. If it were part of numpy then dramatically more people would know of and use it, which ultimately is one of the goals.
  3. This may be a stretch/unfounded, but I believe NumPy is of the highest quality. I think that is because it is so widely used, and because of the people/process used in developing and testing. If it were some other library I suspect it may be of lesser quality.
  4. What I proposed would be single CPU core. Certainly other libraries could add GPU (Theano) or cluster processing (Dask). If NumPy had a good deferred execution API then it could serve as unifying across those other libraries. Adding cluster processing to a NumPy script could literally only require an import statement "from some_new_library import AwesomeClass as DeferredExecution"

Also I don't think arguing over a NEP is a downside. Instead I see it as a strength.

@juliantaylor
Contributor

note that there already is a nep for this:
https://github.com/numpy/numpy/blob/master/doc/neps/deferred-ufunc-evaluation.rst

@rgommers
Member

rgommers commented Apr 4, 2017

I'll give my case for having it part of numpy:
...

Even if all of those points are true, it doesn't follow that this functionality should live in NumPy from the start. It may be a good approach to work out a NEP, then implement it as a separate project, add enough tests to NumPy for the NumPy interfaces that enable the functionality in the separate project, and only after everything has proven itself for a while consider moving things into NumPy.

This is complex enough that an implementation won't be right the first time (see __numpy_ufunc__ ...).

@mhvk
Contributor

mhvk commented Apr 4, 2017

Agreed that, if at all possible, it is best to develop something separate first and then try to include it. But a NEP is definitely a good way to start thinking about this! So, one general comment on it: I think we need to be very careful to define what we mean by "deferred execution" and in particular what its purpose is. All the cases I can think of can (I think) be summarized with the following example:

phase = frequency * (time - time0)
signal = amplitude * np.sin(phase)

this would normally be executed as

np.subtract(time, time0, out=tmp)
np.multiply(frequency, tmp, out=phase)
np.sin(phase, out=tmp2)
np.multiply(amplitude, tmp2, out=signal)

it would be lovely (in increasing steps of complexity) if it were realised that:

  1. tmp and tmp2 will be discarded anyway, so one might as well re-use them for the later output, i.e., both multiplies become in-place; this was done in ENH: avoid temporary arrays in expressions (again) #7997 (@juliantaylor: is this optimizable further in python 3.6 with its frame execution hooks?).
  2. all operations are ufuncs, which operate on single elements, so it should be faster to iterate over all inputs, combining inner loops (reusing outputs as above), i.e., one "chains ufuncs"; this is something I'm thinking about (see below).
  3. phase is not needed immediately, so one might as well defer execution for it until it is needed (possibly allowing further optimization via 1 and 2); this is the goal of the NEP, correct?
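The temporary-reuse idea in point 1 can be written out by hand. A minimal sketch, assuming array-valued time and scalar time0, frequency, amplitude from the example above (here phase itself is also folded into the temporary, for brevity):

```python
import numpy as np

# hypothetical inputs matching the example above
time = np.linspace(0.0, 1.0, 1000)
time0, frequency, amplitude = 0.5, 2.0, 3.0

tmp = np.subtract(time, time0)            # one temporary is unavoidable
np.multiply(frequency, tmp, out=tmp)      # reuse tmp instead of allocating phase
np.sin(tmp, out=tmp)                      # reuse tmp again instead of tmp2
signal = np.multiply(amplitude, tmp, out=tmp)
```

This is a hand-written version of what #7997 does automatically for eligible temporaries in expressions.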

My own goal for 2 aims simply to allow one to define a chained ufunc (for which __array_ufunc__ is also essential); I think that as ufunc definitions allow one to pass fairly arbitrary *data, this should not be too difficult. I think it would also be appropriate for numpy given that there have been quite a few requests for specific functions that chain 2 ufuncs; we could implement those using this mechanism.

Well, now that I've written this, I'm not so sure any more whether my stuff is relevant to your goal (item 3)... I probably should write my own NEP. But better first get __array_ufunc__ in... For here, it may still be useful to be very explicit about a case where the deferral would really be useful.

@mattharrigan
Contributor Author

@mhvk it's not so much chaining ufuncs as operating on cache-sized blocks to reduce memory bandwidth. For large arrays, many operations are memory-IO bound and the CPU spends most of its time waiting to read or write data to main memory. A much more efficient way to execute them would be to process a cache-sized chunk through multiple operations. A crudely optimized example is below; note that many other optimizations could be done, and it would be significantly faster in C than in Python:

arr = np.arange(1e7, dtype=float)
chunk = 2**14

def f(x):
    sum_sq = 0
    for i in range(0, len(x), chunk):
        sum_sq += np.add.reduce(np.square(x[i:i+chunk] * (np.pi/180)))
    return sum_sq

In [20]: %timeit f(arr)
10 loops, best of 3: 64.3 ms per loop
In [21]: %timeit np.add.reduce(np.square(arr * (np.pi/180)))
10 loops, best of 3: 132 ms per loop

@rgommers I agree that might be a good path. We definitely don't want to release/enable it before it's mature enough. At this point I am much more concerned with whether this functionality should ever be included in numpy and whether the proposed API is any good. The development strategy can be hammered out once the first two are cleared. Also, I am somewhat hopeful that experience from the previous NEP and from projects such as numexpr, theano, dask, numba, and blaze will improve our chances of getting it right sooner.

@shoyer
Member

shoyer commented Apr 4, 2017

I would be interested in this for NumPy if:

  1. It's prototyped outside NumPy, so we're sure it works.
  2. It's mostly focused on building a shared interface for abstract computation graphs on arrays, as a common language to be shared between array libraries. For example, a higher-level array library like xarray could use this to build up a computation graph that could then be executed by Dask, NumPy or some other distributed execution engine. The deferred execution engine should not be the point of this library, although it would be fine to include a simple runner based on NumPy.

More concretely, I think the interface should be closer to:

# array_like can be any object with a shape and dtype
deferred = DeferredExecutionArray(array_like)
...
result = some_engine.run(sum_square)

where some_engine could be from NumPy or Dask or whatever.
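A minimal sketch of such an interface is possible with __array_ufunc__ plus NDArrayOperatorsMixin (both landed in NumPy 1.13, after this discussion). The class name and the trivial "engine" below are hypothetical, not an agreed API:

```python
import numpy as np

class DeferredArray(np.lib.mixins.NDArrayOperatorsMixin):
    """Hypothetical: records ufunc applications instead of executing them."""
    def __init__(self, value=None, node=None):
        self.value = value      # concrete ndarray for leaf nodes
        self.node = node        # (ufunc, method, inputs) for deferred nodes

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if kwargs:
            return NotImplemented       # ignore out=, where=, etc. in this sketch
        return DeferredArray(node=(ufunc, method, inputs))

def run_numpy(x):
    """A trivial engine that replays the recorded graph eagerly with NumPy."""
    if not isinstance(x, DeferredArray):
        return x
    if x.node is None:
        return x.value
    ufunc, method, inputs = x.node
    return getattr(ufunc, method)(*[run_numpy(i) for i in inputs])

deferred = DeferredArray(np.arange(4.0))
graph = np.square(deferred) + 1.0   # builds a graph; nothing is computed yet
result = run_numpy(graph)           # array([ 1.,  2.,  5., 10.])
```

The mixin supplies the arithmetic operators, so plain Python expressions on a DeferredArray route through __array_ufunc__ and extend the graph; a different engine would walk the same (ufunc, method, inputs) tuples and lower them to its own representation.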

@mhvk
Contributor

mhvk commented Apr 4, 2017

@mattharrigan - yes, the speed-up you mention is the goal of my step (2): let nditer provide the various inner loops with chunks of the right size and do the chaining at that level rather than for the full arrays. Basically, my idea is to have an easy way to create custom ufuncs that are chained versions of regular ones.
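One way to sketch that chunk-level chaining today is with a buffered nditer, which hands the Python-level inner loop cache-sized blocks. This is only an illustration of the idea, assuming a fused expression and a hypothetical function name, not mhvk's actual design:

```python
import numpy as np

def fused_signal(time, time0, frequency, amplitude, buffersize=2**14):
    """Hypothetical chained 'ufunc': all four steps run per buffered chunk."""
    out = np.empty_like(time)
    it = np.nditer([time, out],
                   flags=['external_loop', 'buffered'],
                   op_flags=[['readonly'], ['writeonly']],
                   buffersize=buffersize)
    with it:  # the context manager flushes write-back buffers on exit
        for t_chunk, o_chunk in it:
            # the whole subtract/multiply/sin/multiply chain stays in cache
            o_chunk[...] = amplitude * np.sin(frequency * (t_chunk - time0))
    return out
```

A C-level implementation would fuse the inner loops themselves rather than calling back into Python per chunk, but the memory-traffic benefit is the same.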

@mattharrigan
Contributor Author

@shoyer For 1, agreed. I also like the API change, the array shouldn't be responsible for building up the operation graph and executing it.

For 2, my initial thought was that the DeferredExecutionArray would not have public methods exposing the operation graph. It would be used as a proxy through the code and then used by the NumPy engine. The API for a typical user writing a script would be consistent/unified across execution libraries, but each execution library would write its own class for building the operation graph exactly how they want it. IIUC you are proposing that the new array would have public methods for exposing the operation graph and would be used directly by library writers for execution. I definitely see value in that approach. My concern is that the API would be much larger and more complex. Also I'm not entirely sure multiple execution engines would like the same thing. A very quick scan of dask and numexpr indicate that there are many differences in how they build graphs. However perhaps we could collect a superset of data so that either of those forms would be producible from our graph.

@njsmith
Member

njsmith commented Apr 5, 2017

At this point I am much more concerned about whether this functionality should ever be included in numpy and whether the proposed API is any good. The development strategy can be hammered out once the first two are cleared.

I think you have this backwards :-). It's always tricky to give good feedback on this kind of thing, because we don't want to discourage people and do want to leave the door open to new ideas, but... I would say there's basically zero chance of something like this being included in numpy proper unless it's (a) demonstrated some compelling success as a standalone library first, and (b) there's some reason why putting it in numpy will have compelling advantages over it living on as a standalone library. This isn't because we want to be mean; it's exactly the opposite :-). Putting your code into numpy is really a bad idea if it can be at all avoided, both for you and for your code. You have to integrate with a complicated and rickety code base, you end up having to maintain unreasonable levels of backwards compatibility that hobble your ability to make improvements, and if your code is ever replaced by something better then you're still stuck maintaining it forever – or someone will be. (Numpy has many examples of code that's stuck in this kind of grim afterlife, starting with np.matrix...)

Really the question should be "is there any way we can avoid putting this into numpy?" And this is a major motivation behind __numpy_ufunc__ and related proposals: to take things that would previously have required difficult changes to numpy's core, and open them up so that anyone with a PyPI account can do them.

I'd encourage you to start with a prototype. That'll teach you more about the API quality than any amount of review here. And if you run into limitations where you need changes to numpy's core, then we'd be very interested to hear about them.

@shoyer
Member

shoyer commented Apr 5, 2017

IIUC you are proposing that the new array would have public methods for exposing the operation graph and would be used directly by library writers for execution.

Yes, exactly.

Also I'm not entirely sure multiple execution engines would like the same thing.

This is certainly true: every execution engine is going to want to build its own representation of the data. There are lots of ways to represent graphs, but fundamentally every computation graph is a tree, and converting back and forth is usually relatively straightforward. For DeferredExecutionArray, I would use an extremely simple format, e.g., simply collecting the input arguments to __array_ufunc__ and saving them as an array property.

The advantage of DeferredExecutionArray is that it could propagate shape and dtype properties and do error checking on arguments as it builds the graph, before passing things off to an execution engine (at which point things may be much harder to debug).
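That early error checking can be sketched with NumPy's own promotion helpers (np.broadcast_shapes needs NumPy >= 1.20; the class and function names here are hypothetical):

```python
import numpy as np

class DeferredMeta:
    """Hypothetical: the shape/dtype a deferred node would have if executed."""
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

def elementwise_meta(a, b):
    # raises ValueError for incompatible shapes while building the graph,
    # long before any execution engine runs
    shape = np.broadcast_shapes(a.shape, b.shape)
    dtype = np.result_type(a.dtype, b.dtype)
    return DeferredMeta(shape, dtype)

m = elementwise_meta(DeferredMeta((3, 1), 'float32'),
                     DeferredMeta((4,), 'float64'))
# m.shape == (3, 4), m.dtype == float64
```

Because only metadata flows through, graph construction stays cheap while still surfacing shape and dtype mistakes at the line of user code that caused them.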

@mattharrigan
Contributor Author

@njsmith Thank you very much for your candor. I think that is good advice and I don't find anything you said mean at all. Candidly I probably lack both the skill and time required to create a large complex library on my own.

@mattharrigan
Contributor Author

What should be done with the PR? My guess is to close it.

@mhvk You were indeed going after some of the same goals as this PR, sorry for my confusion. I think your approach is far simpler and more achievable. I would be interested in seeing exactly what you want to do and helping if you would like.

Thanks all

@mhvk
Contributor

mhvk commented Apr 5, 2017

@mattharrigan - OK, I'll keep you in the loop! But first order will be to get __array_ufunc__ in...

As for what to do with this PR: my own feeling would be to add a note to the existing NEP about using __array_ufunc__ to implement this; I feel your particular suggestion should not be lost, but things should also be in one place. But I don't really know what the numpy policy on NEPs is...

@rgommers
Member

rgommers commented Apr 5, 2017

But I don't really know what the numpy policy on NEPs is

We don't have an explicit one. I'd say let's do this similar to PEPs: if @mattharrigan feels it makes sense to merge the two NEPs then he can just edit the existing one (Mark won't be coming back to it I think), and otherwise two competing NEPs is not a problem.

@njsmith
Member

njsmith commented Apr 6, 2017

Candidly I probably lack both the skill and time required to create a large complex library on my own.

Yeah, that's always a challenge – but if it helps (...it probably doesn't help 😄), this is a problem you'd have to solve either way. It's entirely possible you might be able to convince some of the people in this thread to help you with a third-party library; and vice versa, if you don't have the time to write the code and you can't find volunteers to help, then targeting numpy won't make helpers magically appear. (If only...)

@mhvk mhvk mentioned this pull request Oct 21, 2017