 Buffered general ufunc explanation
 ==================================

 .. note::

   This was implemented already, but the notes are kept here for historical
   and explanatory purposes.

 We need to optimize the section of ufunc code that handles mixed-type
 and misbehaved arrays.  In particular, we need to fix it so that items
 are not copied into the buffer if they don't have to be.

 Right now, all data is copied into the buffers (even scalars are copied
 multiple times into the buffers even if they are not going to be cast).

 Some benchmarks show that this results in a significant slow-down
 (factor of 4) over similar numarray code.

 The approach is therefore to loop over the largest dimension (just like
 the NO_BUFFER portion of the code).  All arrays will have either N or
 1 in this last dimension (or there would be a mismatch error).  The
 buffer size is B.

 If N <= B (and only if needed), we copy the entire last-dimension into
 the buffer as fast as possible using the single-stride information.

 Likewise, we copy into the output arrays only if needed (otherwise the
 output arrays are used directly in the ufunc code).

 Call the function using the appropriate strides information from all the input
 arrays.  Only set the strides to the element-size for arrays that will be copied.
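
 The stride handling described above can be sketched with a toy strided
 inner loop.  This is a minimal illustration, not the NumPy source:
 ``add_doubles`` and its argument layout are hypothetical, and using a stride
 of 0 for a length-1 (scalar) operand is an assumption made here so that the
 scalar is read in place instead of being copied repeatedly.

```c
#include <assert.h>
#include <stddef.h>

/* Toy strided inner loop: out[i] = a[i] + b[i].
 * args holds the three data pointers, steps the three strides in bytes.
 * (Hypothetical layout, chosen only for this sketch.) */
static void add_doubles(char **args, size_t n, const ptrdiff_t *steps)
{
    char *a = args[0];
    char *b = args[1];
    char *out = args[2];

    assert(args != NULL && steps != NULL);
    for (size_t i = 0; i < n; i++) {
        *(double *)out = *(double *)a + *(double *)b;
        /* Each operand advances by its own stride.  An operand used in
         * place keeps its array stride; an operand copied into a buffer
         * would use the element size; a scalar uses 0 (assumption). */
        a += steps[0];
        b += steps[1];
        out += steps[2];
    }
}
```

 Calling it with steps ``{2 * sizeof(double), 0, sizeof(double)}`` reads every
 other element of the first input directly from its array, reuses the scalar
 on every iteration, and writes a contiguous output, all without any copy.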

 If N > B, then we have to do the above operation in a loop (with an extra loop
 at the end with a different buffer size).

 Both of these cases are handled with the following code::

    quotient = N / B    # integer division
    remainder = N % B   # integer remainder, so N = quotient * B + remainder
    store (quotient + 1) as the number of inner loops

 On the inner dimension we will have (quotient + 1) loops, where the size
 passed to the inner function is B for all but the last loop, for which the
 niter size is remainder.
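
 As a small worked sketch of this arithmetic, the helper below splits N
 elements into buffer-sized chunks.  The names ``chunk_plan``,
 ``plan_chunks`` and ``chunk_size`` are hypothetical and exist only for
 this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how N elements split into chunks of size B. */
typedef struct {
    size_t quotient;   /* number of full chunks of size B            */
    size_t remainder;  /* size of the final, shorter chunk (may be 0) */
    size_t nloops;     /* quotient + 1, as in the notes              */
} chunk_plan;

static chunk_plan plan_chunks(size_t n, size_t bufsize)
{
    chunk_plan p;

    assert(bufsize > 0);
    p.quotient = n / bufsize;    /* integer division  */
    p.remainder = n % bufsize;   /* integer remainder */
    p.nloops = p.quotient + 1;
    return p;
}

/* Number of elements processed by inner loop number k (0-based). */
static size_t chunk_size(const chunk_plan *p, size_t bufsize, size_t k)
{
    return (k < p->quotient) ? bufsize : p->remainder;
}
```

 For N = 10 and B = 4 this gives quotient = 2 and remainder = 2, so the three
 loops process 4, 4 and 2 elements.  When N <= B the quotient is 0 and the
 single loop processes all N elements; when N is an exact multiple of B the
 final loop has size 0 and does no work.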

 So, the code looks very similar to NOBUFFER_LOOP except the inner loop is
 replaced with::

   for (k = 0; k < quotient + 1; k++) {
       if (k == quotient) make itersize remainder size
       copy only needed items to buffer
       swap input buffers if needed
       cast input buffers if needed
       call function()
       cast outputs in buffers if needed
       swap outputs in buffers if needed
       copy only needed items back to output arrays
       update all data-pointers by strides*niter
   }
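
 The loop structure can be sketched as runnable C for one simple case.  This
 is a toy under stated assumptions, not the actual ufunc machinery: the input
 is a strided ``int`` array that must be cast to ``double``, the output is a
 well-behaved ``double`` array that is written directly (so no output buffer
 is needed), no byte swapping is required, and ``double_loop``,
 ``buffered_apply`` and ``BUFSIZE`` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define BUFSIZE 4  /* B in the notes; tiny so that the chunking is visible */

/* Stand-in for the type-specific inner function: out[i] = 2 * in[i]. */
static void double_loop(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 2.0 * in[i];
    }
}

/* Read n ints starting at `in` with a byte stride, cast them to double
 * through a fixed-size buffer, and write the results to contiguous `out`. */
static void buffered_apply(const char *in, ptrdiff_t in_stride,
                           double *out, size_t n)
{
    double buf[BUFSIZE];
    size_t quotient = n / BUFSIZE;
    size_t remainder = n % BUFSIZE;

    assert(in != NULL && out != NULL);
    for (size_t k = 0; k < quotient + 1; k++) {
        /* Every pass has size B except the last, which has the remainder. */
        size_t niter = (k == quotient) ? remainder : BUFSIZE;

        /* Copy and cast only this chunk of the input into the buffer. */
        for (size_t i = 0; i < niter; i++) {
            buf[i] = (double)*(const int *)(in + (ptrdiff_t)i * in_stride);
        }

        /* The output needs no cast, so the function writes to it directly. */
        double_loop(buf, out, niter);

        /* Update the data pointers by stride * niter. */
        in += (ptrdiff_t)niter * in_stride;
        out += niter;
    }
}
```

 With 10 elements and ``BUFSIZE`` of 4, the loop runs three times with sizes
 4, 4 and 2.  Only the input goes through the buffer; the output is never
 copied.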


 Reference counting for OBJECT arrays:

 If there are object arrays involved then loop->obj gets set to 1.  Then there are two cases:

 1) The loop function is an object loop:

    Inputs:
            - castbuf starts as NULL and then gets filled with new references.
            - The function gets called and doesn't alter the reference count in castbuf.
            - On the next iteration (next value of k), the casting function will
              DECREF what is already present in castbuf and place a new object.
            - At the end of the inner loop (the for loop over k), the final new references
              in castbuf must be DECREF'd.  If it's a scalar, then a single DECREF suffices.
              Otherwise, "bufsize" DECREFs are needed (unless there was only one
              loop, in which case "remainder" DECREFs are needed).

    Outputs:
            - castbuf contains a new reference as the result of the function call.  This
              gets converted to the type of interest.  This new reference in castbuf
              will be DECREF'd by later calls to the function.  Thus, only after the
              innermost loop do we need to DECREF the remaining references in castbuf.
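
 The castbuf life cycle for inputs in case 1 can be modelled without the
 Python C API.  The sketch below is a toy: ``toy_obj`` stands in for a
 reference-counted object, and ``toy_new``, ``toy_xdecref``,
 ``cast_to_objects`` and ``clear_castbuf`` are hypothetical names.  It only
 illustrates the ownership rules stated above, not the real casting
 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted object. */
typedef struct {
    int refcnt;
    long value;
} toy_obj;

static int live_objects = 0;  /* number of objects not yet freed */

/* Create an object; the caller owns the single new reference. */
static toy_obj *toy_new(long value)
{
    toy_obj *o = malloc(sizeof *o);

    assert(o != NULL);
    o->refcnt = 1;
    o->value = value;
    live_objects++;
    return o;
}

/* Drop one reference; NULL is allowed because castbuf starts as NULL. */
static void toy_xdecref(toy_obj *o)
{
    if (o != NULL && --o->refcnt == 0) {
        free(o);
        live_objects--;
    }
}

/* One cast pass: release what castbuf already holds, then place new objects. */
static void cast_to_objects(const long *src, toy_obj **castbuf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        toy_xdecref(castbuf[i]);
        castbuf[i] = toy_new(src[i]);
    }
}

/* After the loop over k: release the references still held in castbuf. */
static void clear_castbuf(toy_obj **castbuf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        toy_xdecref(castbuf[i]);
        castbuf[i] = NULL;
    }
}
```

 With a buffer of 4 slots and 6 elements, the first pass fills all 4 slots
 and the second pass replaces only 2 of them, so 4 references are still live
 after the loop.  That is why "bufsize" DECREFs are needed at the end
 whenever more than one pass ran.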

 2) The loop function is of a different type:

    Inputs:

            - The PyObject input is copied over to the buffer, which receives a "borrowed"
              reference.  This reference is then used but not altered by the cast
              call.  Nothing needs to be done.

    Outputs:

            - The buffer[i] memory receives the PyObject input after the cast.  This is
              a new reference, which will be "stolen" as it is copied over into memory.
              The only problem is that what is presently in memory must be DECREF'd first.
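
 The output rule for case 2 (release what is in memory, then let the memory
 take over the new reference) can be illustrated with a toy model.  Nothing
 here is the Python C API: ``toy_obj``, ``toy_new``, ``toy_xdecref`` and
 ``store_outputs`` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted object. */
typedef struct {
    int refcnt;
    double value;
} toy_obj;

static int live_objects = 0;  /* number of objects not yet freed */

/* Create an object; the caller owns the single new reference. */
static toy_obj *toy_new(double value)
{
    toy_obj *o = malloc(sizeof *o);

    assert(o != NULL);
    o->refcnt = 1;
    o->value = value;
    live_objects++;
    return o;
}

/* Drop one reference; NULL is allowed for empty output slots. */
static void toy_xdecref(toy_obj *o)
{
    if (o != NULL && --o->refcnt == 0) {
        free(o);
        live_objects--;
    }
}

/* Move n new references from the buffer into the output memory. */
static void store_outputs(toy_obj **out, toy_obj **buffer, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        toy_xdecref(out[i]);  /* release what is presently in memory     */
        out[i] = buffer[i];   /* the output slot steals the new reference */
        buffer[i] = NULL;     /* the buffer no longer owns anything       */
    }
}
```

 Because the reference is stolen rather than copied, no extra INCREF or
 DECREF is needed for the new object.  The only reference-count change is
 the DECREF of the object that the output slot held before.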