-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
Sortlib #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sortlib #89
Conversation
Fix the build to deal with the new name and location.
Make include file for the future sorting library. Use prefixed numpy types.
case prefix. The name are now of the form <sortkind>_<npy_type>.
The branch is based on the current master 5cb956e. |
Cannot offer windows, but on MacOS X i386+ppc all tests are passing without new failures. Speed also seems not affected for a typical argsort() in an application of mine. |
Thanks for that. |
When I first looked at the sorting in numpy, it seemed odd that Timsort wasn't there and the default, especially since it originated from the Python community. Have you given a thought to adding an implementation of that? My understanding is that the CPython source code provides a good example. ;) |
Timsort was optimized for sorting python lists containing python objects where the identification of runs and dealing with them offered a big gain. There was a discussion of this on the list a long time ago IIRC, but I don't recall any benchmarks. Google indicates that it is now competitive with quicksort, so If you want to give it a shot it could replace mergesort. The one difficulty you might have is that all the current sorts are based on the < comparison and you might need to add more. |
I noticed on Python3.2 there is one error that does not turn up on master (while a couple others have been fixed in your branch): ERROR: test_datetime_divide (test_datetime.TestDateTime)Traceback (most recent call last): I could not figure out if this is a new one or had been fixed in master in the meantime - when did you fork your branch? |
I see you're structuring this as a a library, which I guess would be callable from plugins linking to NumPy since npy_sort.h is in the public include directory. I'd like to suggest a slightly different tact, since I don't think this scales pleasantly especially towards adding custom data types. At the very least, I think it would be better to keep this an internal implementation detail until how it relates to the other aspects of numpy data types is hashed out. Regarding the API, I'm not sure having a separate functions exposed for each of the data types is a good idea. Instead of various sort_, just a few core API calls like PyArray_RawSort(int type_num, char *data, npy_intp size), which internally do something like _sort_function_table[type_num](data, size), might be better. This would also allow for some pluggable extensibility with custom data types to be added, for instance the dtype could have a place to put soft functions like the ones in the _sort_function_table. |
On the first point, I agree, I felt it somewhat dangerous to expose an API that we would probably want to change at some point and your suggestion tips the balance. I'll move the header into the private area for the time being. Any suggestion as to what to do with the library? On the second point, I was thinking along similar lines as the size of the include was a bit shocking. Although I wasn't thinking of abstracting a much as you suggest. Sounds like a good idea. |
I'm not sure PyArray_RawSort(int type_num, char *data, npy_intp size) provides quite enough information for the general case that would also include strings and void types that have variable lengths. User defined types might also want to add a couple of functions for compare. Swap/copy can be handled from the element size. Perhaps a descriptor could provide that information? How would you suggest changing the present initialization fo arraytypes that use explicit pointers to relevant sort functions? It would seem that with a soft interface those function pointers could be dispensed with and the calling function be made responsible. What does RawSort imply in your way of thinking? |
The model I would suggest is the replacement for the casting mechanism I've created in lowlevel_strided_loops and dtype_transfer. I've kept this private for the time being, because it needs to evolve further, but the basic idea is right. Here's the documentation in the header: https://github.com/numpy/numpy/blob/master/numpy/core/src/private/lowlevel_strided_loops.h#L122 This approach is good if you would want to sort lots of data which is the same dtype, and possibly the same stride. For the dtype transfer, I also made a simple wrapper which allows for the straightforward casting of one raw data buffer to another: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/dtype_transfer.c#L3238 One of the primary users of the dtype transfer function is the routine to copy an array: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L2661 I could imagine the sort API be structured similar to this, with a sort function pointer + void * data convention similar to the dtype transfer one: https://github.com/numpy/numpy/blob/master/numpy/core/src/private/lowlevel_strided_loops.h#L10 then, a primary API function which generates a sort function based on an input stride (NPY_INTP_MAX to indicate unspecialized, just like in the casting), and a dtype object, and either returns NULL for the simple data types, or some internal data which stores data about how to do comparisons to lexicographically sort structured dtypes, or any other data specific to the dtype that needs sorting. A simplified interface like CastRawArrays could be provided as well. |
The sort and argsort pointers in the PyArray_ArrFuncs need to stay as they are at the moment for ABI compatibility, but that kind of detail needs to be removed from the API/ABI completely, it should be completely opaque to plugins that link against NumPy. |
OK about the current pointers. The idea of this pull was to make a starting point for further development and I was going to start another branch for that. So perhaps it could go in as is if it works and then API could be developed after that. |
That sounds good to me. It still installed libnpysort.a and an npysort.ini in the destination, should that remain private as well, just being statically linked, until the .h file becomes public? |
If I can figure out how to make distutils do it. I'd like to delete the dummy module used to force distutils to generate the configuration files also... |
#define CFLOAT_SWAP(a,b) {npy_cfloat tmp = (b); (b)=(a); (a) = tmp;} | ||
#define CDOUBLE_SWAP(a,b) {npy_cdouble tmp = (b); (b)=(a); (a) = tmp;} | ||
#define CLONGDOUBLE_SWAP(a,b) {npy_clongdouble tmp = (b); (b)=(a); (a) = tmp;} | ||
#define INTP_SWAP(a,b) {npy_intp tmp = (b); (b)=(a); (a) = tmp;} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason INTP needs its own methods? I thought it was just an alias for the appropriate int type.
nm, I think I answered my own question - it's for the argsort methods. Might be worth a comment though.
Looks like a good start to me, feel free to merge. |
Pushed. |
Merge in ~STEPAN.SINDELAR_ORACLE.COM/numpy-hpy from mq/GR-40889 to labs-hpy-port * commit '23216016a53c3788a1f4b7284ca1fa030dd1587b': address comments address comments hacky bug fix (revert me) bug fix bug fixes and hpy abort clean up port PyArray_GenericReduceFunction to HPy bug fix use free instead of PyMem_Free clean up port PyArray_EnsureXXArray to HPy missing PyArray_PyXXXAbstractDType initializations HPyArray_MaskedStridedUnaryOp require hpy ctx
feat: Add vaba_s8
These patches put the numpy sorting functions into a library, remove the _sort module, and initialize the arraytypes with the corresponding sort functions at the time they are initialized. There are currently no sorting functions for datetime or timedelta, but they could easily be added. These patches need testing on windows to make sure the new library is installed and links as it should.