Skip to content

Sortlib #89

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 21, 2011
Merged

Sortlib #89

merged 14 commits into from
Jun 21, 2011

Conversation

charris
Copy link
Member

@charris charris commented Jun 17, 2011

These patches put the numpy sorting functions into a library, remove the _sort module, and initialize the arraytypes with the corresponding sort functions at the time they are initialized. There are currently no sorting functions for datetime or timedelta, but they could easily be added. These patches need testing on windows to make sure the new library is installed and links as it should.

@charris
Copy link
Member Author

charris commented Jun 17, 2011

The branch is based on the current master 5cb956e.

@dhomeier
Copy link
Contributor

Cannot offer windows, but on MacOS X i386+ppc all tests are passing without new failures. Speed also seems not affected for a typical argsort() in an application of mine.

@charris
Copy link
Member Author

charris commented Jun 17, 2011

Thanks for that.

@mwiebe
Copy link
Member

mwiebe commented Jun 17, 2011

When I first looked at the sorting in numpy, it seemed odd that Timsort wasn't there and the default, especially since it originated from the Python community. Have you given a thought to adding an implementation of that? My understanding is that the CPython source code provides a good example. ;)

http://en.wikipedia.org/wiki/Timsort

@charris
Copy link
Member Author

charris commented Jun 17, 2011

Timsort was optimized for sorting python lists containing python objects where the identification of runs and dealing with them offered a big gain. There was a discussion of this on the list a long time ago IIRC, but I don't recall any benchmarks. Google indicates that it is now competitive with quicksort, so If you want to give it a shot it could replace mergesort. The one difficulty you might have is that all the current sorts are based on the < comparison and you might need to add more.

@dhomeier
Copy link
Contributor

I noticed on Python3.2 there is one error that does not turn up on master (while a couple others have been fixed in your branch):

ERROR: test_datetime_divide (test_datetime.TestDateTime)

Traceback (most recent call last):
File "/Users/derek/lib/python3.2/site-packages/numpy/core/tests/test_datetime.py", line 762, in test_datetime_divide
assert_equal(tda / tdb, 6.0 / 9.0)
TypeError: internal error: could not find appropriate datetime inner loop in true_divide ufunc

I could not figure out if this is a new one or had been fixed in master in the meantime - when did you fork your branch?

@mwiebe
Copy link
Member

mwiebe commented Jun 17, 2011

I see you're structuring this as a a library, which I guess would be callable from plugins linking to NumPy since npy_sort.h is in the public include directory. I'd like to suggest a slightly different tact, since I don't think this scales pleasantly especially towards adding custom data types. At the very least, I think it would be better to keep this an internal implementation detail until how it relates to the other aspects of numpy data types is hashed out.

Regarding the API, I'm not sure having a separate functions exposed for each of the data types is a good idea. Instead of various sort_, just a few core API calls like PyArray_RawSort(int type_num, char *data, npy_intp size), which internally do something like _sort_function_table[type_num](data, size), might be better. This would also allow for some pluggable extensibility with custom data types to be added, for instance the dtype could have a place to put soft functions like the ones in the _sort_function_table.

@charris
Copy link
Member Author

charris commented Jun 18, 2011

On the first point, I agree, I felt it somewhat dangerous to expose an API that we would probably want to change at some point and your suggestion tips the balance. I'll move the header into the private area for the time being. Any suggestion as to what to do with the library?

On the second point, I was thinking along similar lines as the size of the include was a bit shocking. Although I wasn't thinking of abstracting a much as you suggest. Sounds like a good idea.

@charris
Copy link
Member Author

charris commented Jun 18, 2011

I'm not sure PyArray_RawSort(int type_num, char *data, npy_intp size) provides quite enough information for the general case that would also include strings and void types that have variable lengths. User defined types might also want to add a couple of functions for compare. Swap/copy can be handled from the element size. Perhaps a descriptor could provide that information?

How would you suggest changing the present initialization fo arraytypes that use explicit pointers to relevant sort functions? It would seem that with a soft interface those function pointers could be dispensed with and the calling function be made responsible.

What does RawSort imply in your way of thinking?

@mwiebe
Copy link
Member

mwiebe commented Jun 18, 2011

The model I would suggest is the replacement for the casting mechanism I've created in lowlevel_strided_loops and dtype_transfer. I've kept this private for the time being, because it needs to evolve further, but the basic idea is right. Here's the documentation in the header:

https://github.com/numpy/numpy/blob/master/numpy/core/src/private/lowlevel_strided_loops.h#L122

This approach is good if you would want to sort lots of data which is the same dtype, and possibly the same stride. For the dtype transfer, I also made a simple wrapper which allows for the straightforward casting of one raw data buffer to another:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/dtype_transfer.c#L3238

One of the primary users of the dtype transfer function is the routine to copy an array:

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L2661

I could imagine the sort API be structured similar to this, with a sort function pointer + void * data convention similar to the dtype transfer one:

https://github.com/numpy/numpy/blob/master/numpy/core/src/private/lowlevel_strided_loops.h#L10

then, a primary API function which generates a sort function based on an input stride (NPY_INTP_MAX to indicate unspecialized, just like in the casting), and a dtype object, and either returns NULL for the simple data types, or some internal data which stores data about how to do comparisons to lexicographically sort structured dtypes, or any other data specific to the dtype that needs sorting. A simplified interface like CastRawArrays could be provided as well.

@mwiebe
Copy link
Member

mwiebe commented Jun 18, 2011

The sort and argsort pointers in the PyArray_ArrFuncs need to stay as they are at the moment for ABI compatibility, but that kind of detail needs to be removed from the API/ABI completely, it should be completely opaque to plugins that link against NumPy.

@charris
Copy link
Member Author

charris commented Jun 18, 2011

OK about the current pointers. The idea of this pull was to make a starting point for further development and I was going to start another branch for that. So perhaps it could go in as is if it works and then API could be developed after that.

@mwiebe
Copy link
Member

mwiebe commented Jun 18, 2011

That sounds good to me. It still installed libnpysort.a and an npysort.ini in the destination, should that remain private as well, just being statically linked, until the .h file becomes public?

@charris
Copy link
Member Author

charris commented Jun 18, 2011

If I can figure out how to make distutils do it. I'd like to delete the dummy module used to force distutils to generate the configuration files also...

#define CFLOAT_SWAP(a,b) {npy_cfloat tmp = (b); (b)=(a); (a) = tmp;}
#define CDOUBLE_SWAP(a,b) {npy_cdouble tmp = (b); (b)=(a); (a) = tmp;}
#define CLONGDOUBLE_SWAP(a,b) {npy_clongdouble tmp = (b); (b)=(a); (a) = tmp;}
#define INTP_SWAP(a,b) {npy_intp tmp = (b); (b)=(a); (a) = tmp;}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason INTP needs its own methods? I thought it was just an alias for the appropriate int type.

nm, I think I answered my own question - it's for the argsort methods. Might be worth a comment though.

@mwiebe
Copy link
Member

mwiebe commented Jun 20, 2011

Looks like a good start to me, feel free to merge.

@charris charris merged commit e4d790f into numpy:master Jun 21, 2011
@charris
Copy link
Member Author

charris commented Jun 21, 2011

Pushed.

fangerer pushed a commit to hpyproject/numpy-hpy that referenced this pull request Dec 12, 2022
Merge in ~STEPAN.SINDELAR_ORACLE.COM/numpy-hpy from mq/GR-40889 to labs-hpy-port

* commit '23216016a53c3788a1f4b7284ca1fa030dd1587b':
  address comments
  address comments
  hacky bug fix (revert me)
  bug fix
  bug fixes and hpy abort clean up
  port PyArray_GenericReduceFunction to HPy
  bug fix
  use free instead of PyMem_Free
  clean up
  port PyArray_EnsureXXArray to HPy
  missing PyArray_PyXXXAbstractDType initializations
  HPyArray_MaskedStridedUnaryOp require hpy ctx
luyahan pushed a commit to plctlab/numpy that referenced this pull request Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants