BUG: add exception to searchsorted incompatible inputs #26650

luxedo · 2024-06-09T20:58:26Z

Closes #24032 by throwing an exception when searchsorted receives incompatible types

luxedo · 2024-06-10T01:39:05Z

Initial PR failed because of regression tests from more than 10 years ago. There are 3 tests:

The first regression test passes:

numpy/numpy/_core/tests/test_regression.py

Lines 859 to 862 in 5aaede6

    
    def test_searchsorted_variable_length(self):
        x = np.array(['a', 'aa', 'b'])
        y = np.array(['d', 'e'])
        assert_equal(x.searchsorted(y), [3, 3])

The second test passes too:

numpy/numpy/_core/tests/test_regression.py

Lines 2017 to 2020 in 5aaede6

    
    def test_search_sorted_invalid_arguments(self):
        # Ticket #2021, should not segfault.
        x = np.arange(0, 4, dtype='datetime64[D]')
        assert_raises(TypeError, x.searchsorted, 1)

The new verification does not catch the error; the expected exception is instead raised by the getattr call inside _wrapit:

File ~/code/os/numpy/numpy/_core/fromnumeric.py:46, in _wrapit(obj, method, *args, **kwds)
     43 # As this already tried the method, subok is maybe quite reasonable here
     44 # but this follows what was done before. TODO: revisit this.
     45 arr, = conv.as_arrays(subok=False)
---> 46 result = getattr(arr, method)(*args, **kwds)
     48 return conv.wrap(result, to_scalar=False)

TypeError: '<' not supported between instances of 'tuple' and 'float'
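The message above has the same shape as the error Python itself produces when two objects of unrelated types are ordered with `<`. A minimal stdlib-only sketch of that behaviour, assuming (the traceback does not confirm this) that the structured element ends up being compared like a Python tuple. `bisect` is only a stand-in for a binary search here, not what numpy calls, and `try_search` is a hypothetical helper:

```python
import bisect

# A sorted list holding one tuple, standing in for the structured array
# np.array([('a', 1)], dtype='S1, int') used in the regression tests.
haystack = [("a", 1)]


def try_search(needle):
    """Return the insertion index, or the TypeError message if ordering fails."""
    try:
        return bisect.bisect_left(haystack, needle)
    except TypeError as exc:
        return str(exc)


# Ordering a tuple against a float is not defined, so the search fails.
message = try_search(1.2)
print(message)
```

Searching for another tuple, for example `try_search(("b", 0))`, succeeds because both operands are then of the same kind.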

The third test fails (logs) and is related to #642 and #2658.

numpy/numpy/_core/tests/test_regression.py

Lines 2069 to 2078 in 5aaede6

    
    def test_searchsorted_wrong_dtype(self):
        # Ticket #2189, it used to segfault, so we check that it raises the
        # proper exception.
        a = np.array([('a', 1)], dtype='S1, int')
        assert_raises(TypeError, np.searchsorted, a, 1.2)
        # Ticket #2066, similar problem:
        dtype = np.rec.format_parser(['i4', 'i4'], [], [])
        a = np.recarray((2,), dtype)
        a[...] = [(1, 2), (3, 4)]
        assert_raises(TypeError, np.searchsorted, a, 1)

Which now throws:

ValueError: Incompatible types for searching: a ((numpy.record, [('f0', '<i4'), ('f1', '<i4')])) and v (int64)

Summary

There's already a rudimentary type verification, in that some types simply cannot be compared, but it is not enough to catch string and integer discrepancies.

With this PR there will be two types of errors:

The old TypeError by type comparison: TypeError: '<' not supported between instances of 'tuple' and 'float'
The new ValueError given by resolve_dtypes: ValueError: Incompatible types for searching: a (str) and v (int64)

My suggestion from 647bb85 is to just use TypeError for the new exception as well, which I think makes sense. But I'm a bit bothered by the fact that there are two exceptions for handling incompatible data types.

eendebakpt · 2024-06-12T09:04:58Z

This PR has a negative impact on the performance for small arrays. A quick test:

import numpy as np
x=np.array([1, 2, 3])
y=np.array([1, 5])
z=np.searchsorted(x,y)

%timeit np.searchsorted(x,y)

Results in

main: 753 ns ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
PR: 1.23 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

(that does not mean we should not merge this PR, as it does fix a bug)
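The measurement above uses IPython's `%timeit`. A rough equivalent that runs as a plain script is sketched below with the standard `timeit` module. The repeat and loop counts are arbitrary choices, the lambda adds a little call overhead, and the script reports the best run rather than mean ± std. dev., so its numbers are only comparable between two builds measured the same way.

```python
import timeit

import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 5])

# Loops per repeat; small enough to finish quickly, large enough to average out noise.
number = 20_000
timer = timeit.Timer(lambda: np.searchsorted(x, y))

# Take the fastest of 7 repeats and convert to nanoseconds per call.
best = min(timer.repeat(repeat=7, number=number))
per_call_ns = best / number * 1e9
print(f"np.searchsorted(x, y): {per_call_ns:.0f} ns per call (best of 7)")
```

Running the same script against main and against the PR branch gives a like-for-like comparison of the per-call overhead.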

Could we add the type validation to ndarray.searchsorted? That might reduce the overhead considerably.
Are there any general guidelines on how much type checking we want to do?

luxedo · 2024-06-12T12:48:09Z

I agree!

Is there any opportunity for "private" functions and methods that already trust the argument types? If so, the conversion from Python built-in types to numpy types has to be done only once, and all the rest of the work can be done under the hood.

luxedo · 2024-06-18T01:55:51Z

Any updates on this? A very similar PR has been approved: #26667
Indeed, there's a performance drop of around 60%. Maybe placing the type validation in C code would make this faster.

mattip · 2024-07-02T20:10:32Z

Personally, I prefer the "first make it right, then make it fast" approach and am inclined to accept this PR as is. But if someone wants to dive into moving the resolve_dtypes call into C, the place to do it is in PyArray_SearchSorted:

numpy/numpy/_core/src/multiarray/item_selection.c

Line 2084 in 95073d1

PyArray_SearchSorted(PyArrayObject *op1, PyObject *op2,

The implementation would still look up np.less.resolve_dtypes using the lookup-and-cache mechanism, and then call it with the already-converted arguments.

eendebakpt · 2024-07-03T09:51:20Z

@mattip The problem addressed in the PR is on the boundary between fixing a bug and providing a better error message, so I would prefer to solve it right on the first go. I would be willing to look into the C implementation, but that will not be in the coming weeks, so if we do not want to wait, there are no objections from my side to merging this.

luxedo · 2024-07-04T15:22:41Z

I can have a go at the C code as well, but I'm very new to numpy's codebase, so a little guidance would be a big help.

numpy/_core/fromnumeric.py

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>

eendebakpt · 2024-08-13T21:08:33Z

Performance considerations for this PR might have changed a bit since #27119. I am having trouble compiling on my own system, so I cannot provide any good benchmarks though.

@luxedo Do you still want to have a go at the C code? I think (but it would be nice if a numpy developer could confirm) that you could move the check to PyArray_SearchSorted in https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L2084. The input arguments have already been converted to arrays at that point, and the dtypes are available via PyArray_DESCR.

luxedo · 2024-08-14T00:08:43Z

I'm out for a couple more weeks or so :( but I'd love to give it a go

luxedo · 2024-09-02T00:44:00Z

Hi! I'm starting to take a look and trying to understand numpy's C code better. So far I've got this:

NPY_NO_EXPORT PyObject *
PyArray_SearchSorted(PyArrayObject *op1, PyObject *op2,
                     NPY_SEARCHSIDE side, PyObject *perm)
{
    /* Check if types are comparable */
   
    // Load the 'less' ufunc to pass to py_resolve_dtypes_generic. DON'T KNOW HOW TO FIND THE ARGUMENTS.
    PyUFuncObject *ufunc_less = PyUFunc_FromFuncAndData("Don't know how to get 'less' ufunc");

    // Pack the input types in a tuple to pass to py_resolve_dtypes_generic
    PyObject *dtypes = PyTuple_Pack(2, PyArray_DESCR(op1), op2);

    // Empty tuple as kwnames
    PyObject *kwnames = PyTuple_New(0);

    // Check if types are compatible
    PyObject *resolve = py_resolve_dtypes_generic(ufunc_less, NPY_FALSE, &dtypes, 2, kwnames);
    
    // Cleanup and early return if types are not comparable
    bool incompatible_types = resolve == NULL;
    Py_DECREF(dtypes);
    Py_DECREF(kwnames);
    Py_XDECREF(resolve);
    if (incompatible_types) {
        PyErr_SetString(PyExc_TypeError, "incomparable types");
        return NULL;
    }

  ...

This won't compile yet, because I don't know how to get the less ufunc, with PyUFunc_FromFuncAndData or any other way.

Is this on the right track? Is there any other way to validate that the types can be compared?

BUG: add exception to searchsorted incompatible inputs

48122e8

luxedo marked this pull request as ready for review June 9, 2024 20:58

github-actions bot added the 00 - Bug label Jun 9, 2024

luxedo marked this pull request as draft June 9, 2024 23:26

BUG: searchsorted now throws TypeError with incompatible types

647bb85

luxedo marked this pull request as ready for review June 10, 2024 01:44

eendebakpt reviewed Jul 28, 2024


ENH: optimize searchsorted dtype validation

4c52d46

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
