Skip to content

BUG: add exception to searchsorted incompatible inputs #26650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

luxedo
Copy link
Contributor

@luxedo luxedo commented Jun 9, 2024

Closes #24032 by throwing an exception when searchsorted receives incompatible types

@luxedo luxedo marked this pull request as ready for review June 9, 2024 20:58
@luxedo luxedo marked this pull request as draft June 9, 2024 23:26
@luxedo
Copy link
Contributor Author

luxedo commented Jun 10, 2024

Initial PR failed because of regression tests from more than 10 years ago. There are 3 tests:

The first regression test passes:

def test_searchsorted_variable_length(self):
x = np.array(['a', 'aa', 'b'])
y = np.array(['d', 'e'])
assert_equal(x.searchsorted(y), [3, 3])

The second test passes too:

def test_search_sorted_invalid_arguments(self):
# Ticket #2021, should not segfault.
x = np.arange(0, 4, dtype='datetime64[D]')
assert_raises(TypeError, x.searchsorted, 1)

The new verification does not catches the error and the expected exception is thrown in a getattr inside _wrapit:

File ~/code/os/numpy/numpy/_core/fromnumeric.py:46, in _wrapit(obj, method, *args, **kwds)
     43 # As this already tried the method, subok is maybe quite reasonable here
     44 # but this follows what was done before. TODO: revisit this.
     45 arr, = conv.as_arrays(subok=False)
---> 46 result = getattr(arr, method)(*args, **kwds)
     48 return conv.wrap(result, to_scalar=False)

TypeError: '<' not supported between instances of 'tuple' and 'float'

The third test fails (logs) and is related to #642 and #2658.

def test_searchsorted_wrong_dtype(self):
# Ticket #2189, it used to segfault, so we check that it raises the
# proper exception.
a = np.array([('a', 1)], dtype='S1, int')
assert_raises(TypeError, np.searchsorted, a, 1.2)
# Ticket #2066, similar problem:
dtype = np.rec.format_parser(['i4', 'i4'], [], [])
a = np.recarray((2,), dtype)
a[...] = [(1, 2), (3, 4)]
assert_raises(TypeError, np.searchsorted, a, 1)

Which now throws:

ValueError: Incompatible types for searching: a ((numpy.record, [('f0', '<i4'), ('f1', '<i4')])) and v (int64)

 

Summary

There's already a rudimentary type verification by means of not being able to compare types, but is not enough to catch string and integer discrepancies.

With this PR there will be two types of errors:

  1. The old TypeError by type comparison: TypeError: '<' not supported between instances of 'tuple' and 'float'
  2. The new ValueError given by resolve_dtypes: TypeError: Incompatible types for searching: a (str) and v (int64)

My suggestion from 647bb85 is to just use TypeError for the new exception as well, which I think makes sense. But I'm a bit bothered by the fact that there are two exceptions for handling incompatible data types.

@luxedo luxedo marked this pull request as ready for review June 10, 2024 01:44
@eendebakpt
Copy link
Contributor

This PR has a negative impact on the performance for small arrays. A quick test:

import numpy as np
x=np.array([1, 2, 3])
y=np.array([1, 5])
z=np.searchsorted(x,y)

%timeit np.searchsorted(x,y)

Results in

main: 753 ns ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
PR: 1.23 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

(that does not mean we should not merge this PR, as it does fix a bug)

  • Could we add the type validation to ndarray.searchsorted? That might make the overhead a lot less
  • Are there any general guidelines on how much type checking we want to do?

@luxedo
Copy link
Contributor Author

luxedo commented Jun 12, 2024

I agree!

Is there any opportunity for "private" functions and methods that already trust the argument types? If so, the work has to be done only once to bring the Python built-in types to numpy types, all the rest of the work can be done under the hood.

@luxedo
Copy link
Contributor Author

luxedo commented Jun 18, 2024

Any updates on this? A very similar PR has been approved: #26667
Indeed there's a performance drop of around 60%, maybe placing the type validation in C code would make this faster.

@mattip
Copy link
Member

mattip commented Jul 2, 2024

Personally, I prefer the "first make it right, then make it fast" approach and am inclined to accept this PR as is. But if someone wants to dive into moving the resolve_dtypes call into C, the place to do it is in PyArray_SearchSorted

PyArray_SearchSorted(PyArrayObject *op1, PyObject *op2,

The implementation would still look up np.less.resolve_dtypes using the lookup-and-cache mechanism, and then call it with the already-converted arguments.

@eendebakpt
Copy link
Contributor

@mattip The problem addressed in the PR is a bit on the boundary between fixing an bug, and providing a better error message, so I would prefer to solve it right in the first go. I would be willing to look in to the C implementation, but that will not be in the comings weeks, so if we do not want to wait there are no objections from my side to merge this.

@luxedo
Copy link
Contributor Author

luxedo commented Jul 4, 2024

I can give a go at the C code as well but I'm very new to numpy's codebase, so a little guidance would be of much help.

Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
@eendebakpt
Copy link
Contributor

Performance considerations for this PR might have changed a bit since #27119. I am having trouble compiling on my own system, so I cannot provide any good benchmarks though.

@luxedo Do you still want to have a go at the C code? I think (but it would be nice if a numpy developer could confirm) that you could move the check to PyArray_SearchSorted in https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L2084. The input arguments have already been converted to arrays, and the dtypes are available via PyArray_DESCR

@luxedo
Copy link
Contributor Author

luxedo commented Aug 14, 2024

I'm out for a couple more weeks or so :( but I'd love to give it a go

@luxedo
Copy link
Contributor Author

luxedo commented Sep 2, 2024

Hi! I'm starting taking a look and trying to understand numpy's C code better. So far I got this:

NPY_NO_EXPORT PyObject *
PyArray_SearchSorted(PyArrayObject *op1, PyObject *op2,
                     NPY_SEARCHSIDE side, PyObject *perm)
{
    /* Check if types are comparable */
   
    // Load 'less' ufunc to pass to py_resolve_dtypes_generic. DON'T KNOW HOW TO FIND THE ARGUMENTS.
    PyUFuncObject *ufunc_less = PyUFunc_FromFuncAndData("Don't know how to get 'less' ufunc");
    
    // Pack input types an a tuple to pass to py_resolve_dtypes_generic
    PyObject *dtypes = PyTuple_Pack(2, PyArray_DESCR(op1), op2);

    // Empty tuple as kwnames
    PyObject *kwnames = PyTuple_New(0);
   
     // Check if types are compatible
    PyObject *resolve = py_resolve_dtypes_generic(ufunc_less, NPY_FALSE, &dtypes, 2, kwnames);
    
    // Cleanup and early return if types are not comparable
    bool incompatible_types = resolve == NULL;
    Py_DECREF(dtypes);
    Py_DECREF(kwnames);
    Py_XDECREF(resolve);
    if (incompatible_types) {
        PyErr_SetString(PyExc_TypeError, "incomparable types");
        return NULL;
    }

  ...

This won't compile yet. Because I don't know how to get the less ufunc with PyUFunc_FromFuncAndData or any other way.

Is this in the correct path? Is there any other way to validate that the types can be compared?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: numpy.histogram should raise an error for string arrays
3 participants