[ENH](WIP) new ufunc implementations for object arrays #6320


Closed
ahaldane wants to merge 15 commits from the objmath branch

Conversation

ahaldane
Member

This PR overhauls the ufunc implementations for the object dtype, to follow the rules I will paste in the comment below.

At this point I am posting it to see whether there is enough interest to make it worth the effort of finishing unit tests, and to discuss what the ufunc implementations should be. I still need to do tests, docs, and bug-hunting.

motivation

Currently many ufuncs do not work on object arrays (eg all transcendental functions), and some that do work can be improved (see recent discussion of np.sign). With this PR all (or most...) missing ufuncs are implemented using the math and cmath modules as a fallback.

request for comments

I am sure I have missed funny edge cases in the ufunc implementations, so I would love more eyes looking at them. The current implementations are in _objectmath.py.

Some issues in particular: I am not sure how to implement np.nextafter, np.spacing or np.cbrt in pure python. Also, I changed the way the logical_ ufuncs work: in current numpy, logical_and (for example) behaves like x and y (i.e., it returns either x or y), while I changed it to True if (x and y) else False (so it returns a boolean; see the sketch below). The new way behaves more like these ufuncs behave for float/int arrays.
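A minimal sketch of that behaviour change for logical_and on object arrays (array values chosen only for illustration; the loop here just mimics what the object-dtype inner loop does element by element):

import numpy as np

a = np.array([0, 3], dtype=object)
b = np.array([2, 5], dtype=object)

# current behaviour: acts like `x and y`, so one of the operands comes back
old_style = [x and y for x, y in zip(a, b)]        # [0, 5]
# behaviour proposed in this PR: a plain boolean per element
new_style = [bool(x and y) for x, y in zip(a, b)]  # [False, True]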

Also, while we are updating the ufunc implementation, maybe this is the time for back-compat breaks. For example, should the fallback to (eg) sqrt be to call the elem.sqrt method, or might we rename it to elem.npy_sqrt to avoid name conflicts?

@ahaldane
Member Author

ahaldane commented Sep 15, 2015

EDIT: This comment is obsolete. See comment #6320 (comment) below for a more up-to-date description




Here is what this PR does. (This can go into Docs later)

ufuncs and object arrays

Ufuncs operate specially on object arrays, since the objects contained in the array may not be numpy types. To evaluate a ufunc numpy tries a series of strategies in order for each element of the array (for unary ufuncs) or for each pair of elements from the arrays (for binary ufuncs):

  1. If there is a Python "Special method" which corresponds to the ufunc numpy uses it to evaluate the ufunc. This step only applies to the ufuncs add, subtract, multiply, divide, true_divide, floor_divide, remainder, negative, [positive], power, mod, absolute, bitwise_and, bitwise_or, bitwise_xor, invert, left_shift, right_shift, greater, greater_equal, less, less_equal, not_equal, equal.
  2. If any of the arguments is a numpy scalar or ndarray then numpy applies the ufunc directly, eg np.sqrt(elem).
  3. If the first argument has a method with the same name as the ufunc then that method will be called, eg elem.sqrt(). Binary ufuncs such as logaddexp(x,y) will call x.logaddexp(y).
  4. Numpy will then fall back to a Python implementation of the ufunc which may use the math and cmath modules, eg math.sqrt(elem). This includes all ufuncs not evaluated in step 1. Details of the Python implementations are given in [the file _objectmath.py].

If no step succeeds, a TypeError error is raised. [This is not implemented yet, not sure if we want this]

This provides a rough way of creating new numeric types compatible with numpy, as you can define a class which implements all ufuncs missing from step 1 as methods, and then create an object array containing elements of your type (see the sketch below). However, note that creating a user-defined type (see "User-Defined Types") is often preferable. User-defined types will be much faster and you will have control over casting and coercion.
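As a rough illustration of steps 1 and 3, a toy class (hypothetical, not part of this PR) whose elements work inside an object array:

import numpy as np

class MyNumber:
    # illustrative element type for an object array
    def __init__(self, v):
        self.v = v
    def __add__(self, other):    # picked up via step 1 (special method)
        return MyNumber(self.v + getattr(other, 'v', other))
    def sqrt(self):              # picked up via step 3 (method named after the ufunc)
        return MyNumber(self.v ** 0.5)
    def __repr__(self):
        return "MyNumber(%r)" % self.v

a = np.array([MyNumber(4.0), MyNumber(9.0)], dtype=object)
a + a        # elementwise __add__
np.sqrt(a)   # elementwise elem.sqrt()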

@njsmith
Member

njsmith commented Sep 15, 2015

Can you post this to the list, so more eyes will see it?

@ahaldane
Member Author

Sure.

Also:

  • I don't know why some python3 builds are failing. It doesn't like the macro expansion, but I did it exactly the same as the other ufuncs in that file... I'll have to figure it out later.
  • Part of the motivation for the 4 steps above is that all ufuncs become customizable: step 1 ufuncs by defining the corresponding special method, and everything else through step 3. In current numpy there are many ufuncs handled as in step 1 (eg sign, maximum) which the user cannot override.


def signbit(x):
    if isinstance(x, complex):
        return cmath.copysign(1, x) < 0
Contributor

cmath.copysign isn't a thing.

@ewmoore
Contributor

ewmoore commented Sep 15, 2015

#368 has some earlier discussion of this idea.

@seberg
Member

seberg commented Sep 16, 2015

I hate to do this, but my question is: what about, for example, decimals?
EDIT: OK, this is only a question for those functions calling math or cmath explicitly.

@ahaldane
Member Author

That's exactly the kind of thing I was worried about missing. I hadn't thought of Decimals.

It looks like they are already mostly handled.

  • add/multiply are covered by special methods (step 1).
  • The logical_ funcs, log, exp, sqrt and more are covered in step 3 since they are Decimal methods.
  • The math module accepts them (converting to float), covering many remaining ufuncs (sort of).

Some ufuncs could be improved, eg using the is_nan, is_infinite or copy_sign methods of Decimal, or using the exp method to implement expm1 (a rough sketch follows).
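For instance, a Decimal-aware fallback could stay in Decimal arithmetic instead of casting to float; this is only a sketch using the standard decimal module, not code from this PR:

from decimal import Decimal

def expm1_decimal(x):
    # stays in Decimal arithmetic via Decimal.exp(), though it loses the
    # extra accuracy near 0 that a dedicated expm1 would provide
    return x.exp() - 1

def copysign_decimal(x, y):
    # Decimal.copy_sign is the analogue of np.copysign
    return x.copy_sign(y)

expm1_decimal(Decimal("0.001"))            # ~ Decimal('0.001000500166708341668...')
copysign_decimal(Decimal(1), Decimal(-5))  # Decimal('-1')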

But, should we even try to support Decimal? It doesn't play well with the other numeric types. For example, Decimal(1) + 1.0 fails. Perhaps if someone wants to use Decimal object arrays, they should subclass/wrap Decimal in a way to make numpy happy.

@seberg
Member

seberg commented Sep 16, 2015

The thing I am worried about is not them not working (I am fine with that, really). What I have a bit of a bad feeling about is silently casting decimals to float, which to me feels wrong (it is an unsafe cast). Ideally, we would probably make it easy to create a full dtype for it, but the "object" dtype should play well with them in any case.

In other words, I am not quite comfortable with calling into math or cmath unless we get a vanilla python float/complex type. Decimals have exp, but you might end up calling math.exp on them, because they are not a complex type (you should use numbers.Complex there in any case, but maybe being explicit is good - just also check for float explicitly before calling into math?).

@ahaldane
Member Author

The same problem arises for people using the mpmath types. My original thought was that if an object implemented __float__ it was probably happy for us to convert to float if needed.

But now I am thinking you are right, it would be confusing. Maybe we should have something like

import math
import cmath

def _notimplemented_msg(fname, etype):
    return ("fallback for ufunc '{}' is not implemented "
            "for objects of type {}").format(fname, etype)

def exp(x):
    if isinstance(x, complex):
        return cmath.exp(x)
    if isinstance(x, (int, long, float)):  # 'long' exists on Python 2 only
        return math.exp(x)
    raise TypeError(_notimplemented_msg('exp', type(x)))

(Note that isinstance(True, int) is True, which I think is what we want)

@ahaldane
Member Author

Also, what do you think of fallback implementations in terms of other ufuncs? Eg, what if we have

def expm1(x):
    if isinstance(x, (int, long, float)):
        return math.expm1(x)
    return _ufunc_exp(x) - 1

where _ufunc_exp is essentially np.exp, so it goes back through the ufunc object dispatch code. This way, np.expm1(Decimal(1)) will effectively call Decimal(1).exp() - 1.

On the plus side, it means the user doesn't need to implement expm1 if exp is implemented. On the other hand, it silently uses a less precise calculation.
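As a concrete illustration with plain floats (not Decimal), the generic exp(x) - 1 route loses exactly the accuracy expm1 exists to provide for small arguments:

import math

math.expm1(1e-12)    # 1.0000000000005e-12   (accurate)
math.exp(1e-12) - 1  # 1.000088900582341e-12 (cancellation error in the last digits)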

@njsmith
Member

njsmith commented Sep 16, 2015

It sounds like a lot of the use cases that this is aiming at would be better handled by a new dtype system that lets us have an actual Decimal or mpmath dtype, instead of trying to guess what to do with arbitrary objects in a one-size-fits-all way. (Which doesn't necessarily mean we should block bandaid solutions now -- bandaids are good first aid! -- but something to keep in mind while thinking about this.)

@ahaldane
Member Author

Updated. Everyone's comments make me think a lot of ufuncs should just fail for types that aren't builtin (bool/int, long, float or complex), so now they raise TypeErrors. It's not 100% consistent though - isfinite does not need to be implemented as long as isinf and isnan are. log2 is implemented in terms of log. fmax, fmin are similar. I'm still on the fence about these.

Ufuncs which exist to increase precision (expm1, log1p, logaddexp) will raise TypeErrors for non-builtin types. I don't want to get into the business of guessing a numerically accurate implementation when I don't know what the input type is.
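A rough sketch of the partial consistency described above, with math-module stand-ins for the other fallbacks in the module (the names here are only illustrative):

import math

def isinf(x): return math.isinf(x)   # stand-in for the module's isinf fallback
def isnan(x): return math.isnan(x)   # stand-in for the module's isnan fallback
def log(x):   return math.log(x)     # stand-in for the module's log fallback

def isfinite(x):
    # needs no fallback of its own: defined via isinf and isnan
    return not (isinf(x) or isnan(x))

def log2(x):
    # defined in terms of log, so any type whose log fallback works gets log2 for free
    return log(x) / log(2)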

@seberg
Member

seberg commented Sep 17, 2015

@njsmith, while I do think that a lot of this would fit much better with some better dtype system, we should have reasonable stuff in place for plain object arrays (the question is what that is ;)).
I can live with specializing python float/long/int as long as we do not cast things like decimals. I think this has to be run by the list (which I guess it already was, but the list is hanging)....

I think it might be better if we do not use isinstance, but actually type(blah) in (int, long, float). But I am willing to concede here if everyone thinks it is just so much more practical. Another example of these kinds of problems:

In [3]: m = np.finfo(np.float128).max
In [4]: import math
In [5]: math.isinf(m)
Out[5]: True

So my question is: unless it is completely obvious that one thing is right (sign seems fine to me for example, though then again, strings ^^, so maybe a "number" check....), how about we always check the type exactly? That feels pretty future-proof to me in terms of not breaking down at some point, and it probably solves almost all real-life cases.

@ahaldane
Member Author

I think you're right about not using isinstance. That will be a problem for mpmath types too, eg math.isinf(mpmath.mpf(10)**800) says True.

Also sign currently doesn't work for complex numbers. Actually np.sign is a little weird in that case already:

>>> np.sign(0)
0
>>> np.sign(0+1j)
(1+0j)

@argriffing
Contributor

Actually np.sign is a little weird in that case already

see #1848

@ahaldane
Member Author

Ah! But now that I get down to implementing it, is this what we want?

>>> np.sign(np.inf*1j)
(nan+0j)

I would expect +1 based on wikipedia's definition since np.sign(np.inf) gives +1.

Edit: Oh, I think this is Python's problem, not numpy's. np.inf*1j gives nan+infj.

@pv
Member

pv commented Sep 17, 2015

@ahaldane: np.inf*1j -> (nan+infj). In C99 there's only one complex infinity (it has multiple representations). For z/sqrt(z**2), infinity is a singular point similar to 0, so sign is probably not defined there. Unlike 0, it probably cannot be defined as 0 either...

@pv
Member

pv commented Sep 17, 2015

But it probably would also be fine to define it in terms of whatever limiting procedure, which might be less surprising for some cases.

@ahaldane
Member Author

@pv, that makes sense. But also, in python and numpy I can specify the type of infinity:

>>> np.sign(complex(0, -np.inf))
(-1+0j)

But this is not very accurate then:

>>> repr(np.complex64(complex(0, -np.inf)))
'-inf*j'

@ahaldane force-pushed the objmath branch 3 times, most recently from 31998e0 to b15f145 on September 17, 2015
@ahaldane
Member Author

All right, I updated all the ufunc implementations based on recent discussion, fixed small bugs in the C code, and also added a rough unit test.

The unit test checks that the ufunc works similarly for numpy types and python builtin types.

However, it turns out a bunch of my ufunc implementations failed the test. I fixed some of them, but others I've just disabled for now in the unit test. See the comments in the unit test.

@ahaldane
Member Author

If we are only going to handle bool, int, long, float, complex, it is starting to make sense to me that we should just do this for all ufuncs:

def ufunc(x):
    if type(x) in (bool, int, long, float, complex):
        return np.ufunc(x).item()
    raise TypeError

@njsmith
Member

njsmith commented Sep 17, 2015

Interestingly, that definition also makes sense for ndarrays and eventually anything with __numpy_ufunc__.

@seberg
Member

seberg commented Sep 18, 2015

Actually, I think it would likely make sense to define "object" arrays in the sense of "try to call the type-specific ufunc -- if we know it -- on each element" + some fallback (unless they map to a python operator, maybe).
Just to annoy you, we might want to support numpy dtypes as well. Whether or not they may be cast, I am not sure ;). The ufuncs have slightly different behaviours sometimes I think, but yes, the only objection to it I can think of is that it is possibly a bit slow.

@njsmith
Member

njsmith commented Sep 18, 2015

For most objects that support both + and np.add they should do the same thing. (Whether this should be true of all objects was one of the key questions unresolved from #5844.) There are some objects that only support +, and under two of the three main approaches in #5844, anything that supports np.add should generally do the same thing for +. So it might make sense to just use binops unconditionally for the ufuncs that have corresponding binops, and ufuncs otherwise.

This does get more confusing though if we go with "option 3" there and allow + and np.add to do arbitrarily different things -- in this case it'd be sort of weird to call add on two object arrays and have it delegate to +, since logically those would be unrelated operations.

@seberg
Member

seberg commented Sep 18, 2015

Too bad I refuse to read that discussion ;). Well, I think one reason to not have it the same would be to have elementwise vs. some other logic (such as concatenate)? In that sense, I think if we call + elementwise inside np.add (instead of the "elementwise np.add") that seems fine.
There would maybe be no way to "call elementwise add recursively" unless you specifically define the ufunc for your type.
The only thing I can think of that would break this type of logic is when you start wishing to propagate things like the out (or other) keyword arguments. However, all of these kinds of fancy tricks should be possible if you implement __numpy_ufunc__ for the container, or probably more often have some specific dtype functions for it.

Which actually gets me philosophizing. Are containers maybe just dtype-scalars in some sense? Is there actually any difference between how a ufunc knows how to work on a single element and how __numpy_ufunc__ would know how to work on a single "container"?

@ahaldane force-pushed the objmath branch 3 times, most recently from 5f973f6 to c1740de on April 28, 2017
@ahaldane
Member Author

rebased. Still needs more work in the tests.

Now that the __array_ufunc__ code is in (hooray!), I can try using it here to replace the named-method hack (step 3 in my last comment). I'll look into it later.
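For reference, a minimal sketch of what overriding a ufunc through __array_ufunc__ looks like for a standalone object (the class is hypothetical and only illustrates the protocol; whether the object-dtype inner loops should consult it per element is exactly the open question here):

import numpy as np

class Interval:
    # toy value that overrides np.sqrt via __array_ufunc__
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if ufunc is np.sqrt and method == '__call__':
            return Interval(self.lo ** 0.5, self.hi ** 0.5)
        return NotImplemented
    def __repr__(self):
        return "Interval(%r, %r)" % (self.lo, self.hi)

np.sqrt(Interval(4.0, 9.0))   # -> Interval(2.0, 3.0), dispatched via __array_ufunc__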

if (ret != NULL) {
    arr = (PyArrayObject*)PyArray_FromScalar(ret, NULL);
    Py_DECREF(ret);
    ret = PyArray_GETITEM(arr, PyArray_DATA(arr));
Member

What's going on here?

Member Author

For pure-python float and complex objects we call the ufunc on the object, which converts to a numpy type, eg float -> float64. Then we convert back to python types by calling .item(), but this method only exists on arrays so we need to convert the scalar to an array first, then call PyArray_GETITEM.
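Roughly the Python-level equivalent of that round trip (illustrative only):

import numpy as np

elem = 2.0             # pure-python float taken from an object array
ret = np.sqrt(elem)    # ufunc applied directly -> numpy scalar (np.float64)
arr = np.asarray(ret)  # wrap the scalar in a 0-d array
out = arr.item()       # back to a pure-python float, 1.4142135623730951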

    return bool(x and y)

def logical_or(x, y):
    return bool(x or y)
Member

Can we leave the bool to a separate PR, since that's a behaviour change?

Member Author

@ahaldane Apr 28, 2017

Actually I want to go even further the other way. logical_or is documented to guarantee it returns a boolean array as output.

I've already forced isinf, isnan, isfinite and signbit to return booleans, but it looks like I forgot to do so for the logical_ ufuncs (in generate_umath.py).

Member Author

@ahaldane Apr 28, 2017

But it's true maybe I can leave that for another PR... if I go all the way through with that idea I'd also have to make the comparison operators (less and so on) return bool arrays.

Edit: Never mind, I forgot: I already made less and the other comparisons return booleans.

Member

I'd be much keener to leave behaviour changes to a second PR, if possible - in particular, we might want a deprecation cycle, or even a way to explicitly request the old behaviour. But I guess we can review it all together, and split it up once everything else looks good.

@@ -2877,8 +2880,13 @@ OBJECT_@func@(char **args, npy_intp *dimensions, npy_intp *steps, void *data)
            return;
        }

#if @out_bool@
        *((npy_bool *)op1) = PyObject_IsTrue(ret);
Member

This handles -1 badly

Member

We should just break the loop, as we do above

Member Author

good point, will fix

        if isint(x):
            return nfunc(float(x))
        raise TypeError(_notimplemented_msg(name, type(x)))
    return mathfunc
Member

@eric-wieser Apr 28, 2017

nitpick: set mathfunc.__name__ = nfunc.__name__

    ),
'power':
    Ufunc(2, 1, None,
          docstrings.get('numpy.core.umath.power'),
          None,
          TD(ints),
          TD(inexact, f='pow', astype={'e':'f'}),
          TD(O, f='npy_ObjectPower'),
          TD(O, f='PyNumber_Power'),
Member

This doesn't look safe to me - we don't pass enough arguments to PyNumber_Power in the loop - npy_ObjectPower is there to add an extra Py_None argument

Member Author

yes, fixing

exp, expm1, log, log10, log1p, sqrt, ceil, trunc, fabs, floor,
arctan2, conjugate, square, reciprocal, _ones_like, sign, degrees,
rad2deg, radians, deg2rad, exp2, log2, cbrt, rint, spacing#
#out_bool = 1, 1, 1, 1, 1, 0*37#
Member

Not a fan of this information being duplicated in python and C

npy_cache_import("numpy.core._objectmath", "@func@", &fallback);
if (fallback == NULL) {
    return;
}
Member

Can we put these cache imports, and/or the function name, in data instead? Then we'd only need one loop function, rather than one for each ufunc (or perhaps two - one for boolean return values)

NPY_NO_EXPORT int
_is_float_complex(PyObject *op)
{
    if (PyFloat_Check(op) || PyComplex_Check(op)) {
Contributor

This won't work for objects that define __float__ or __complex__ methods.


/* Emulates Python's 'not b' behavior */
static PyObject *
npy_ObjectLogicalNot(PyObject *i1)
Contributor

Why not modify these to return proper numpy bools instead of python objects, rather than getting rid of them?

@@ -2189,6 +2189,8 @@ def norm(x, ord=None, axis=None, keepdims=False):

    if not issubclass(x.dtype.type, (inexact, object_)):
        x = x.astype(float)
    elif issubclass(x.dtype.type, object_):
        x = x*1.0
Contributor

Adding support for anything with __float__ would likely eliminate the need for this step.

    if x == -1:
        return -1
    if x == 0:
        return np.reciprocal(0).item()
Member

This gives RuntimeWarning: divide by zero encountered in reciprocal - was that intended?

Member Author

I don't remember the intention anymore, but maybe. Doesn't the corresponding int32/int64 ufunc raise the warning too? Then the object ufunc probably should as well.

>>> np.reciprocal([1,0,1])
/usr/bin/ipython3:1: RuntimeWarning: divide by zero encountered in reciprocal
  #!/usr/bin/python3
/usr/bin/ipython3:1: RuntimeWarning: invalid value encountered in reciprocal
  #!/usr/bin/python3
array([                   1, -9223372036854775808,                    1])

@mattip added the "triage review" label Mar 17, 2020
@mattip removed the "triage review" and "54 - Needs decision" labels Mar 25, 2020
@mattip
Member

mattip commented Mar 25, 2020

In discussion we did not feel that this should move forward as is. It would be best if this could be done outside numpy, perhaps via a new dtype or using one of the protocols somehow. It needs a champion and a clear use case. Closing, please reopen with a clear plan for moving it forward.

@mattip mattip closed this Mar 25, 2020