Skip to content

Adding isin function for multidimensional arrays #8423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
May 5, 2017
Merged

Conversation

brsr
Copy link
Contributor

@brsr brsr commented Dec 28, 2016

This resolves #8331. I had initially started the request there, and there was discussion on the mailing list, in which everyone seemed pretty happy with the idea. I hadn't realized that there was a separate in1d method for masked arrays, so this includes an isin function for masked arrays too.

@endolith
Copy link
Contributor

I was going to submit this, too, but just using the existing function:

def in_nd(ar1, ar2, assume_unique=False, invert=False):
    """
        ...
    """
    out = np.in1d(ar1, ar2, assume_unique=False, invert=False)
    return out.reshape(ar1.shape)


Returns a boolean array the same length as `ar1` that is True
Returns a boolean array the same shape as `ar1` that is True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Returns a boolean array of the same shape"

>>> test = np.array([0, 1, 2, 5, 0])
>>> states = [0, 2]
>>> mask = np.in1d(test, states)
>>> test = np.array([0, 2, 4, 6]).reshape([2, 2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

test = np.array([[0, 2], [4, 6]])

?

@eric-wieser
Copy link
Member

Could probably do with better parameter names than in1 and in2. Marginally better perhaps is query and values?

@endolith
Copy link
Contributor

Agreed, but not sure what the best names are. Conceptually, in2 is a set, not an (ordered) array

@eric-wieser
Copy link
Member

item, item_set?

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that another possible name for this function is contains, copying the name from operator.contains. That has the arguments backwards, however, which would entail flipping the arguments here, too.

Maybe isin(query, index) for the argument names?

Default is False. ``np.in1d(a, b, invert=True)`` is equivalent
to (but is faster than) ``np.invert(in1d(a, b))``.
Default is False. ``np.isin(a, b, invert=True)`` is equivalent
to (but is faster than) ``np.invert(isin(a, b))``.

.. versionadded:: 1.8.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This versionadded is no longer relevant.

python keyword `in`, for 1-D sequences. ``in1d(a, b)`` is roughly
equivalent to ``np.array([item in b for item in a])``.
`isin` can be considered as an element-wise function version of the
python keyword `in`. ``in1d(ar1, ar2)`` is roughly equivalent to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be isin, not in1d

equivalent to ``np.array([item in b for item in a])``.
`isin` can be considered as an element-wise function version of the
python keyword `in`. ``in1d(ar1, ar2)`` is roughly equivalent to
``in1d = np.vectorize(lambda x: x in ar2); in1d(ar1)``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the original example, which I think is a little clearer than using np.vectorize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original example is not valid for multidimensional arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but you can still use it to describe how this function works on 1D arrays.

@@ -380,16 +381,16 @@ def setxor1d(ar1, ar2, assume_unique=False):
flag2 = flag[1:] == flag[:-1]
return aux[flag2]

def in1d(ar1, ar2, assume_unique=False, invert=False):
def isin(ar1, ar2, assume_unique=False, invert=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely consider more meaningful names here.

@brsr
Copy link
Contributor Author

brsr commented Dec 31, 2016

Went with elements and test_elements to replace ar1 and ar2. It fits with the description in the docstring: "Test whether each element of an array is present in a second array." I felt that calling ar2 a "set" would be confusing, because the function has counterintuitive behavior if ar2 is a standard Python set.

@endolith
Copy link
Contributor

It occurs to me that another possible name for this function is contains, copying the name from operator.contains. That has the arguments backwards, however, which would entail flipping the arguments here, too.

To me, that would mean one array is a sub-array of the other, like contains([[1,2,3],[4,5,6],[7,8,9]], [[1,2],[4,5]]) would be True.

https://docs.python.org/2/reference/expressions.html#membership-test-operations calls the arguments x and s, and calls s a "collection". The first argument could be some synonym of testee or subject, but nothing seems obviously good to me.

@eric-wieser
Copy link
Member

I'd argue the name should still be contains, because your intuition is inconsistent with operator.contains: operator.contains([1,2,3], [2,3]) is false

@brsr
Copy link
Contributor Author

brsr commented Jan 1, 2017

operator.contains and numpy.contains would have inconsistent behavior. Furthermore, ndarray implements __contains__, so numpy.contains and numpy.ndarray.__contains__ would have inconsistent behavior.

Compare:

>>> operator.contains([1, 2, 3], 2)
True
>>> operator.contains([1, 2, 3], [2, 3])
False
>>> operator.contains([[1, 2], [3, 4], [5, 6]], [3, 6])
False

versus

>>> np.in1d(2, [1, 2, 3])
array([ True], dtype=bool)
>>> np.in1d([2, 3], [1, 2, 3])
array([ True,  True], dtype=bool)
>>> np.in1d([3, 6], [[1, 2], [3, 4], [5, 6]])
array([ True,  True], dtype=bool)

@eric-wieser
Copy link
Member

eric-wieser commented Jan 1, 2017

I'm assuming that you mean isin not in1d there.

Your second case is not really any more inconsistent than:

>>> 1 == [1, 2]
False
>>> 1 == np.array([1, 2])
array([ True,  False], dtype=bool)

I think that whatever we name it, differences in behaviour between np.isin and in will cause surprises, and that's not really any different to operator.contains vs np.contains causing those same surprises.

@shoyer
Copy link
Member

shoyer commented Jan 1, 2017

I guess there are enough differences here that isin could make more sense. I like that the name is shorter and is consistent with pandas.

@seberg
Copy link
Member

seberg commented Jan 2, 2017

__contains__ has bad behaviour in any case :/, it would be nice to deprecate the more insane stuff there.

@eric-wieser
Copy link
Member

eric-wieser commented Jan 2, 2017

@seberg: Is the behaviour documented anywhere? The first three google results simply tell me x.__contains__(val) means val in x without a link to what that means, and the third is a StackOverflow post saying "Don't try to use __contains__ in NumPy. It does crazy, useless shit"

@seberg
Copy link
Member

seberg commented Jan 2, 2017

Reading the code of it is the easiest. Basically, it works as long as the lhs is a single element. If it is not a single element, things likely won't make much sense.

Returns a boolean array of the same length as `elements` that is True
where an element of `elements` is in `test_elements` and False otherwise.

See `numpy.isin` for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little nervous about deleting all the parameters on this method, including the notes on when they were added. Can we keep them around, even though it's slightly redundant?

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very close -- would love your help to get this across the finish line!

This should also be mentioned in the docs (doc/source/reference/routines.set.rst) and the release notes (doc/release/1.13.0-notes.rst).

@@ -380,106 +381,130 @@ def setxor1d(ar1, ar2, assume_unique=False):
flag2 = flag[1:] == flag[:-1]
return aux[flag2]

def in1d(ar1, ar2, assume_unique=False, invert=False):
def isin(elements, test_elements, assume_unique=False, invert=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not worth the hassle of changing argument names on an effectively deprecated function and potentially breaking some users. Let's only use the new arguments names for isin.

ar2 : array_like
The values against which to test each value of `ar1`.
test_elements : array_like
The values against which to test each value of `elements`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this argument is flattened.

in1d : (M,) ndarray, bool
The values `ar1[in1d]` are in `ar2`.
isin : ndarray, bool
The values `elements[isin]` are in `test_elements`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this has the same shape as elements.

# Ravel both arrays, behavior for the first array could be different
ar1 = np.asarray(ar1).ravel()
ar2 = np.asarray(ar2).ravel()
# Ravel both arrays, will reshape result later to match elements
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though isin should replace most uses of in1d, I think this code is actually easier to understand when a separate helper function is used for the 1D functionality, in the style of unique/_unique1d in this file. So I would keep the core logic in in1d.

@brsr
Copy link
Contributor Author

brsr commented Feb 15, 2017

So it sounds like we want to leave in1d as it is in 1.12, and the new function isin should just reshape the result of in1d?

@shoyer
Copy link
Member

shoyer commented Feb 15, 2017

@brsr You're certainly welcome to do any internal clean-up that is warranted, but from a public API perspective I would indeed keep in1d mostly changed. I would certainly search for references to in1d in the documentation and replace most of those with isin as the preferred solution, but we want to keep in1d around in an intact state to maximize backwards compatibility.

@brsr
Copy link
Contributor Author

brsr commented Feb 16, 2017

I believe I've caught all the instances of in1d in the docs. The docs for where actually demonstrate the exact use case that I wanted this function for. Let me know what else needs tweaking.

@homu
Copy link
Contributor

homu commented Feb 22, 2017

☔ The latest upstream changes (presumably #8446) made this pull request unmergeable. Please resolve the merge conflicts.

@brsr
Copy link
Contributor Author

brsr commented Apr 10, 2017

@eric-wieser, I'm not following your reasoning. Using the example from the docs, isin is equivalent to np.array([item in b for item in a]) if a and b are 1-D arrays. Expressing that in terms of the names currently used in isin would be something like np.array([element in test_elements for element in elements]). elements is plural because the function tests each singular element in elements.

@eric-wieser
Copy link
Member

eric-wieser commented Apr 10, 2017

elements is plural because the function tests each singular element in elements

Right, but in numpy by default everything is plural, so that's not really helpful.

For instance, np.add(a, b):

  • When passed 0d arrays, does (a + b),
  • When passed 1d arrays, does [(ai + bi) for ai, bi in zip(a, b)]
  • When passed 2d arrays, does [[(aij + bij) for aij, bij in zip(ai, bi)] for ai, bi in zip(a, b) ]
  • ...

It makes the most sense to document this as "calculates a + b, broadcasting over a and b"

So for np.isin(a, bs):

  • When passed 0d a, does (a in bs),
  • When passed 1d a, does [(ai in bs) for ai in a]
  • When passed 2d a, does [[(aij in bs) for aij in ai] for ai in a ]
  • ...

Again, we have a base operation, in parentheses, and then a bunch of numpy broadcasting happening.

Let's be consistent here - we should always describe the function in terms of its lowest-dimensional generalization. So it makes the most sense to document this as "calculates a in bs, broadcasting over a alone"

@eric-wieser
Copy link
Member

eric-wieser commented Apr 10, 2017

Also, we already bikeshedded the name, but...

The more I look at this PR, the harder I find it not to read np.isin as i sin, as python sets a precedent with __iadd__, and I'm used to seeing np.sin.

I know we've picked it for pandas compatibility, but as it's not a method of ndarray, this doesn't help with duck typing.

How do people feel about np.is_in?

@shoyer
Copy link
Member

shoyer commented Apr 10, 2017

How do people feel about np.is_in?

I can see your point, but it looks very stilted to me, so I still prefer isin, which looks like a shortened version of the builtin name isinstance. In context, I don't think I would be confused, because the usage is so different: np.isin returns a boolean array typically used for indexing, rather than floats like np.sin.

@@ -516,7 +516,7 @@ def isin(elements, test_elements, assume_unique=False, invert=False):
Returns
-------
isin : ndarray, bool
Has the same shape as `elements`. The values `elements[isin]`
Has the same shape as `element`. The values `elements[isin]`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed an s here


If `test_elements` is a set (or other non-sequence collection) it will
be converted to an object array with one element, rather than an array
of the values contained in `test_elements`. Converting the set to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth pointing out that this is just the np.array constructor doing its thing, and is not something weird this function chose to do. Mentioning "array constructor" somewhere should do the job

elements = np.array(elements)
return in1d(elements, test_elements, assume_unique=assume_unique,
invert=invert).reshape(elements.shape)
element = np.array(element)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be asarray, or asanyarray if you think this will worked for masked arrays

elements = ma.array(elements)
return in1d(elements, test_elements, assume_unique=assume_unique,
invert=invert).reshape(elements.shape)
element = ma.array(element)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, asarray or asanyarray

ec[0, 0, 1] = True
ec[0, 2, 3] = True
c = isin(a, b)
assert_array_equal(c, ec)
Copy link
Member

@eric-wieser eric-wieser Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test here for type(c)? Should it be MaskedArray or array?

Also, a test of what happens if you call np.isin instead of np.ma.isin might be interesting as well.

ec[1, 2, 2] = True
c = isin(a, b)
assert_array_equal(c, ec)

Copy link
Member

@eric-wieser eric-wieser Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a test for 0d a here

@eric-wieser
Copy link
Member

eric-wieser commented Apr 10, 2017

you might be able to simplify your tests with something like:

isin_slow = np.vectorize(lambda a, b: a in b, excluded={'b'})

# do this for a bunch of things, now that you don't need to find the result by hand
assert_equal(isin_slow(a, b), isin(a, b))

#both args
assert_isin_equal(5, 6)
#zero-d array-like as...
x = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] is 1d with shape (0,), not 0d with shape ().

I'm thinking testing things like np.isin(somearr, np.array(1)), or np.isin(np.array(1), somearr)

a = array(a, mask=mask)
values = [0, 10, 20, 30, 1, 3, 11, 22, 33]
mask = [0, 1, 0, 1, 0, 1, 0, 1, 0]
b = array(values, mask=mask)
Copy link
Member

@eric-wieser eric-wieser Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe even just

array(data=[0, 10, 20, 30,  1,  3, 11, 22, 33],
      mask=[0,  1,  0,  1,  0,  1,  0,  1,  0])

Better already with the alignment

def in1d(ar1, ar2, assume_unique=False, invert=False):
"""
Test whether each element of an array is also present in a second
array.

The output is always a masked array. See `numpy.in1d` for more details.

This function has been deprecated: use `isin` instead.

Deprecated since version 1.13.0.
Copy link
Member

@eric-wieser eric-wieser Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use the .. deprecated:: 1.13.0 syntax. Look around at other files to see how things are worded around it. Some files use .. note as well, I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, numpy/HOWTO_DOCUMENT.rst.txt says something different about deprecation warnings.

def in1d(ar1, ar2, assume_unique=False, invert=False):
"""
Test whether each element of a 1-D array is also present in a second array.

Returns a boolean array the same length as `ar1` that is True
where an element of `ar1` is in `ar2` and False otherwise.

This function has been deprecated: use `isin` instead.

Deprecated since version 1.13.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

c = [(9, 8), (7, 6)]
d = (9, 7)
assert_isin_equal(c, d)
#zero-d scalar as:
Copy link
Member

@eric-wieser eric-wieser Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are good, but tests for 0d arrays are also important, and usually catch more bugs — so it would be good to have those too.

Related questions:

  • should isin(1, 1):
    • return np.array(True)
    • return np.bool_(True)
    • raise TypeError, because 1 is not iterable before coercion to a 0d array
  • should isin(1, np.array(1)):
    • return np.array(True)
    • return np.bool_(True)
  • etc

Also, I think the comments labelling which arg is a scalar are a bit much. This'd be more readable with just the section headings, and some space between ssections

@@ -411,6 +417,8 @@ def in1d(ar1, ar2, assume_unique=False, invert=False):

See Also
--------
isin : Version of this function that preserves the
shape of ar1.
numpy.lib.arraysetops : Module with a number of other functions for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"See also: The module you're looking at right now" is a little weird. @brsr, you don't need to worry about it in this PR though.

can speed up the calculation. Default is False.
invert : bool, optional
If True, the values in the returned array are inverted (that is,
False where an element of `ar1` is in `ar2` and True otherwise).
Copy link
Member

@eric-wieser eric-wieser Apr 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names are out of date. I'd probably ditch this wording all together, and explain it in terms of not in.

array([[ True, False],
[ False, True]], dtype=bool)
>>> element[mask]
array([0, 6])"""
Copy link
Member

@eric-wieser eric-wieser Apr 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: needs an extra newline in this docstring

it will be converted to an object array with one element, rather than an
array of the values contained in `test_elements`. This is a consequence of
the `array` constructor's way of handling non-sequence collections.
Converting the set to a list usually gives the desired behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a good thing to show an example of, for people skipping the notes

@@ -147,6 +147,8 @@
setxor1d Set exclusive-or of 1D arrays with unique elements.
in1d Test whether elements in a 1D array are also present in
another array.
isin Test whether elements in an array are also present in
another array, preserving shape of first array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "test whether each element of one array is present anywhere within another"?

No strong feelings here

@@ -147,8 +147,8 @@
setxor1d Set exclusive-or of 1D arrays with unique elements.
in1d Test whether elements in a 1D array are also present in
another array.
isin Test whether elements in an array are also present in
another array, preserving shape of first array.
isin Test whether each element of one (possibly multidimensional)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why just possibly? I think you can put ND before both of these arrays, to make it explicit compared to its surrounding 1D functions

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me now - nice job!

brsr added 2 commits April 13, 2017 10:28
remove 1D from section description, since they're not all 1D. match order in the list at top of arraysetops.py, where all the ND functions are first and all the 1D ones are after.
@homu
Copy link
Contributor

homu commented Apr 28, 2017

☔ The latest upstream changes (presumably #8996) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -22,6 +22,8 @@ Dropped Support
Deprecations
============

* ``in1d`` is now deprecated in favor of ``isin``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should mark things as deprecated without also raising a DeprecationWarning.

I think it would be best to not deprecate in1d in this PR, so that merging this otherwise nice work can happen in time for 1.13.

You can always make a new PR that adds deprecation once this is merged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be my fault: I may have said something like to suggest that we should deprecate in1d, but I was being sloppy with my language. I don't think we can (or should) ever raise a DeprecationWarning. in1d works just fine, albeit not with as nice an API as isin, and is used all over the place, so there is absolutely no good reason to remove it.

What we can and should do is discourage using in1d in favor of isin for new code. Instead of saying in1d is deprecated, its docstring should say something like: "We recommend using isin instead of in1d for new code".

@eric-wieser
Copy link
Member

Will need a rebase at some point, but if you're happy for all the commits to be squashed into one, then I can do that for you via the github UI

@brsr
Copy link
Contributor Author

brsr commented Apr 28, 2017

@eric-wieser , squashing is fine with me.

@eric-wieser eric-wieser merged commit 69b0c42 into numpy:master May 5, 2017
@eric-wieser
Copy link
Member

Thanks @brsr!

mherkazandjian pushed a commit to mherkazandjian/numpy that referenced this pull request May 30, 2017
This fixes numpygh-8331

Also update the docs for arraysetops to remove the outdated "1D" from the
description, which was already incorrect for np.unique.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: in1d, but preserve shape of ar1
7 participants