Skip to content

_axes.py.scatter() array index out of bound / calling from seaborn #12641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
esvhd opened this issue Oct 26, 2018 · 28 comments
Closed

_axes.py.scatter() array index out of bound / calling from seaborn #12641

esvhd opened this issue Oct 26, 2018 · 28 comments
Milestone

Comments

@esvhd
Copy link
Contributor

esvhd commented Oct 26, 2018

Bug report

Bug summary

Seeing this with matplotlib 3.0.1 and seaborn 0.9. cc @mwaskom

matplotlib.axes._axes.scatter() tests parameter c hits array index out of bound when c is an empty array.

I encountered this while using seaborn.swarmplot() with hue= parameter. Essentially this plots each group (hue, in addition to x-axis category) and some times a group can be empty for a major x-axis category.

Suggest potential fix to test length of c as well before accessing c[0].

        if (c_none or
                co is not None or
                isinstance(c, str) or
                (isinstance(c, collections.Iterable) and 
                    len(c) > 0 and # proposed fix here
                    isinstance(c[0], str))):

Code for reproduction

import pandas as pd
import seaborn as sns

df = pd.DataFrame(dict(category=['a', 'a', 'b', 'b', 'c'],
                       hue=['x', 'y', 'x', 'y', 'x'],
                       values=[1, 2, 3, 4, 3]))
sns.swarmplot('category', 'values', hue='hue', data=df, dodge=True)

Actual outcome

This is the error I see:

IndexError                                Traceback (most recent call last)
<ipython-input-5-0368a082f6a8> in <module>
----> 1 sns.swarmplot('category', 'values', hue='hue', data=df, dodge=True)

h:\anaconda3\lib\site-packages\seaborn\categorical.py in swarmplot(x, y, hue, data, order, hue_order, dodge, orient, color, palette, size, edgecolor, linewidth, ax, **kwargs)
   2989                        linewidth=linewidth))
   2990 
-> 2991     plotter.plot(ax, kwargs)
   2992     return ax
   2993 

h:\anaconda3\lib\site-packages\seaborn\categorical.py in plot(self, ax, kws)
   1444     def plot(self, ax, kws):
   1445         """Make the full plot."""
-> 1446         self.draw_swarmplot(ax, kws)
   1447         self.add_legend_data(ax)
   1448         self.annotate_axes(ax)

h:\anaconda3\lib\site-packages\seaborn\categorical.py in draw_swarmplot(self, ax, kws)
   1429                     kws.update(c=point_colors)
   1430                     if self.orient == "v":
-> 1431                         points = ax.scatter(cat_pos, swarm_data, s=s, **kws)
   1432                     else:
   1433                         points = ax.scatter(swarm_data, cat_pos, s=s, **kws)

h:\anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
   1803                         "the Matplotlib list!)" % (label_namer, func.__name__),
   1804                         RuntimeWarning, stacklevel=2)
-> 1805             return func(ax, *args, **kwargs)
   1806 
   1807         inner.__doc__ = _add_data_doc(inner.__doc__,

h:\anaconda3\lib\site-packages\matplotlib\axes\_axes.py in scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, verts, edgecolors, **kwargs)
   4194                 (isinstance(c, collections.Iterable) and
   4195                     # len(c) > 0 and
-> 4196                     isinstance(c[0], str))):
   4197             c_array = None
   4198         else:

IndexError: index 0 is out of bounds for axis 0 with size 0

Happy to create a PR if we have happy with the proposed length test.

@anntzer
Copy link
Contributor

anntzer commented Oct 26, 2018

The proposed fix looks fine to me.

@jklymak
Copy link
Member

jklymak commented Oct 26, 2018

You may want to look at the selection code in #12572 which squashed a similar bug. If you open a PR, please do add atest so this doesn't happen again. Thanks!

@esvhd
Copy link
Contributor Author

esvhd commented Oct 26, 2018

@mwaskom yes, new in 3.0.1.

This is 3.0.0

if c_none or co is not None:
c_array = None

and in 3.0.1

if (c_none or
co is not None or
isinstance(c, str) or
(isinstance(c, collections.Iterable) and
isinstance(c[0], str))):
c_array = None

@jklymak Will take a look there.

Thx

@montebhoover
Copy link

montebhoover commented Oct 26, 2018

Even with the length check this is still problematic for certain iterables. I currently use the color option to color labeled data, where I label the data with a pandas Series indexed by ID numbers:

ids = range(10,18)
x =pd.Series(np.random.uniform(size=8), index=ids)
y =pd.Series(np.random.uniform(size=8), index=ids)
c =pd.Series([1, 1, 1, 1, 1, 0, 0, 0] , index=ids)
plt.scatter(x, y, c)

This plot allows for coloring the plotted points according to their label in v3.0.0. However, now when we check isinstance(c[0], str), c[0] will result in a KeyError because the pandas Series passed into c is indexed 10-17.

I have at lease one case of this that is breaking between 3.0.0 and 3.0.1.

@esvhd
Copy link
Contributor Author

esvhd commented Oct 26, 2018

@montebhoover
In your example, does passing plt.scatter(x, y, c.values) get around the KeyError issue?

@montebhoover
Copy link

Yes that works for me - thanks! Does it work for both of our cases if the proposed fix is:

        if (c_none or
                co is not None or
                isinstance(c, str) or
                (isinstance(c, collections.Sequence) and  # change Iterable to Sequence
                    len(c) > 0 and  # proposed fix here
                    isinstance(c[0], str))):

This allows me to still pass in a pandas Series:

>>> isinstance(pd.Series(), collections.Sequence)
False

but would that cover all the relevant cases? I believe a numpy ndarray is also not a Sequence, so that could cause problems if it contains strings.

For my use I need to use the c.values fix for now, but I'm just trying to think of a solution that provides compatibility between 3.0.0 and 3.0.1.

@jklymak
Copy link
Member

jklymak commented Oct 27, 2018

Pandas series are not iterable? What does no.isiterable give for them?

@esvhd
Copy link
Contributor Author

esvhd commented Oct 27, 2018 via email

@esvhd
Copy link
Contributor Author

esvhd commented Oct 27, 2018

I think the real issue here is isinstance(c[0], str) - it's a fast but less robust python way to check if c is a iterable of strings. If we had explicit types then we could have avoided this but that's for another day.

Anyway, to make this work for lists, numpy arrays and pandas series, we need to avoid indexing into c with [0]. Could try that by doing one of the following, it only tests the first item of an iterable, which is what the original code does. Testing all is more robust but clearly could be slow if c is big.

import itertools as itr

def _is_iterable_str(x):
    # since we also test if x is collections.Iterable
    # take 1st item, head is an iterator
    head = itr.islice(x, 1)
    # test if first item in head is str
    return isinstance(next(head), str)

# in _axes.py.scatter()...
if (c_none or
    co is not None or
    isinstance(c, str) or
    (isinstance(c, collections.Iterable) and 
     # proposed fix 2 lines below
     len(c) > 0 and 
     _is_iterable_str(x))):

@jklymak @anntzer What do you guys think? It might be a bit of an overkill but would work for python lists, numpy array or pandas series, or any iterables.

@anntzer
Copy link
Contributor

anntzer commented Oct 27, 2018

Why not just next(iter(x)), which seems just as fine as going through islice? Of course there's also the issue of whether it would be problematic to advance a non-rewindable generator (not sure).
Note that there are other places where we use x[0] for similar pruposes so they should get the same fix (hopefully factored out as a helper).

@esvhd
Copy link
Contributor Author

esvhd commented Oct 27, 2018 via email

@anntzer
Copy link
Contributor

anntzer commented Oct 27, 2018

private (prefixed by underscore) in cbook.

@esvhd
Copy link
Contributor Author

esvhd commented Oct 27, 2018

Ok, going to be traveling tomorrow but here's some stuff to get us going. I ended up writing a more general helper function to test element type in an Iterable.

This function below is what I plan to add in cbook.__init__.py, which is then called in _axes.scatter() instead of testing c[0].

The function basically tests if all elements are of the expected type by default, but can test only the first N items as well if specified. I ended up going with islice instead of iter for the first N part but happy to improve if there's a better way.

One thing on this - if the iterable being tested is empty, i.e. len(x) == 0, this function currently returns True. Open to discussion - should this be the expected result?

@jklymak Haven't looked into a test case yet but will give that some thoughts.

Let me know what you guys think.

# in cbook.__init__.py
def _is_iterable_type(x, expected_type, first_n: int = None):
    """
    Tests if the type of elements in an iterable is a given type.

    Parameters:
    -----------
    x:
        Iterable to test

    expected_type:
        Expected type, e.g. expected_type=str

    first_n: (int)
        default None, non-zero positive integer. Tests first N elements in
        x. If None then test all elements.

    Returns:
    --------
    result: (boolean)

    Examples:
    ---------
    >>> # test if first element in a is a string
    >>> _is_iterable_type(a, str, 1)

    """
    # At the moment if x is empty, this function returns True
    # if len(x) < 1:
    #     return False

    if first_n is not None:
        if first_n < 1:
            raise ValueError('Parameter first_n must be an integer greater '
                             'than 0.')
        test_range = itertools.islice(x, first_n)
    else:
        test_range = iter(x)

    results = (isinstance(u, expected_type) for u in test_range)

    return all(results)

@jklymak
Copy link
Member

jklymak commented Oct 27, 2018

Sorry for my typo above but does np.is_iterable help?

@esvhd
Copy link
Contributor Author

esvhd commented Oct 28, 2018

Less about testing for iterable here, it's caused by c[0] for testing if the elements in c is of type str.

@anntzer
Copy link
Contributor

anntzer commented Oct 28, 2018

The function basically tests if all elements are of the expected type by default, but can test only the first N items as well if specified.

Would stick to just testing the first one: it seems needlessly costly to iterate a potentially very long collection in pure python code, and it's not as if we had any sensible behavior for mixed types anyways (I think).

@esvhd
Copy link
Contributor Author

esvhd commented Oct 28, 2018 via email

@tacaswell tacaswell added this to the v3.0.x milestone Oct 28, 2018
@tacaswell
Copy link
Member

While the general version of this sounds good, I am skeptical of adding a new helper function in a patch release.

Wouldn't the simplest solution be to use https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/cbook/__init__.py#L1606 instead of c[0] and add a pd.Series test ?

@esvhd
Copy link
Contributor Author

esvhd commented Oct 29, 2018

safe_first_element() feels like an easy fix.

I wrote test cases for the general method proposed as well. But happy to do either.

I noticed that there isn't any dependency on pandas (couldn't find any reference to import pandas in source). If I add a test case for pandas it'd introduce a dependency. Does this have any undesirable implications?

I know that pd.Series does work with next(iter(x)).

@timhoffm
Copy link
Member

We have a pd fixture to enable pandas-related tests but not introduce it as mandatory dependency.

@esvhd
Copy link
Contributor Author

esvhd commented Oct 29, 2018 via email

@timhoffm
Copy link
Member

Where can I find an example?

E.g.

def test_pandas_pcolormesh(pd):
and following tests

@massich
Copy link

massich commented Oct 29, 2018

see #12664.

IMHO, this is an easy way to hit the same bug.

import matplotlib.pyplot as plt
import numpy as np

n_points = 4
xs = ys = zs = np.full(n_points, np.nan)
colors = list(np.full(2, 'k'))
plt.scatter(xs, ys, zs, c=colors)
plt.show()

@jklymak
Copy link
Member

jklymak commented Oct 29, 2018

@massich I don't quite understand that example. That errors on colors not being the same length as xs, etc. Which seems correct to me.

@massich
Copy link

massich commented Oct 29, 2018

@jklymak you are completely right. I mess up while doing the MWE. I'll also close #12664

@massich
Copy link

massich commented Oct 29, 2018

This is the right snipped

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

n_points = 4
xs = ys = zs = np.full(n_points, np.nan)
colors = list(np.full(n_points, 'k'))
ax.scatter(xs, ys, zs, c=colors)
plt.show()

@esvhd
Copy link
Contributor Author

esvhd commented Oct 30, 2018

Pushed a fix using @tacaswell 's suggestion to use cbook.safe_first_element(). Added a test case for pd.Series for cbook.safe_first_element(), thanks @timhoffm for the pointer.

@tacaswell The test case for pd.Series deliberately used an non-zero-starting index.

@montebhoover @massich this should fix both of your issues.

Let me know what you guys think.

@anntzer
Copy link
Contributor

anntzer commented Nov 4, 2018

Closed by #12673, afaict.

@anntzer anntzer closed this as completed Nov 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants