ENH: Add 'crosstab', for creating contingency tables. #4958


Closed

Conversation

WarrenWeckesser
Member

It needs more tests, and it needs to be added to the docs, but I'd like to see if there is any interest in adding this to numpy before doing more work. You can interpret this function as a multi-argument generalization of unique(a, return_counts=True).

I've kept the API simple; I'm sure there are many variations that could be useful.

Similar functions (but with more bells and whistles) include Pandas' crosstab function, R's table function, and Matlab's crosstab function. For example, here's the proposed function in action:

In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

In [9]: (xvals, yvals), counts = crosstab(x, y)

In [10]: xvals
Out[10]: array([1, 2])

In [11]: yvals
Out[11]: array([3, 4, 5])

In [12]: counts
Out[12]:
array([[3, 1, 0],
       [1, 1, 3]])

Here's the same calculation in R:

> x <- c(1,1,1,1,2,2,2,2,2)
> y <- c(3,4,3,3,3,4,5,5,5)
> table(x, y)
   y
x   3 4 5
  1 3 1 0
  2 1 1 3
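
For comparison, the pandas function mentioned above produces the same table. This is just a sketch for reference (assuming a reasonably recent pandas; pd.crosstab accepts plain array-likes):

import pandas as pd

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

# pandas labels the unnamed inputs 'row_0' and 'col_0'; the counts are the
# same 2x3 table as in the examples above.
pd.crosstab(x, y)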

@jaimefrio
Member

Interesting.... What would be the use case?

@njsmith
Member

njsmith commented Aug 12, 2014

For assessing interest, the mailing list is a better place to ask -- most of the people who might be interested aren't reading this :-)

@WarrenWeckesser
Member Author

@njsmith: 👍
@jaimefrio: Look for my email on the mailing list.

# Count the occurrences of the unique tuples by applying np.add.at
# to the inverses.
shape = [len(u) for u in unique_elements]
count = np.zeros(shape, dtype=np.int64)
Contributor

The type should probably be np.intp; a count cannot exceed the size of a pointer (on relevant architectures).
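
For context, here is a minimal, self-contained sketch of the counting approach the snippet above comes from: get the unique values and inverse indices of each input with np.unique, then accumulate with np.add.at. The name crosstab_sketch and the exact structure are illustrative rather than the PR's code verbatim, and the dtype is switched to np.intp as suggested.

import numpy as np

def crosstab_sketch(*args):
    # Unique values of each input array, plus the index of each element
    # within its own array of unique values.
    unique_elements, inverses = zip(*(np.unique(a, return_inverse=True)
                                      for a in args))
    # Count the occurrences of the unique tuples by applying np.add.at
    # to the inverses (unbuffered, so repeated indices accumulate).
    shape = [len(u) for u in unique_elements]
    count = np.zeros(shape, dtype=np.intp)
    np.add.at(count, inverses, 1)
    return unique_elements, count

# (xvals, yvals), counts = crosstab_sketch(x, y) reproduces the 2x3 table above.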

@WarrenWeckesser changed the title from "ENH: Add 'count_unique', for creating contingency tables." to "ENH: Add 'table', for creating contingency tables." on Aug 13, 2014
@charris
Member

charris commented Aug 23, 2014

@WarrenWeckesser Looks like there was interest in something like this, but the conversation petered out. Where are you with this now?

@WarrenWeckesser
Member Author

I've been letting it stew for a bit. Eelco's code (linked to in the mailing list thread) suggests some nice extensions to the API, but I haven't thought too deeply about them yet, and so I haven't put together a complete response.

@charris
Member

charris commented Jan 25, 2015

@WarrenWeckesser Still stewing?

@WarrenWeckesser
Member Author

@charris: I've pinged the mailing list again. Let's see if there is any strong opposition to this.

@WarrenWeckesser changed the title from "ENH: Add 'table', for creating contingency tables." to "ENH: Add 'crosstab', for creating contingency tables." on Jan 28, 2015
@WarrenWeckesser
Member Author

I renamed table to crosstab and rebased, because of the conflict with pylab.table.

I didn't really need to do that, though, because I'm closing the PR. There has been no interest from any numpy devs on the mailing list, which I interpret as a collective "meh". This function might make more sense as part of the contingency module of scipy.stats, and anyone who needs more powerful grouping and counting functions can use pandas.

@matthew-brett
Contributor

@josef-pkt @jseabold - any comments here?

@josef-pkt

I didn't think too much about it, and I don't think I would have a use for it myself. I'm not using pandas groupby and similar either, but other statsmodels contributors do, and we have pretty much settled on pandas for those operations because most of our users use it as well.

My general impression (given that I'm lagging several versions behind and not really up to date):

Various multivariate extensions to np.unique, np.bincount, and similar would be useful. (I never sent my reply to the request for comments on Jaime's bincount improvements, #4330.) Statsmodels is missing these for some vectorized use cases.

The table version in this PR is a bit "lonely" (again, this is just my impression). Heavy users will use pandas, which provides a lot more, and for specific uses like contingency tables I think it would fit better in scipy.stats.
Similar to the recarray utilities, which I haven't seen mentioned on the mailing list in some time, the urgent need has gone away with the increase in functionality and popularity of pandas.

I'm not saying that this shouldn't be in numpy; it's just an uphill struggle, or it needs a clearly wider set of use cases beyond data analysis.

(aside: I've recently read a blog post by an instructor of online courses for data science who mentioned that with each release of pandas they can teach less numpy and more pandas.)

@jaimefrio
Member

Just for the record, I still find this interesting, but I have a feeling that when we finally figure this type of functionality out, it will look more like unique with an axis parameter, along the lines of what @joferkington once proposed in a PR that also ended up being closed without merging. In any case, thanks for stirring up the discussion, Warren.

@matthew-brett
Contributor

Hi @WarrenWeckesser - I just found out that the Pandas crosstab is crazy slow, and so not very useful for things like simulations. I found I had to write my own function, like this:

import numpy as np

def fast_crosstab(firsts, seconds):
    # Contingency table of two NumPy arrays; cells whose count is zero
    # are kept in the table.
    rows = np.unique(firsts)
    cols = np.unique(seconds)
    tab = np.zeros((len(rows), len(cols)), dtype=int)
    for col_i, col_val in enumerate(cols):
        these_firsts = firsts[seconds == col_val]
        for row_i, row_val in enumerate(rows):
            tab[row_i, col_i] = np.count_nonzero(these_firsts == row_val)
    return tab

This is about 100 times faster than Pandas' crosstab. It isn't the same as np.unique with the axis keyword, because it also reports cells with a count of 0. Is there anything better or more standard out there at this point? I would really like to use something like this for teaching, but I don't want to drop this lump of code on my students.
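
To make that difference concrete, here is a small sketch of the np.unique approach: with axis=0 it only reports the (x, y) pairs that actually occur, so the zero cells never show up in the output.

import numpy as np

x = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2])
y = np.array([3, 4, 3, 3, 3, 4, 5, 5, 5])

# Unique (x, y) pairs and their counts; there is no entry for the
# unobserved pair (1, 5).
pairs, counts = np.unique(np.column_stack([x, y]), axis=0, return_counts=True)
# pairs  -> [[1, 3], [1, 4], [2, 3], [2, 4], [2, 5]]
# counts -> [3, 1, 1, 1, 3]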

@WarrenWeckesser
Member Author

WarrenWeckesser commented Nov 30, 2019

@matthew-brett, have you tried the crosstab function in this pull request? It appears to be faster than your fast_crosstab, at least for one example:

In [10]: a = np.random.randint(0, 10, size=1000)

In [11]: b = np.random.randint(0, 10, size=1000)

In [12]: %timeit c = fast_crosstab(a, b)
425 µs ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

crosstab from this PR:

In [13]: %timeit elements, tab = crosstab(a, b)
215 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I haven't kept track of what alternatives might be available in widely-used libraries. The Pandas function is probably good enough for most use-cases.

(Since closing this PR, I've added the option of giving predefined lists of values to be counted to crosstab, but the updated code is not in a public repo at the moment.)

@matthew-brett
Contributor

Sorry - I somehow missed your reply. Yes, your function is extremely efficient; I cannot improve on its speed.

I think this would be a useful addition, personally, because the Pandas implementation is so slow that it makes simulations uncomfortable, even with small arrays and relatively few iterations (10,000 or so). Your function will also cross-tabulate more than two arrays.

@rgommers
Member

It could fit in scipy.stats.contingency.

@matthew-brett
Contributor

Nice suggestion!

@matthew-brett
Contributor

Warren - what do you think about Ralf's suggestion of putting this in scipy.stats.contingency?

@matthew-brett
Contributor

@WarrenWeckesser - another ping.

@WarrenWeckesser
Member Author

@matthew-brett, sure, this can go into scipy. I'll get a pull request in over the weekend.

@matthew-brett
Contributor

Great - thanks.

@WarrenWeckesser
Member Author

SciPy pull request: scipy/scipy#11352
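
For anyone arriving here later: the function that came out of that pull request lives in scipy.stats.contingency. A minimal usage sketch, assuming a SciPy version that includes it (check the SciPy docs for the exact return type):

from scipy.stats.contingency import crosstab

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

# The result holds the unique values of each input and the table of counts.
(xvals, yvals), counts = crosstab(x, y)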
