ENH: Add 'crosstab', for creating contingency tables. #4958


Closed

Conversation

WarrenWeckesser
Member

It needs more tests, and it needs to be added to the docs, but I'd like to see if there is any interest in adding this to numpy before doing more work. You can interpret this function as a multi-argument generalization of unique(a, return_counts=True).

I've kept the API simple; I'm sure there are many variations that could be useful.

Similar functions (but with more bells and whistles) include Pandas' crosstab function, R's table function, and Matlab's crosstab function. For example, here's the proposed function in action:

In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

In [9]: (xvals, yvals), counts = crosstab(x, y)

In [10]: xvals
Out[10]: array([1, 2])

In [11]: yvals
Out[11]: array([3, 4, 5])

In [12]: counts
Out[12]:
array([[3, 1, 0],
       [1, 1, 3]])

Here's the same calculation in R:

> x <- c(1,1,1,1,2,2,2,2,2)
> y <- c(3,4,3,3,3,4,5,5,5)
> table(x, y)
   y
x   3 4 5
  1 3 1 0
  2 1 1 3
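
For comparison, the pandas function mentioned above produces the same table. This is just a sketch for reference (assuming a reasonably recent pandas; pd.crosstab accepts plain array-likes):

import pandas as pd

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

# pandas labels the unnamed inputs 'row_0' and 'col_0'; the counts are the
# same 2x3 table as in the examples above.
pd.crosstab(x, y)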

@jaimefrio
Member

Interesting.... What would be the use case?

@njsmith
Member

njsmith commented Aug 12, 2014

For assessing interest, the mailing list is a better place to ask -- most of the people who might be interested aren't reading this :-)

@WarrenWeckesser
Member Author

@njsmith: 👍
@jaimefrio: Look for my email on the mailing list.

# Count the occurrences of the unique tuples by applying np.add.at
# to the inverses.
shape = [len(u) for u in unique_elements]
count = np.zeros(shape, dtype=np.int64)
Contributor

The type should probably be np.intp; a count cannot exceed the size of a pointer (on relevant architectures).
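
For context, here is a minimal, self-contained sketch of the counting approach the snippet above comes from: get the unique values and inverse indices of each input with np.unique, then accumulate with np.add.at. The name crosstab_sketch and the exact structure are illustrative rather than the PR's code verbatim, and the dtype is switched to np.intp as suggested.

import numpy as np

def crosstab_sketch(*args):
    # Unique values of each input array, plus the index of each element
    # within its own array of unique values.
    unique_elements, inverses = zip(*(np.unique(a, return_inverse=True)
                                      for a in args))
    # Count the occurrences of the unique tuples by applying np.add.at
    # to the inverses (unbuffered, so repeated indices accumulate).
    shape = [len(u) for u in unique_elements]
    count = np.zeros(shape, dtype=np.intp)
    np.add.at(count, inverses, 1)
    return unique_elements, count

# (xvals, yvals), counts = crosstab_sketch(x, y) reproduces the 2x3 table above.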

@WarrenWeckesser changed the title from "ENH: Add 'count_unique', for creating contingency tables." to "ENH: Add 'table', for creating contingency tables." on Aug 13, 2014
@charris
Member

charris commented Aug 23, 2014

@WarrenWeckesser Looks like there was interest in something like this, but the conversation petered out. Where are you with this now?

@WarrenWeckesser
Member Author

I've been letting it stew for a bit. Eelco's code (linked to in the mailing list thread) suggests some nice extensions to the API, but I haven't thought too deeply about them yet, and so I haven't put together a complete response.

@charris
Member

charris commented Jan 25, 2015

@WarrenWeckesser Still stewing?

@WarrenWeckesser
Member Author

@charris: I've pinged the mailing list again. Let's see if there is any strong opposition to this.

@WarrenWeckesser changed the title from "ENH: Add 'table', for creating contingency tables." to "ENH: Add 'crosstab', for creating contingency tables." on Jan 28, 2015
@WarrenWeckesser
Member Author

I renamed table to crosstab and rebased, because of the conflict with pylab.table.

I didn't really need to do that, though, because I'm closing the PR. There has been no interest from any numpy devs on the mailing list, which I interpret as a collective "meh". This function might make more sense as part of the contingency module of scipy.stats, and anyone who needs more powerful grouping and counting functions can use pandas.

@matthew-brett
Contributor

@josef-pkt @jseabold - any comments here?

@josef-pkt

I didn't think too much about it, and I don't think I would have a use for it myself. I'm not using pandas groupby and similar either, but other statsmodels contributors do, and we have pretty much settled on pandas for those operations because most of our users use it as well.

My general impression (given that I'm lagging several versions behind and not really up to date):

Various multivariate extensions to np.unique, np.bincount, and similar would be useful. (I never sent my reply to the request for comments on Jaime's bincount improvements, #4330.) Statsmodels is missing these for some vectorized use cases.

The table version in this PR is a bit "lonely" (again, this is just my impression). Heavy users will use pandas, which provides a lot more, and for specific uses like contingency tables I think it would fit better in scipy.stats.
Similar to the recarray utilities, which I haven't seen mentioned on the mailing list in some time, the urgent need has gone away with the increase in functionality and popularity of pandas.

I'm not saying that this shouldn't be in numpy; it's just an uphill struggle, or it needs a clearly wider set of use cases beyond data analysis.

(aside: I've recently read a blog post by an instructor of online courses for data science who mentioned that with each release of pandas they can teach less numpy and more pandas.)

@jaimefrio
Member

Just for the record, I still find this interesting, but I have a feeling that when we finally figure this type of functionality out, it will look more like unique with an axis parameter, along the lines of what @joferkington once proposed in a PR that also ended up being closed without merging. In any case, thanks for stirring up the discussion, Warren.

@matthew-brett
Contributor

Hi @WarrenWeckesser - I just found out that the Pandas crosstab is crazy slow, and so not very useful for things like simulations. I found I had to write my own function, like this:

import numpy as np

def fast_crosstab(firsts, seconds):
    # Contingency table of two NumPy arrays; cells whose count is zero
    # are kept in the table.
    rows = np.unique(firsts)
    cols = np.unique(seconds)
    tab = np.zeros((len(rows), len(cols)), dtype=int)
    for col_i, col_val in enumerate(cols):
        these_firsts = firsts[seconds == col_val]
        for row_i, row_val in enumerate(rows):
            tab[row_i, col_i] = np.count_nonzero(these_firsts == row_val)
    return tab

This is about 100 times faster than Pandas' crosstab. It isn't the same as np.unique with the axis keyword, because it also reports cells with a count of 0. Is there anything better or more standard out there at this point? I would really like to use something like this for teaching, but I don't want to drop this lump of code on my students.
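
To make that difference concrete, here is a small sketch of the np.unique approach: with axis=0 it only reports the (x, y) pairs that actually occur, so the zero cells never show up in the output.

import numpy as np

x = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2])
y = np.array([3, 4, 3, 3, 3, 4, 5, 5, 5])

# Unique (x, y) pairs and their counts; there is no entry for the
# unobserved pair (1, 5).
pairs, counts = np.unique(np.column_stack([x, y]), axis=0, return_counts=True)
# pairs  -> [[1, 3], [1, 4], [2, 3], [2, 4], [2, 5]]
# counts -> [3, 1, 1, 1, 3]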

@WarrenWeckesser
Member Author

WarrenWeckesser commented Nov 30, 2019

@matthew-brett, have you tried the crosstab function in this pull request? It appears to be faster than your fast_crosstab, at least for one example:

In [10]: a = np.random.randint(0, 10, size=1000)

In [11]: b = np.random.randint(0, 10, size=1000)

In [12]: %timeit c = fast_crosstab(a, b)
425 µs ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

crosstab from this PR:

In [13]: %timeit elements, tab = crosstab(a, b)
215 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I haven't kept track of what alternatives might be available in widely-used libraries. The Pandas function is probably good enough for most use-cases.

(Since closing this PR, I've added the option of giving predefined lists of values to be counted to crosstab, but the updated code is not in a public repo at the moment.)

@matthew-brett
Contributor

Sorry - I somehow missed your reply. Yes, your function is extremely efficient; I cannot improve on its speed.

I think this would be a useful addition, personally, because the Pandas implementation is so slow that it makes simulations uncomfortable, even with small arrays and relatively few iterations (10,000 or so). Your function will also cross-tabulate more than two arrays.

@rgommers
Member

It could fit in scipy.stats.contingency.

@matthew-brett
Contributor

Nice suggestion!

@matthew-brett
Contributor

Warren - what do you think about Ralf's suggestion of putting this in scipy.stats.contingency?

@matthew-brett
Contributor

@WarrenWeckesser - another ping.

@WarrenWeckesser
Member Author

@matthew-brett, sure, this can go into scipy. I'll get a pull request in over the weekend.

@matthew-brett
Contributor

Great - thanks.

@WarrenWeckesser
Member Author

SciPy pull request: scipy/scipy#11352
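
For anyone arriving here later: the function that came out of that pull request lives in scipy.stats.contingency. A minimal usage sketch, assuming a SciPy version that includes it (check the SciPy docs for the exact return type):

from scipy.stats.contingency import crosstab

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

# The result holds the unique values of each input and the table of counts.
(xvals, yvals), counts = crosstab(x, y)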
