ENH: Add 'crosstab', for creating contingency tables. #4958
Conversation
Interesting.... What would be the use case? |
For assessing interest, the mailing list is a better place to ask -- most On Tue, Aug 12, 2014 at 2:48 PM, Warren Weckesser notifications@github.com
Nathaniel J. Smith |
@njsmith: 👍 |
# Count the occurrences of the unique tuples by applying np.add.at
# to the inverses.
shape = [len(u) for u in unique_elements]
count = np.zeros(shape, dtype=np.int64)
the type should probably be np.intp
a count cannot exceed the size of a pointer (on relevant architectures)
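A minimal sketch of the counting step under review, assuming a one-dimensional case with a made-up `inverse` array (the names below are illustrative and are not taken from the PR). It uses `np.intp` as suggested, and shows why `np.add.at` is used rather than a plain indexed `+=`: repeated indices must accumulate.

```python
import numpy as np

# Hypothetical inverse indices, of the kind returned by
# np.unique(..., return_inverse=True). Not data from the PR.
inverse = np.array([0, 0, 1, 0, 2])

# np.intp is NumPy's index-sized integer type, as suggested in the review.
count = np.zeros(3, dtype=np.intp)
np.add.at(count, inverse, 1)   # unbuffered: repeated indices accumulate

# For comparison: fancy-indexed += is buffered, so each repeated
# index is incremented only once.
buffered = np.zeros(3, dtype=np.intp)
buffered[inverse] += 1

print(count)     # [3 1 1]
print(buffered)  # [1 1 1]
```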
@WarrenWeckesser Looks like there was interest in something like this, but the conversation petered out. Where are you with this now? |
I've been letting it stew for a bit. Eelco's code (linked to in the mailing list thread) suggests some nice extensions to the API, but I haven't thought too deeply about them yet, and so I haven't put together a complete response. |
@WarrenWeckesser Still stewing? |
@charris: I've pinged the mailing list again. Let's see if there is any strong opposition to this. |
67bc65d to 3b3e5f2
I renamed I didn't really need to do that, though, because I'm closing the PR. There has been no interest from any numpy devs on the mailing list, which I interpret as a collective "meh". This function might make more sense as part of the |
@josef-pkt @jseabold - any comments here? |
I didn't think too much about it; I don't think I would have a use for it. I'm not using pandas groupby and similar myself either, but other statsmodels contributors do, and we have pretty much settled on pandas for those because most of our users use it also. My general impression (given that I'm lagging several versions behind and not really up to date): Various multivariate extensions to np.unique, np.bincount and similar would be useful. (I never sent my reply to the request for comments on Jaime's bincount improvements, #4330.) Statsmodels is missing these for some vectorized use cases. The table version in this PR is a bit "lonely" (remember, it's my impression). Heavy users will use pandas, which provides a lot more, and for specific uses like contingency tables, I also think it would fit better in scipy.stats. I'm not saying that this shouldn't be in numpy; it's just an uphill struggle, or it needs to have a clear, wider set of use cases besides data analysis. (Aside: I've recently read a blog post by an instructor of online courses for data science who mentioned that with each release of pandas they can teach less numpy and more pandas.) |
Just for the record, I still find this interesting, but I have a feeling that when we finally figure this type of functionality out, it will look more like |
Hi @WarrenWeckesser - I just found out that the Pandas crosstab is crazy slow, and so not very useful for things like simulations. I found I had to write my own function, like this:

    import numpy as np

    def fast_crosstab(firsts, seconds):
        # Sorted unique values label the rows and columns of the table.
        rows = np.unique(firsts)
        cols = np.unique(seconds)
        tab = np.zeros((len(rows), len(cols)), dtype=int)
        for col_i, col_val in enumerate(cols):
            # Elements of `firsts` paired with this column's value.
            these_firsts = firsts[seconds == col_val]
            for row_i, row_val in enumerate(rows):
                tab[row_i, col_i] = np.count_nonzero(these_firsts == row_val)
        return tab

This is about 100 times faster than Pandas |
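The double loop above makes one comparison pass per table cell. As a hedged sketch of an alternative (not the function from this PR, and not benchmarked here), the same table can be built in a single pass by flattening the pair of inverse indices and counting them with `np.bincount`. The name `bincount_crosstab` is made up for illustration, and one-dimensional inputs of equal length are assumed.

```python
import numpy as np

def bincount_crosstab(firsts, seconds):
    """Two-way contingency table via np.bincount (illustrative sketch)."""
    # Map each element to the position of its value among the sorted uniques.
    rows, row_idx = np.unique(firsts, return_inverse=True)
    cols, col_idx = np.unique(seconds, return_inverse=True)
    shape = (len(rows), len(cols))
    # Combine each (row, col) index pair into one flat cell index.
    flat = np.ravel_multi_index((row_idx.ravel(), col_idx.ravel()), shape)
    # Count how often each cell index occurs, then restore the 2-D shape.
    return np.bincount(flat, minlength=shape[0] * shape[1]).reshape(shape)

firsts = np.array([1, 1, 2, 2, 2])
seconds = np.array([0, 1, 0, 0, 1])
tab = bincount_crosstab(firsts, seconds)
print(tab)
# [[1 1]
#  [2 1]]
```

Rows follow the sorted unique values of `firsts` and columns those of `seconds`, matching the layout of `fast_crosstab`.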
@matthew-brett, have you tried the
I haven't kept track of what alternatives might be available in widely-used libraries. The Pandas function is probably good enough for most use-cases. (Since closing this PR, I've added the option of giving predefined lists of values to be counted to |
Sorry - I somehow missed your reply. Yes, your function is extremely efficient, I cannot improve on its speed. I think this would be a useful addition, personally, because the Pandas implementation is so slow, that it makes it uncomfortable to do simulations, even with small numbers, and relatively few iterations (10000 or so). Your function will also cross-tabulate more than two arrays. |
It could fit in |
Nice suggestion! |
Warren - what do you think about Ralf's suggestion of putting this in |
@WarrenWeckesser - another ping. |
@matthew-brett, sure, this can go into scipy. I'll get a pull request in over the weekend. |
Great - thanks. |
SciPy pull request: scipy/scipy#11352 |
It needs more tests, and it needs to be added to the docs, but I'd like to see if there is any interest in adding this to numpy before doing more work. You can interpret this function as a multi-argument generalization of unique(a, return_counts=True). I've kept the API simple; I'm sure there are many variations that could be useful.
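To illustrate the "multi-argument generalization" idea, here is a rough sketch built on the `np.unique` inverses and `np.add.at`, as in the reviewed snippet. The name `crosstab_sketch` and its return layout are assumptions made for this example; they are not the API proposed in the PR. It expects at least one argument, and all arguments must have the same length.

```python
import numpy as np

def crosstab_sketch(*args):
    """Count the joint occurrences of the unique values in each argument.

    Illustrative only: returns (tuple of unique-value arrays, count array).
    """
    pairs = [np.unique(np.asarray(a), return_inverse=True) for a in args]
    uniques = tuple(u for u, _ in pairs)
    inverses = tuple(inv.ravel() for _, inv in pairs)
    # One axis per argument, one position per unique value.
    count = np.zeros(tuple(len(u) for u in uniques), dtype=np.intp)
    # Unbuffered add, so repeated index tuples accumulate.
    np.add.at(count, inverses, 1)
    return uniques, count

a = [0, 0, 1]
b = [1, 1, 0]
c = [5, 6, 5]

# With a single argument, the counts match np.unique(a, return_counts=True).
(_,), single = crosstab_sketch(a)
print(single)        # [2 1]

# With three arguments, the result is a 3-D table.
_, table = crosstab_sketch(a, b, c)
print(table.shape)   # (2, 2, 2)
```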
Similar functions (but with more bells and whistles) include Pandas' crosstab function, R's table function, and Matlab's crosstab function.
For example, here's the proposed function in action:
Here's the same calculation in R: