BUG: groupby().size gives casting error on 32 bit platform #11189

Closed
jorisvandenbossche opened this issue Sep 25, 2015 · 13 comments · Fixed by #11191
Labels
Bug Compat pandas objects compatability with Numpy or Python functions
Milestone

Comments

@jorisvandenbossche
Member

In [1]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "grade":['a', 'b', 'b', 'a', 'a', 'e']})

In [2]: df.groupby("grade").size()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-d8e387418f9d> in <module>()
----> 1 df.groupby("grade").size()

/home/joris/scipy/pandas/pandas/core/groupby.pyc in size(self)
    818 
    819         """
--> 820         return self.grouper.size()
    821 
    822     sum = _groupby_function('sum', 'add', np.sum)

/home/joris/scipy/pandas/pandas/core/groupby.pyc in size(self)
   1380         """
   1381         ids, _, ngroup = self.group_info
-> 1382         out = np.bincount(ids[ids != -1], minlength=ngroup)
   1383         return Series(out, index=self.result_index)
   1384 

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

In [4]: pd.__version__
Out[4]: '0.17.0rc1+108.g3fb802a'
@jorisvandenbossche jorisvandenbossche added this to the 0.17.0 milestone Sep 25, 2015
@jreback
Contributor

jreback commented Sep 25, 2015

cc @behzadnouri

@jorisvandenbossche
Member Author

Hmm, although maybe this is more an issue with my numpy installation?
I don't really see a reason why np.bincount would fail when given int64 values (the docs say it needs integers and not floats, but do not say anything about integer size).

@jorisvandenbossche
Member Author

Possibly an issue with bincount itself: numpy/numpy#4366
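
The error message above suggests that np.bincount converts its input to the platform integer type (np.intp) under the 'safe' casting rule. A minimal sketch, assuming only NumPy, that checks whether that cast is permitted on the current build (the variable names are illustrative):

```python
import numpy as np

# np.intp is the platform index type: int32 on 32-bit builds, int64 on 64-bit builds.
platform_int = np.dtype(np.intp)

# "safe" casting forbids narrowing, so int64 -> int32 is rejected on 32-bit builds.
int64_is_safe = np.can_cast(np.int64, np.intp, casting="safe")

print(platform_int, int64_is_safe)
```

If this reports that the cast is not safe, int64 ids would be expected to trigger the TypeError shown in the traceback.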

@behzadnouri
Contributor

Since the ids are compressed, they cannot exceed the memory address space, so I think it should be safe to cast on the pandas side, i.e.: ids[ids != -1].astype(np.intp).

@behzadnouri
Contributor

with copy=False arg: ids[ids != -1].astype(np.intp, copy=False)
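
A minimal sketch of the cast suggested above, run outside pandas. The `ids` array and `ngroups` value are made-up stand-ins for what `group_info` would provide; they mirror the grades from the original report plus one missing entry:

```python
import numpy as np

# Hypothetical compressed group ids stored as int64; -1 marks a row with no group.
ids = np.array([0, 1, 1, 0, 0, 2, -1], dtype=np.int64)
ngroups = 3

# Drop the -1 entries, then cast to the platform integer.
# copy=False avoids a copy when the dtype already matches (64-bit builds).
valid = ids[ids != -1].astype(np.intp, copy=False)

# bincount now receives intp input on every platform.
counts = np.bincount(valid, minlength=ngroups)
print(counts)
```

Because the ids are small non-negative group codes, narrowing them to a 32-bit integer does not lose information.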

@jreback
Contributor

jreback commented Sep 25, 2015

@behzadnouri
Contributor

In both of them you added the int64, not me. xref #10988

def group_info(self) does indeed need to return int64_t values, but def nunique does not. In either case, the right way to do it would be to leave the internal computations alone and only change the output, as in:

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index e72f7c6..06b1105 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1808,15 +1808,17 @@ class BinGrouper(BaseGrouper):
     @cache_readonly
     def group_info(self):
         ngroups = self.ngroups
-        obs_group_ids = np.arange(ngroups, dtype='int64')
+        obs_group_ids = np.arange(ngroups)
         rep = np.diff(np.r_[0, self.bins])

         if ngroups == len(self.bins):
-            comp_ids = np.repeat(np.arange(ngroups, dtype='int64'), rep)
+            comp_ids = np.repeat(np.arange(ngroups), rep)
         else:
-            comp_ids = np.repeat(np.r_[-1, np.arange(ngroups, dtype='int64')], rep)
+            comp_ids = np.repeat(np.r_[-1, np.arange(ngroups)], rep)

-        return comp_ids, obs_group_ids, ngroups
+        return comp_ids.astype('int64', copy=False),  \
+               obs_group_ids.astype('int64', copy=False),  \
+               ngroups

     @cache_readonly
     def ngroups(self):
@@ -2565,8 +2567,8 @@ class SeriesGroupBy(GroupBy):

         # group boundries are where group ids change
         # unique observations are where sorted values change
-        idx = com._ensure_int64(np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]])
-        inc = com._ensure_int64(np.r_[1, val[1:] != val[:-1]])
+        idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]
+        inc = np.r_[1, val[1:] != val[:-1]]

         # 1st item of each group is a new unique observation
         mask = isnull(val)
@@ -2577,7 +2579,7 @@ class SeriesGroupBy(GroupBy):
             inc[mask & np.r_[False, mask[:-1]]] = 0
             inc[idx] = 1

-        out = np.add.reduceat(inc, idx)
+        out = np.add.reduceat(inc, idx).astype('int64', copy=False)
         return Series(out if ids[0] != -1 else out[1:],
                       index=self.grouper.result_index,
                       name=self.name)
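
The pattern in the diff (compute with NumPy's default integer type, then normalize only the returned array) can be sketched in isolation. The `bins` values here are hypothetical and stand in for `self.bins`; this is an illustration of the idea, not the pandas implementation:

```python
import numpy as np

# Hypothetical bin edges, standing in for self.bins in the diff above.
bins = np.array([2, 3, 6])
ngroups = len(bins)

# Internal computation uses NumPy's default integer type.
rep = np.diff(np.r_[0, bins])
comp_ids = np.repeat(np.arange(ngroups), rep)

# Only the returned array is normalized to int64; copy=False skips the copy
# when the array is already int64.
comp_ids64 = comp_ids.astype("int64", copy=False)
print(comp_ids64.dtype, comp_ids64)
```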

@jreback
Contributor

jreback commented Sep 25, 2015

I changed them because they didn't work on Windows. I will try your diff and let you know.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Sep 25, 2015
jreback added a commit to jreback/pandas that referenced this issue Sep 25, 2015
jreback added a commit that referenced this issue Sep 25, 2015
COMPAT: platform_int fixes in groupby ops, #11189
@BuhtigithuB

Hello,

I have an issue with minlength which seems to be fixed by this resolved issue... Are we going to have a pandas 0.17.1 version tag soon? Without it, moving to 0.17.x is really unreliable, even though 0.17.x seems to bring many benefits apart from this ".size() minlength" issue...

And if not, is using master considered "safe" (read: as stable as an official tagged commit)?

Thanks

@BuhtigithuB

Sorry, I think I am wrong... This bug fix seems to be included in 0.17.0

But I have this error : <type 'exceptions.ValueError'> minlength must be positive

File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 820, in size
return self.grouper.size()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1383, in size
out = np.bincount(ids[ids != -1], minlength=ngroup)
ValueError: minlength must be positive

Which seems related...

I just do :

grouped.size()

Which was working just fine prior to upgrading from 0.15.0 to 0.17.0

Could it be related?

@BuhtigithuB

Downgrading to 0.16.2 solved my issue...

:(

@jreback
Contributor

jreback commented Oct 30, 2015

you would have to show a copy-pastable example along with pd.show_versions()
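
As a rough sketch of what such a report could look like, the snippet below reuses the small frame from the top of this issue as dummy data; a real report would substitute data that actually triggers the error:

```python
import pandas as pd

# Small, self-contained frame (taken from the original report in this issue).
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "grade": ['a', 'b', 'b', 'a', 'a', 'e']})

# The call under discussion.
sizes = df.groupby("grade").size()
print(sizes)

# Prints pandas, NumPy, Python and platform details for the report.
pd.show_versions()
```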

@BuhtigithuB

I will try to reproduce with dummy data, because I can't include the actual data... Give me a couple of days.

Thanks

4 participants