unique() needlessly slow #11136
Comments
Note that your approach also only works for types where equality ⇔ binary equality, which is not true for floating-point numbers.
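To see why, a small illustration (mine, not from the thread): with IEEE-754 floats, `0.0 == -0.0` even though the two values have different bit patterns, so element-wise comparison and byte-wise deduplication disagree.

```python
import numpy as np

a = np.array([[0.0], [-0.0]])

# Element-wise, the two rows compare equal ...
print(np.array_equal(a[0], a[1]))   # True: 0.0 == -0.0

# ... but their bytes differ (the sign bit), so the void-view trick
# keeps both rows as "unique".
v = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))[:, 0]
print(len(np.unique(v)))            # 2
```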
Very interesting, this indeed fails in various circumstances. Still, the 3x speed-up perhaps makes it worthwhile to think about carefully reintroducing the optimization.
Also found this somewhat surprising. FWIW scikit-image uses a different trick, which is to cast the data to byte strings. Interestingly this ends up being a little faster than casting to void. cc @jni
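For context, one way to spell that kind of byte-based row deduplication in plain Python/NumPy (a sketch of the idea only, not scikit-image's actual code; the function name is mine):

```python
import numpy as np

def unique_rows_via_bytes(a):
    # Sketch: deduplicate rows by their raw byte representation.
    # Assumes a 2D array and that bit-level equality is acceptable
    # (see the floating-point caveat above).
    a = np.ascontiguousarray(a)
    seen = {}
    for row in a:
        seen.setdefault(row.tobytes(), row)   # keep first occurrence, preserves order
    return np.array(list(seen.values()), dtype=a.dtype)

# Example:
# unique_rows_via_bytes(np.array([[1, 2], [1, 2], [3, 4]]))  ->  [[1, 2], [3, 4]]
```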
Why is it that …
Here are three more ways. Interestingly, compared to …

Case 1: Lots of duplicates

BTW, I didn't know about perfplot.

```python
perfplot.show(
    setup=lambda n: np.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,  # the alt methods return results in the original order (not sorted)
)
```

Case 2: No duplicates

```python
perfplot.show(
    setup=lambda n: np.stack((np.arange(n), np.arange(n)), axis=1),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
)
```

Implementation

```python
import numpy as np
import pandas as pd


def unique_rows(a):
    # np.unique() is slow, in part because it sorts;
    # pd.unique() is much faster, but only handles 1D input.
    # Inspired by https://github.com/numpy/numpy/issues/11136#issue-325345618:
    # create a 1D view where each element is a byte-encoding of a row, run
    # pd.unique() on that view, then reconstruct the original dtype and shape.
    if a.ndim != 2:
        raise ValueError(f'bad array dimension {a.ndim}; should be 2')
    b = np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    )[:, 0]
    return pd.unique(b).view(a.dtype).reshape(-1, a.shape[1])


def unique_via_pd_drop_duplicates(a):
    return pd.DataFrame(a).drop_duplicates().values


from pandas._libs.hashtable import SIZE_HINT_LIMIT, duplicated_int64
from pandas.core.sorting import get_group_index
from pandas.core import algorithms


def duplicated_rows(a, keep='first'):
    size_hint = min(a.shape[0], SIZE_HINT_LIMIT)

    def f(vals):
        labels, shape = algorithms.factorize(vals, size_hint=size_hint)
        return labels.astype("i8", copy=False), len(shape)

    vals = (col for col in a.T)
    labels, shape = map(list, zip(*map(f, vals)))
    ids = get_group_index(labels, shape, sort=False, xnull=False)
    return duplicated_int64(ids, keep)


def unique_rows_via_pd_internals(a):
    return a[~duplicated_rows(a)]
```
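A quick usage sketch for the first helper above (toy data of my own):

```python
import numpy as np

a = np.array([[1, 2],
              [1, 2],
              [3, 4]])

# unique_rows() from the snippet above; rows come back in first-seen order
# because pd.unique() preserves order of appearance.
print(unique_rows(a))
# [[1 2]
#  [3 4]]
```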
Collapsing the RFI masks to unique channel masks used np.unique, which, possibly due to numpy/numpy#11136, is very slow (around 15s in the worst case for MeerKAT). The scikit-image version from that ticket is about 500x faster. Closes SPR1-1046.
Just ended up here while investigating scikit-learn/scikit-learn#26808. Interesting thread, but it might be worth optimizing …
You probably need to implement a hash-based solution, maybe as a family of functions, and ideally put them somewhere so that they are available for extending. Or maybe as a new hash-table/set data structure (with ufunc-like, C-implemented methods defined by the DTypes). It could probably also be done via a …

I don't mind if NumPy grows this. I think it was often brushed off with "pandas does groupby better anyway", because hashing is a bit more common as a groupby. It should be thought through a bit (with a proposal), I think. But overall it's not like hashing methods are particularly complicated.

Sorting should be optional with a hash-based implementation, although it really is an insignificant step in many cases, because it only matters if you have mostly unique values.
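As a rough illustration of what a hash-based path can look like (purely a Python-level sketch of the general technique, not NumPy's internals and not the proposal itself; names are mine), unique rows and their counts can be collected in one pass over row bytes instead of sorting:

```python
import numpy as np

def unique_rows_hashed(a, return_counts=False):
    """Sketch: hash-based unique over rows of a 2D array, in first-seen order.

    Assumes exact bit-level equality is acceptable (see the float caveat above).
    """
    a = np.ascontiguousarray(a)
    table = {}                       # row bytes -> [index of first occurrence, count]
    for i, row in enumerate(a):
        key = row.tobytes()
        if key in table:
            table[key][1] += 1
        else:
            table[key] = [i, 1]
    idx = np.fromiter((v[0] for v in table.values()), dtype=np.intp)
    uniq = a[idx]
    if return_counts:
        counts = np.fromiter((v[1] for v in table.values()), dtype=np.intp)
        return uniq, counts
    return uniq
```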
I'll see what I can come up with (looking at pandas' implementation) and open a proposal PR.
Something is very wrong if

```python
from collections import Counter

Counter(map(tuple, numpy_array))
```

is over 2x faster than:

```python
np.unique(numpy_array, axis=0, return_counts=True)
```

Tested on arrays of roughly 150K rows, having many duplicates. (Even naive Python-only for loops that increment a dict entry for each row were faster than numpy by a significant margin!) It would be super great if we could specify a hash-table option here... clearly, sorting is terrible for the many-duplicates case.
UPDATE: after tinkering around, I finally came up with a numpy solution that runs a whopping 25X faster than the built-in:

```python
import math

import numpy as np


def clever_find_unique_counts(numpy_array):
    next_shift = 0
    result = np.zeros(shape=(1,), dtype=np.uint16)
    col_data = []
    for col in np.transpose(numpy_array):
        # Per-column lookup table of the values that actually occur.
        col_lookup = np.nonzero(np.bincount(col))[0]
        col_nbits = math.ceil(math.log2(len(col_lookup)))
        col_data.append((col_nbits, col.dtype, col_lookup))
        if col_nbits:
            # Pack each column's value index into its own bit field of the code.
            result = (np.searchsorted(col_lookup, col) << next_shift).astype(np.uint16) | result
            next_shift += col_nbits
            if next_shift > 16:
                raise NotImplementedError
    counts = np.bincount(result)
    values_set = np.nonzero(counts)[0]
    final_counts = counts[values_set]
    restored_cols = []
    for col_nbits, col_dtype, col_lookup in col_data:
        # Unpack each bit field back into the original column values.
        restored_cols.append(col_lookup[values_set & ((1 << col_nbits) - 1)].astype(col_dtype))
        values_set >>= col_nbits
    return np.transpose(restored_cols), final_counts
```

Although this version works perfectly for my use case, it has one serious limitation: the running product of the total number of unique values in each column must be less than 65536 (and it will only work for integer types). But adding it here anyway in case it's helpful, to demonstrate the kind of performance we should be getting out of the box with a hash-based solution.

EDIT: correction: more than 50X faster, since I just realized I was comparing it to the …
np.unique has the axis option that allows, for example, to call unique on the rows of a matrix. I noticed however that it's quite slow. Creating a view of the data and calling unique on the view is faster by a factor of 3.

MVCE:
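A sketch of the comparison being described (my reconstruction, not necessarily the original MVCE), assuming a 2D integer array:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(100_000, 2))

# Baseline: unique rows via the axis keyword (sort-based).
u1 = np.unique(a, axis=0)

# View each row as a single opaque element, run 1D unique, then view back.
b = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
)[:, 0]
u2 = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

# Same set of rows either way for integer data; the issue reports the
# view-based variant being roughly 3x faster.
assert {tuple(r) for r in u1} == {tuple(r) for r in u2}
```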