
unique() needlessly slow #11136


Open
nschloe opened this issue May 22, 2018 · 11 comments

@nschloe
Contributor

nschloe commented May 22, 2018

np.unique has the axis option that allows, for example, calling unique on the rows of a matrix. I noticed, however, that it's quite slow. Creating a view of the data and calling unique on the view is faster by a factor of 3.

MVCE:

import numpy
import perfplot


def unique_axis(data):
    return numpy.unique(data, axis=0)


def unique_row_view(data):
    # View each row as a single opaque `void` element so that np.unique
    # compares whole rows (by their raw bytes) as single items.
    b = numpy.ascontiguousarray(data).view(
        numpy.dtype((numpy.void, data.dtype.itemsize * data.shape[1]))
    )
    u = numpy.unique(b).view(data.dtype).reshape(-1, data.shape[1])
    return u


def unique_scikit(ar):
    if ar.ndim != 2:
        raise ValueError("unique_rows() only makes sense for 2D arrays, "
                         "got %dd" % ar.ndim)
    # the view in the next line only works if the array is C-contiguous
    ar = numpy.ascontiguousarray(ar)
    # np.unique() finds identical items in a raveled array. To make it
    # see each row as a single item, we create a view of each row as a
    # byte string of length itemsize times number of columns in `ar`
    ar_row_view = ar.view('|S%d' % (ar.itemsize * ar.shape[1]))
    _, unique_row_indices = numpy.unique(ar_row_view, return_index=True)
    ar_out = ar[unique_row_indices]
    return ar_out


perfplot.save(
    "unique.png",
    setup=lambda n: numpy.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_scikit],
    n_range=[2 ** k for k in range(20)],
)

[perfplot benchmark: unique.png]

@eric-wieser
Member

eric-wieser commented May 22, 2018

There used to be an optimization in place for this for integer inputs, but it was removed (#10608) because the sorting of the final array ended up being dependent on the system endianness (#10495).

@eric-wieser
Member

Note that your approach also only works for types where equality ⇔ binary equality, which is not true for floating point numbers.
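A minimal sketch of where that bites (my own illustration, not from the comment above): 0.0 and -0.0 compare equal as floats but have different byte patterns, so the byte-view trick can disagree with np.unique(..., axis=0):

import numpy as np

# 0.0 and -0.0 compare equal as floats but have different byte patterns,
# so the byte-view trick sees two distinct rows where axis-based unique sees one.
data = np.array([[0.0, 1.0],
                 [-0.0, 1.0]])

row_view = np.ascontiguousarray(data).view(
    np.dtype((np.void, data.dtype.itemsize * data.shape[1]))
)
print(len(np.unique(row_view)))      # 2: raw bytes differ
print(len(np.unique(data, axis=0)))  # 1: values compare equal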

@nschloe
Contributor Author

nschloe commented May 23, 2018

Very interesting, this indeed fails in various circumstances. Still, the 3x speed-up perhaps makes it worthwhile to think about carefully reintroducing the optimization.

@jakirkham
Contributor

Also found this somewhat surprising. FWIW scikit-image uses a different trick, which is to cast the data to byte strings. Interestingly this ends up being a little faster than casting to void.

cc @jni

@pdemarti

pdemarti commented Apr 18, 2021

Why is it that np.unique() uses sorting? Hashing (which pd.unique() uses, but which unfortunately only works on 1D data) would be faster by a factor of O(log n). There could be a sort option (True by default, for backward compatibility; False for speed).
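For the 1D case, the hash-versus-sort difference is easy to try directly. A rough sketch (timings vary with data, machine, and library versions); pd.unique hashes and returns values in first-appearance order, while np.unique sorts:

import numpy as np
import pandas as pd
from timeit import timeit

x = np.random.randint(0, 100, 10_000_000)

# np.unique sorts (O(n log n)); pd.unique uses a hash table (roughly O(n))
# and returns values in order of first appearance rather than sorted.
print("np.unique:", timeit(lambda: np.unique(x), number=10))
print("pd.unique:", timeit(lambda: pd.unique(x), number=10))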

@pdemarti

Here are three more ways:

  • unique_rows creates a 1D view (like unique_row_view), then uses pd.unique on it.
  • unique_via_pd_drop_duplicates simply uses pd.drop_duplicates.
  • unique_rows_via_pd_internals uses some internals of pd.drop_duplicates, in order to avoid creating a DataFrame and other unnecessary operations.

Interestingly, compared to unique_row_view, these methods are only faster for large arrays. In the case of unique_rows, the relative speedup vanishes when the result is large, i.e. when there are not many duplicates.

Case 1: Lots of duplicates

BTW, I didn't know about perfplot. It is awesome!

perfplot.show(
    setup=lambda n: np.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,  # the alt methods return results in the original order (not sorted)
)

[perfplot benchmark: lots of duplicates]

Case 2: No duplicates

perfplot.show(
    setup=lambda n: np.stack((np.arange(n), np.arange(n)), axis=1),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
)

[perfplot benchmark: no duplicates]

Implementation

import numpy as np
import pandas as pd

def unique_rows(a):
    # np.unique() is slow, in part because it sorts;
    # pd.unique() is much faster, but only 1D
    # This is inspired by https://github.com/numpy/numpy/issues/11136#issue-325345618
    # It creates a 1D view where each element is a byte-encoding of a row, then uses
    # pd.unique(), and then reconstructs the original dtype and shape.
    if a.ndim != 2:
        raise ValueError(f'bad array dimension {a.ndim}; should be 2')
    b = np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    )[:, 0]
    return pd.unique(b).view(a.dtype).reshape(-1, a.shape[1])

def unique_via_pd_drop_duplicates(a):
    return pd.DataFrame(a).drop_duplicates().values

from pandas._libs.hashtable import SIZE_HINT_LIMIT, duplicated_int64
from pandas.core.sorting import get_group_index
from pandas.core import algorithms

def duplicated_rows(a, keep='first'):
    # Factorize each column into integer labels, combine the per-column labels
    # into a single group index, then flag rows whose group id is a duplicate.
    size_hint = min(a.shape[0], SIZE_HINT_LIMIT)
    def f(vals):
        labels, shape = algorithms.factorize(vals, size_hint=size_hint)
        return labels.astype("i8", copy=False), len(shape)

    vals = (col for col in a.T)
    labels, shape = map(list, zip(*map(f, vals)))

    ids = get_group_index(labels, shape, sort=False, xnull=False)
    return duplicated_int64(ids, keep)

def unique_rows_via_pd_internals(a):
    return a[~duplicated_rows(a)]
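
A quick consistency check (my own addition, assuming the functions above are defined and the pandas internals they import are still available in your pandas version): all of the variants should return the same set of rows, just not necessarily in sorted order:

import numpy as np

a = np.random.randint(0, 5, (1000, 2))
expected = set(map(tuple, np.unique(a, axis=0)))
assert set(map(tuple, unique_rows(a))) == expected
assert set(map(tuple, unique_via_pd_drop_duplicates(a))) == expected
assert set(map(tuple, unique_rows_via_pd_internals(a))) == expected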

bmerry added a commit to ska-sa/katsdpingest that referenced this issue Jun 2, 2021
Collapsing the RFI masks to unique channel masks used np.unique, which, possibly due to numpy/numpy#11136, is very slow (around 15s in the worst case for MeerKAT). The scikit-image version from that ticket is about 500x faster.

Closes SPR1-1046.
@adrinjalali
Contributor

Just ended up here investigating scikit-learn/scikit-learn#26808. Interesting thread, but it might be worth optimizing unique a bit.

@seberg
Member

seberg commented Jul 11, 2023

You probably need to implement a hash-based solution, maybe as a family of functions, and ideally put them somewhere so that they are available for extension. Or maybe as a new hash-table/set data structure (with ufunc-like, C-implemented methods defined by the DTypes).

It could probably also be done via a hash and a not_greater_less/not_different (equality, but NaN-aware) ufunc-like. Although making things perfectly fast that way is likely hard in practice.

I don't mind if NumPy grows this. I think it was often brushed off with "pandas does groupby better anyway", because hashing comes up more commonly in the context of groupby. It should be thought through a bit (with a proposal), I think. But overall it's not like hashing methods are particularly complicated.

Sorting should be optional with a hash-based implementation, although it is really an insignificant step in many cases, because it only matters if you have mostly unique values.
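
As a rough pure-Python sketch of what a hash-based row unique could look like (first-appearance order, no sorting; it hashes raw row bytes, so it shares the binary-equality caveat discussed above, and a real implementation would live in C):

import numpy as np

def unique_rows_hash(a):
    # Illustration only: hash the raw bytes of each row and keep the first
    # occurrence, so no sort is needed and the result is in first-appearance
    # order. The pure-Python loop makes this slow; the point is the approach.
    a = np.ascontiguousarray(a)
    seen = {}
    for i, row in enumerate(a):
        key = row.tobytes()
        if key not in seen:
            seen[key] = i
    return a[np.fromiter(seen.values(), dtype=np.intp)]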

@adrinjalali
Contributor

I'll see what I can come up with (looking at pandas' implementation) and open a proposal PR.

@HansBrende

Something is very wrong if

from collections import Counter

Counter(map(tuple, numpy_array))

is over 2x faster than:

np.unique(numpy_array, axis=0, return_counts=True)

Tested on arrays of roughly 150K rows with many duplicates. (Even a naive Python-only for loop that increments a dict entry for each row was faster than numpy by a significant margin!) It would be super great if we could specify a hash-table option here... clearly, sorting is terrible for the many-duplicates case.
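
A small script along these lines (a rough sketch; the exact ratio depends on the data, the number of duplicates, and library versions):

import numpy as np
from collections import Counter
from timeit import timeit

# roughly the shape described above: ~150K rows with many duplicates
numpy_array = np.random.randint(0, 50, (150_000, 3))

t_counter = timeit(lambda: Counter(map(tuple, numpy_array)), number=5)
t_unique = timeit(lambda: np.unique(numpy_array, axis=0, return_counts=True), number=5)
print(f"Counter:   {t_counter:.3f}s")
print(f"np.unique: {t_unique:.3f}s")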

@HansBrende
Copy link

HansBrende commented Oct 9, 2024

UPDATE: after tinkering around, I finally came up with a NumPy solution that runs a whopping 25x faster than the built-in np.unique(numpy_array, axis=0, return_counts=True).

import math
import numpy as np

def clever_find_unique_counts(numpy_array):
    # Bit-pack the per-column value indices of each row into a single uint16
    # code, then count the codes with np.bincount (no sorting needed).
    # Requires non-negative integer columns and at most 16 bits in total.
    next_shift = 0
    result = np.zeros(shape=(1,), dtype=np.uint16)
    col_data = []
    for col in np.transpose(numpy_array):
        col_lookup = np.nonzero(np.bincount(col))[0]
        col_nbits = math.ceil(math.log2(len(col_lookup)))
        col_data.append((col_nbits, col.dtype, col_lookup))
        if col_nbits:
            result = (np.searchsorted(col_lookup, col) << next_shift).astype(np.uint16) | result
            next_shift += col_nbits
            if next_shift > 16:
                raise NotImplementedError
    counts = np.bincount(result)
    values_set = np.nonzero(counts)[0]
    final_counts = counts[values_set]
    restored_cols = []
    for col_nbits, col_dtype, col_lookup in col_data:
        restored_cols.append(col_lookup[values_set & ((1 << col_nbits) - 1)].astype(col_dtype))
        values_set >>= col_nbits
    return np.transpose(restored_cols), final_counts

Although this version works perfectly for my use case, it has one serious limitation: the running product of the number of unique values in each column (each rounded up to a power of two) must be at most 65536, and it will only work for non-negative integer types.

But adding here anyways in case it's helpful, to demonstrate the kind of performance we should be getting out of the box with a hash-based solution.

EDIT: correction: it is more than 50x faster, since I just realized I was comparing it to the Counter approach I mentioned above (which was already more than 2x faster than the built-in functionality).
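
For context, a small usage example of the function above, under its stated constraints (non-negative integers, few unique values per column; assumes the numpy/math imports above):

arr = np.array([[1, 2],
                [1, 2],
                [3, 2],
                [1, 4]])
values, counts = clever_find_unique_counts(arr)
print(values)  # [[1 2] [3 2] [1 4]] (order follows the packed codes)
print(counts)  # [2 1 1]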
