unique() needlessly slow #11136
Comments
Note that your approach also only works for types where equality ⇔ binary equality, which is not true for floating-point numbers.
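To see why, a small illustration (mine, not from the thread): with IEEE-754 floats, `0.0 == -0.0` even though the two values have different bit patterns, so element-wise comparison and byte-wise deduplication disagree.

```python
import numpy as np

a = np.array([[0.0], [-0.0]])

# Element-wise, the two rows compare equal ...
print(np.array_equal(a[0], a[1]))   # True: 0.0 == -0.0

# ... but their bytes differ (the sign bit), so the void-view trick
# keeps both rows as "unique".
v = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))[:, 0]
print(len(np.unique(v)))            # 2
```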
Very interesting, this indeed fails in various circumstances. Still, the 3x speed-up perhaps makes it worthwhile to think about carefully reintroducing the optimization.
Also found this somewhat surprising. FWIW scikit-image uses a different trick, which is to cast the data to byte strings. Interestingly this ends up being a little faster than casting to void. cc @jni
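For context, one way to spell that kind of byte-based row deduplication in plain Python/NumPy (a sketch of the idea only, not scikit-image's actual code; the function name is mine):

```python
import numpy as np

def unique_rows_via_bytes(a):
    # Sketch: deduplicate rows by their raw byte representation.
    # Assumes a 2D array and that bit-level equality is acceptable
    # (see the floating-point caveat above).
    a = np.ascontiguousarray(a)
    seen = {}
    for row in a:
        seen.setdefault(row.tobytes(), row)   # keep first occurrence, preserves order
    return np.array(list(seen.values()), dtype=a.dtype)

# Example:
# unique_rows_via_bytes(np.array([[1, 2], [1, 2], [3, 4]]))  ->  [[1, 2], [3, 4]]
```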
Why is it that …
Here are three more ways. Interestingly, compared to …

Case 1: Lots of duplicates

BTW, I didn't know about perfplot.

```python
perfplot.show(
    setup=lambda n: np.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,  # the alt methods return results in the original order (not sorted)
)
```

Case 2: No duplicates

```python
perfplot.show(
    setup=lambda n: np.stack((np.arange(n), np.arange(n)), axis=1),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
)
```

Implementation

```python
import numpy as np
import pandas as pd


def unique_rows(a):
    # np.unique() is slow, in part because it sorts;
    # pd.unique() is much faster, but only handles 1D input.
    # Inspired by https://github.com/numpy/numpy/issues/11136#issue-325345618:
    # create a 1D view where each element is a byte-encoding of a row, run
    # pd.unique() on that view, then reconstruct the original dtype and shape.
    if a.ndim != 2:
        raise ValueError(f'bad array dimension {a.ndim}; should be 2')
    b = np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    )[:, 0]
    return pd.unique(b).view(a.dtype).reshape(-1, a.shape[1])


def unique_via_pd_drop_duplicates(a):
    return pd.DataFrame(a).drop_duplicates().values


from pandas._libs.hashtable import SIZE_HINT_LIMIT, duplicated_int64
from pandas.core.sorting import get_group_index
from pandas.core import algorithms


def duplicated_rows(a, keep='first'):
    size_hint = min(a.shape[0], SIZE_HINT_LIMIT)

    def f(vals):
        labels, shape = algorithms.factorize(vals, size_hint=size_hint)
        return labels.astype("i8", copy=False), len(shape)

    vals = (col for col in a.T)
    labels, shape = map(list, zip(*map(f, vals)))
    ids = get_group_index(labels, shape, sort=False, xnull=False)
    return duplicated_int64(ids, keep)


def unique_rows_via_pd_internals(a):
    return a[~duplicated_rows(a)]
```
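A quick usage sketch for the first helper above (toy data of my own):

```python
import numpy as np

a = np.array([[1, 2],
              [1, 2],
              [3, 4]])

# unique_rows() from the snippet above; rows come back in first-seen order
# because pd.unique() preserves order of appearance.
print(unique_rows(a))
# [[1 2]
#  [3 4]]
```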
Collapsing the RFI masks to unique channel masks used np.unique, which, possibly due to numpy/numpy#11136, is very slow (around 15s in the worst case for MeerKAT). The scikit-image version from that ticket is about 500x faster. Closes SPR1-1046.
Just ended up here while investigating scikit-learn/scikit-learn#26808. Interesting thread, but it might be worth optimizing …
You probably need to implement a hash-based solution, maybe as a family of functions, and ideally put them somewhere so that they are available for extending. Or maybe as a new hash-table/set data structure (with ufunc-like, C-implemented methods defined by the DTypes). It could probably also be done via a …

I don't mind if NumPy grows this. I think it was often brushed off with "pandas does groupby better anyway", because hashing is a bit more common as a groupby. It should be thought through a bit (with a proposal), I think. But overall it's not like hashing methods are particularly complicated.

Sorting should be optional with a hash-based implementation, although it really is an insignificant step in many cases, because it only matters if you have mostly unique values.
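As a rough illustration of what a hash-based path can look like (purely a Python-level sketch of the general technique, not NumPy's internals and not the proposal itself; names are mine), unique rows and their counts can be collected in one pass over row bytes instead of sorting:

```python
import numpy as np

def unique_rows_hashed(a, return_counts=False):
    """Sketch: hash-based unique over rows of a 2D array, in first-seen order.

    Assumes exact bit-level equality is acceptable (see the float caveat above).
    """
    a = np.ascontiguousarray(a)
    table = {}                       # row bytes -> [index of first occurrence, count]
    for i, row in enumerate(a):
        key = row.tobytes()
        if key in table:
            table[key][1] += 1
        else:
            table[key] = [i, 1]
    idx = np.fromiter((v[0] for v in table.values()), dtype=np.intp)
    uniq = a[idx]
    if return_counts:
        counts = np.fromiter((v[1] for v in table.values()), dtype=np.intp)
        return uniq, counts
    return uniq
```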
I'll see what I can come up with (looking at pandas' implementation) and open a proposal PR.
Something is very wrong if

```python
from collections import Counter

Counter(map(tuple, numpy_array))
```

is over 2x faster than:

```python
np.unique(numpy_array, axis=0, return_counts=True)
```

Tested on arrays of roughly 150K rows, having many duplicates. (Even naive Python-only for loops that increment a dict entry for each row were faster than numpy by a significant margin!) It would be super great if we could specify a hash-table option here... clearly, sorting is terrible for the many-duplicates case.
UPDATE: after tinkering around, I finally came up with a numpy solution that runs a whopping 25X faster than the built-in:

```python
import math

import numpy as np


def clever_find_unique_counts(numpy_array):
    next_shift = 0
    result = np.zeros(shape=(1,), dtype=np.uint16)
    col_data = []
    for col in np.transpose(numpy_array):
        # Per-column lookup table of the values that actually occur.
        col_lookup = np.nonzero(np.bincount(col))[0]
        col_nbits = math.ceil(math.log2(len(col_lookup)))
        col_data.append((col_nbits, col.dtype, col_lookup))
        if col_nbits:
            # Pack each column's value index into its own bit field of the code.
            result = (np.searchsorted(col_lookup, col) << next_shift).astype(np.uint16) | result
            next_shift += col_nbits
            if next_shift > 16:
                raise NotImplementedError
    counts = np.bincount(result)
    values_set = np.nonzero(counts)[0]
    final_counts = counts[values_set]
    restored_cols = []
    for col_nbits, col_dtype, col_lookup in col_data:
        # Unpack each bit field back into the original column values.
        restored_cols.append(col_lookup[values_set & ((1 << col_nbits) - 1)].astype(col_dtype))
        values_set >>= col_nbits
    return np.transpose(restored_cols), final_counts
```

Although this version works perfectly for my use case, it has one serious limitation: the running product of the total number of unique values in each column must be less than 65536 (and it will only work for integer types). But adding it here anyway in case it's helpful, to demonstrate the kind of performance we should be getting out of the box with a hash-based solution.

EDIT: correction: more than 50X faster, since I just realized I was comparing it to the …
np.unique has the axis option that allows, for example, to call unique on the rows of a matrix. I noticed however that it's quite slow. Creating a view of the data and calling unique on the view is faster by a factor of 3.

MVCE:
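A sketch of the comparison being described (my reconstruction, not necessarily the original MVCE), assuming a 2D integer array:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(100_000, 2))

# Baseline: unique rows via the axis keyword (sort-based).
u1 = np.unique(a, axis=0)

# View each row as a single opaque element, run 1D unique, then view back.
b = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
)[:, 0]
u2 = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

# Same set of rows either way for integer data; the issue reports the
# view-based variant being roughly 3x faster.
assert {tuple(r) for r in u1} == {tuple(r) for r in u2}
```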