ENH: np.unique: support hash based unique for float and complex dtype #29537

Open
wants to merge 18 commits into
base: main

math-hiyoko
Contributor

@math-hiyoko math-hiyoko commented Aug 11, 2025

Description

This PR introduces hash-based uniqueness extraction support for float and complex types in NumPy's np.unique function.
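As a rough illustration of the difference between the two strategies, here is a minimal pure-Python sketch. The function names `unique_sorted` and `unique_hashed` are hypothetical and are not part of NumPy; the actual implementation lives in C++ and may order its output differently from this sketch.

```python
def unique_sorted(values):
    """Sort-based unique: O(n log n) comparisons; output is ordered."""
    out = []
    for v in sorted(values):
        # After sorting, duplicates are adjacent, so compare with the last kept value.
        if not out or v != out[-1]:
            out.append(v)
    return out


def unique_hashed(values):
    """Hash-based unique: expected O(n); no sort is performed."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


data = [0.5, 2.0, 0.5, -1.0, 2.0]
print(unique_sorted(data))  # [-1.0, 0.5, 2.0]
print(unique_hashed(data))  # [0.5, 2.0, -1.0] (first-seen order in this sketch)
```

Both functions return the same set of values; only the order differs, which is why callers that do not need ordered output can skip the sort.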

Benchmark Results

The following benchmarks compare the new hash-based implementation against NumPy main. In the results below, the gain is modest for float64 (about 8% less wall time) and about 4x for complex128.

float

import random
import time

import numpy as np

arr = np.array(
    [
        random.random() for _ in range(1_000)
    ] * 5_000_000,
    dtype=np.float64,
)
np.random.shuffle(arr)

# Time a single np.unique call; sorted=False allows unsorted output.
time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr, sorted=False, equal_nan=False)))
time_elapsed = time.perf_counter() - time_start
print(f"{time_elapsed:5.3f} secs")

complex

import random
import time

import numpy as np

arr = np.array(
    [
        complex(random.random(), random.random()) for _ in range(1_000)
    ] * 1_000_000,
    dtype=np.complex128,
)
np.random.shuffle(arr)

# Time a single np.unique call; sorted=False allows unsorted output.
time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr, sorted=False, equal_nan=False)))
time_elapsed = time.perf_counter() - time_start
print(f"{time_elapsed:5.3f} secs")

Result

float

unique count (hash based):  1000
406.947 secs
unique count (numpy main):  1000
441.833 secs

complex

unique count (hash based):  1000
11.433 secs
unique count (numpy main):  1000
47.408 secs
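
For readers who want the ratios rather than raw timings, the speedups can be derived from the numbers above. This is only arithmetic on the reported figures; the `timings` dictionary and `speedup` helper are illustrative names, not part of the PR.

```python
# Wall-clock timings in seconds, copied from the results above.
timings = {
    "float64": {"hash": 406.947, "main": 441.833},
    "complex128": {"hash": 11.433, "main": 47.408},
}


def speedup(main_secs, hash_secs):
    """Ratio of baseline time to new time; values above 1 mean the new code is faster."""
    return main_secs / hash_secs


for dtype, t in timings.items():
    print(f"{dtype}: {speedup(t['main'], t['hash']):.2f}x")
# float64: 1.09x
# complex128: 4.15x
```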

close #28363

@math-hiyoko math-hiyoko marked this pull request as draft August 11, 2025 04:25
@math-hiyoko math-hiyoko changed the title from "Feature/#28363" to "ENH: np.unique: support hash based unique for float and complex dtype" Aug 11, 2025
@ngoldbaum
Member

Very cool! When you finish this up for review it'd be nice to see some before/after comparisons on some benchmarks.

@math-hiyoko math-hiyoko marked this pull request as ready for review August 15, 2025 13:43
@math-hiyoko
Contributor Author

This is now ready for review.

@ngoldbaum ngoldbaum added the 01 - Enhancement and triage review (Issue/PR to be discussed at the next triage meeting) labels Aug 18, 2025
@ngoldbaum
Member

I added the triage review label so hopefully someone at the next triage meeting on Wednesday will volunteer to look closer at this. Unfortunately I can't take that on this month - too many things going on!

@jorenham jorenham requested a review from Copilot August 20, 2025 18:05

@Copilot Copilot AI left a comment


Pull Request Overview

This PR extends NumPy's np.unique function to support hash-based unique value extraction for float and complex dtypes, providing significant performance improvements over the existing sort-based approach. The implementation leverages hash tables to efficiently identify unique values without requiring sorting.

Key changes include:

  • Implementation of hash-based unique extraction for float and complex numeric types
  • Addition of specialized hash and equality functions that handle NaN values appropriately
  • Updates to test cases to accommodate unsorted output from hash-based approach
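
To make the "specialized hash and equality functions" point concrete, here is a minimal pure-Python sketch of a NaN-aware hash-table key. The `FloatKey` class and `unique` helper are hypothetical illustrations, not the PR's C++ code; treating a complex value as NaN when either component is NaN mirrors the `lhs_isnan` check quoted in the review, but the rest is an assumption.

```python
import math


class FloatKey:
    """Hypothetical hash-table key wrapping a float or complex value."""

    __slots__ = ("value", "equal_nan")

    def __init__(self, value, equal_nan=True):
        self.value = complex(value)
        self.equal_nan = equal_nan

    def _isnan(self):
        # Assumption: a complex value counts as NaN if either part is NaN.
        return math.isnan(self.value.real) or math.isnan(self.value.imag)

    def __hash__(self):
        if self._isnan():
            return 0  # all NaNs share one hash so that __eq__ decides
        return hash(self.value)  # Python hashes 0.0 and -0.0 identically

    def __eq__(self, other):
        if self._isnan() or other._isnan():
            # NaNs match each other only when equal_nan is requested.
            return self.equal_nan and self._isnan() and other._isnan()
        return self.value == other.value


def unique(values, equal_nan=True):
    """Return one representative per distinct key, in first-seen order."""
    table = {}
    for v in values:
        table.setdefault(FloatKey(v, equal_nan), v)
    return list(table.values())


nan = float("nan")
vals = [1.0, nan, 1.0, nan, complex(nan, 2.0), -0.0, 0.0]
print(len(unique(vals, equal_nan=True)))   # 3
print(len(unique(vals, equal_nan=False)))  # 5
```

The key design point is that the hash function and the equality function must agree: any two values that compare equal have to land in the same bucket, which is why all NaNs are given a single hash here.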

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| numpy/_core/src/multiarray/unique.cpp | Core implementation adding hash-based unique support for float/complex types with NaN handling |
| numpy/lib/tests/test_arraysetops.py | Updated test assertions to handle unsorted output from hash-based unique operations |
| numpy/_core/meson.build | Added npymath include directory for math function access |
| doc/release/upcoming_changes/29537.performance.rst | Documentation of performance improvements |
| doc/release/upcoming_changes/29537.change.rst | Documentation of behavior change regarding sorted output |


int lhs_isnan = npy_isnan(lhs_real) || npy_isnan(lhs_imag);
S rhs_real = real(*rhs);
S rhs_imag = imag(*rhs);
int rhs_isnan = npy_isnan(rhs_real) || npy_isnan(rhs_imag);

Copilot AI Aug 20, 2025


Inconsistent use of npy_isnan vs npy_isnan_wrapper. This should use npy_isnan_wrapper for consistency with the template pattern used elsewhere.

Suggested change
int rhs_isnan = npy_isnan(rhs_real) || npy_isnan(rhs_imag);
int lhs_isnan = npy_isnan_wrapper<S>(lhs_real) || npy_isnan_wrapper<S>(lhs_imag);
S rhs_real = real(*rhs);
S rhs_imag = imag(*rhs);
int rhs_isnan = npy_isnan_wrapper<S>(rhs_real) || npy_isnan_wrapper<S>(rhs_imag);

Copilot uses AI. Check for mistakes.


@ngoldbaum
Member

I'm not sure what state the np.unique benchmarks in our benchmark suite are in, but I think it would help to add new benchmarks for this case.

It's probably worth benchmarking several sizes of inputs, not just really huge arrays.
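
A minimal sketch of what benchmarking across several input sizes could look like, using only the standard library. The `bench` helper and the set-based `unique_hashed` stand-in are hypothetical; a real benchmark would pass a call to `np.unique` instead and would more likely follow the conventions of the existing benchmark suite.

```python
import random
import timeit


def unique_hashed(values):
    """Stand-in for the function under test (hypothetical)."""
    return list(set(values))


def bench(unique_fn, sizes=(1_000, 10_000, 100_000), n_distinct=1_000, repeat=3):
    """Return the best-of-`repeat` wall time in seconds for each input size."""
    pool = [random.random() for _ in range(n_distinct)]
    results = {}
    for n in sizes:
        # Draw n values from a fixed pool so the number of distinct values is bounded.
        data = random.choices(pool, k=n)
        results[n] = min(timeit.repeat(lambda: unique_fn(data), number=1, repeat=repeat))
    return results


for n, secs in bench(unique_hashed).items():
    print(f"n={n:>7}: {secs:.6f} s")
```

Varying both the array size and `n_distinct` is worthwhile because hash-table cost depends on the number of distinct values as well as on the input length.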

@ngoldbaum
Member

@charris is curious how you're handling NaN. What happens if there's more than one distinct NaN value in the array?
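
For context on why this question matters: IEEE 754 doubles have many NaN bit patterns, so a hash computed over raw bits would see them as different values. The sketch below only demonstrates that fact in pure Python; `bits_of`, `float_from_bits`, and `canonical_bits` are hypothetical helpers, and this says nothing about which approach the PR actually takes.

```python
import math
import struct


def float_from_bits(bits):
    """Build a float64 from a 64-bit integer bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def bits_of(x):
    """Return the 64-bit integer bit pattern of a float64."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def canonical_bits(x):
    """Map every NaN to one canonical pattern before hashing (one possible policy)."""
    return 0x7FF8000000000000 if math.isnan(x) else bits_of(x)


# Three quiet NaNs that differ in payload or sign bit.
nans = [
    float_from_bits(0x7FF8000000000000),
    float_from_bits(0x7FF8000000000001),
    float_from_bits(0xFFF8000000000000),
]

print(all(math.isnan(x) for x in nans))        # True
print(len({bits_of(x) for x in nans}))         # distinct raw bit patterns
print(len({canonical_bits(x) for x in nans}))  # patterns after canonicalization
```

Hashing raw bits keeps such NaNs apart, while canonicalizing first collapses them into a single entry.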

@ngoldbaum ngoldbaum added the 07 - Deprecation and triaged (Issue/PR that was discussed in a triage meeting) labels and removed the triage review (Issue/PR to be discussed at the next triage meeting) label Aug 20, 2025
Labels
01 - Enhancement, 07 - Deprecation, triaged (Issue/PR that was discussed in a triage meeting)
Development

Successfully merging this pull request may close these issues.

np.unique: support float dtypes
2 participants