Skip to content

EHN: add numpy.topk #19117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

EHN: add numpy.topk #19117

wants to merge 1 commit into from

Conversation

quarrying
Copy link

@quarrying quarrying commented May 27, 2021

Hi, find topk elements is widely used in machine learning and elsewhere, and some python frameworks have implemented it, e.g. TensorFlow (tf.math.top_k), PyTorch (torch.topk).

I try to implement topk using core numpy functions, and have checked its correctness using torch.topk. The following is the test code snippet:

import random

import torch
import numpy as np


def topk_np(a, k, axis=-1, largest=True, sorted=True):
    if axis is None:
        axis_size = a.size
    else:
        axis_size = a.shape[axis]
    assert 1 <= k <= axis_size

    a = np.asanyarray(a)
    if largest:
        index_array = np.argpartition(a, axis_size-k, axis=axis)
        topk_indices = np.take(index_array, -np.arange(k)-1, axis=axis)
    else:
        index_array = np.argpartition(a, k-1, axis=axis)
        topk_indices = np.take(index_array, np.arange(k), axis=axis)
    topk_values = np.take_along_axis(a, topk_indices, axis=axis)
    if sorted:
        sorted_indices_in_topk = np.argsort(topk_values, axis=axis)
        if largest:
            sorted_indices_in_topk = np.flip(sorted_indices_in_topk, axis=axis)
        sorted_topk_values = np.take_along_axis(
            topk_values, sorted_indices_in_topk, axis=axis)
        sorted_topk_indices = np.take_along_axis(
            topk_indices, sorted_indices_in_topk, axis=axis)
        return sorted_topk_values, sorted_topk_indices
    return topk_values, topk_indices
    

def test_for_float_types():
    # could change its shape
    arr = np.random.rand(100, 34, 43, 54)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    arr_pt = torch.from_numpy(arr)
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


def test_for_signed_int_types():
    # could change its shape
    shape = (100, 34, 43, 54)
    # for the consistency of indices, use unique numbers
    arr = np.arange(np.prod(shape), dtype=np.int32)
    np.random.shuffle(arr)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    arr_pt = torch.from_numpy(arr)
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


def test_for_unsigned_int_types():
    # could change its shape
    shape = (100, 34, 43, 54)
    # for the consistency of indices, use unique numbers
    arr = np.arange(np.prod(shape), dtype=np.uint32)
    np.random.shuffle(arr)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    # torch.from_numpy does not support uint32, so fix it!
    arr_pt = torch.from_numpy(arr.astype(np.float32))
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


if __name__ == '__main__':
    num_repeats = 100
    for k in range(num_repeats):
        # test_for_float_types()
        test_for_signed_int_types()
        # test_for_unsigned_int_types()

if largest:
index_array = np.argpartition(-a, k-1, axis=axis, order=None)
else:
index_array = np.argpartition(a, k-1, axis=axis, order=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negating like this won't work for unsigned types

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply, I will test it for unsigned types again.

Copy link
Author

@quarrying quarrying May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I have fixed this bug with the follwing codes and updated test codes.

    if largest:
        index_array = np.argpartition(a, axis_size-k, axis=axis)
        topk_indices = np.take(index_array, -np.arange(k)-1, axis=axis)
    else:
        index_array = np.argpartition(a, k-1, axis=axis)
        topk_indices = np.take(index_array, np.arange(k), axis=axis)

@eric-wieser
Copy link
Member

Note that as a new bit of API, this should hit the mailing list.

My impression is that in its current form topk is not a good fit for numpy, as its API is quite different from that of sort/argsort and partition/argpartition. Maybe separate topk/argtopk functions would resolve that, or maybe the mailing list can suggest a package where this function would be a better fit.

The largest feature here feels like it should perhaps be a reverse argument to sort / argsort / partition / argpartition. IMO that would be suitable for includsion in numpy, and we might already have an issue about that somewhere else.

@quarrying quarrying closed this May 28, 2021
@quarrying quarrying reopened this May 28, 2021
@BvB93
Copy link
Member

BvB93 commented May 28, 2021

The largest feature here feels like it should perhaps be a reverse argument to sort / argsort / partition / argpartition. IMO that would be suitable for includsion in numpy, and we might already have an issue about that somewhere else.

Are there any cases wherein reversal cannot be easily achieved with, e.g., the likes of ar[::-1]?

@bashtage
Copy link
Contributor

bashtage commented May 28, 2021

Not that I think this is a great fit, but topk is a bad name. Reading the mailing list thread I was wondering what would a TO PK function actually do? top_k or just top as in PySpark would be better names (the k feels redundant).

JuliaPoo added a commit to JuliaPoo/numpy that referenced this pull request Jun 12, 2024
Following previous discussion at numpy#15128. I made a small change
to the interface in the previous discussion by changing the
`mode` keyword into a `largest` bool flag. This follows API
such as from
[torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html).

Carrying from the previous discussion, a parameter might be useful is
`sorted`. This is also implemented in `torch.topk`, and follows from
previous work at numpy#19117.

Co-authored-by: quarrying
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Needs decision
Development

Successfully merging this pull request may close these issues.

5 participants