EHN: add numpy.topk #19117

quarrying · 2021-05-27T17:11:01Z

Hi, find topk elements is widely used in machine learning and elsewhere, and some python frameworks have implemented it, e.g. TensorFlow (tf.math.top_k), PyTorch (torch.topk).

I try to implement topk using core numpy functions, and have checked its correctness using torch.topk. The following is the test code snippet:

import random

import torch
import numpy as np


def topk_np(a, k, axis=-1, largest=True, sorted=True):
    if axis is None:
        axis_size = a.size
    else:
        axis_size = a.shape[axis]
    assert 1 <= k <= axis_size

    a = np.asanyarray(a)
    if largest:
        index_array = np.argpartition(a, axis_size-k, axis=axis)
        topk_indices = np.take(index_array, -np.arange(k)-1, axis=axis)
    else:
        index_array = np.argpartition(a, k-1, axis=axis)
        topk_indices = np.take(index_array, np.arange(k), axis=axis)
    topk_values = np.take_along_axis(a, topk_indices, axis=axis)
    if sorted:
        sorted_indices_in_topk = np.argsort(topk_values, axis=axis)
        if largest:
            sorted_indices_in_topk = np.flip(sorted_indices_in_topk, axis=axis)
        sorted_topk_values = np.take_along_axis(
            topk_values, sorted_indices_in_topk, axis=axis)
        sorted_topk_indices = np.take_along_axis(
            topk_indices, sorted_indices_in_topk, axis=axis)
        return sorted_topk_values, sorted_topk_indices
    return topk_values, topk_indices
    

def test_for_float_types():
    # could change its shape
    arr = np.random.rand(100, 34, 43, 54)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    arr_pt = torch.from_numpy(arr)
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


def test_for_signed_int_types():
    # could change its shape
    shape = (100, 34, 43, 54)
    # for the consistency of indices, use unique numbers
    arr = np.arange(np.prod(shape), dtype=np.int32)
    np.random.shuffle(arr)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    arr_pt = torch.from_numpy(arr)
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


def test_for_unsigned_int_types():
    # could change its shape
    shape = (100, 34, 43, 54)
    # for the consistency of indices, use unique numbers
    arr = np.arange(np.prod(shape), dtype=np.uint32)
    np.random.shuffle(arr)

    axis = random.choice([*range(arr.ndim), None])
    if axis is not None:
        k = random.choice(range(1, arr.shape[axis]))
    else:
        k = random.choice(range(1, arr.size))
    largest = random.choice([True, False])
    sorted = True
    print('axis={}, k={}, largest={}, sorted={}'.format(axis, k, largest, sorted))

    top_values, top_indices = topk_np(arr, k, axis=axis, largest=largest, sorted=sorted)
    # topk(): argument 'dim' must be int, not NoneType, so fix it!
    if axis is None:
        arr = arr.flatten()
        axis = 0
    # torch.from_numpy does not support uint32, so fix it!
    arr_pt = torch.from_numpy(arr.astype(np.float32))
    top_values_pt, top_indices_pt = torch.topk(arr_pt, k, dim=axis, largest=largest, sorted=sorted)

    assert np.allclose(top_values, top_values_pt.numpy())
    assert np.allclose(top_indices, top_indices_pt.numpy())


if __name__ == '__main__':
    num_repeats = 100
    for k in range(num_repeats):
        # test_for_float_types()
        test_for_signed_int_types()
        # test_for_unsigned_int_types()

eric-wieser · 2021-05-27T17:26:57Z

numpy/core/fromnumeric.py

+    if largest:
+        index_array = np.argpartition(-a, k-1, axis=axis, order=None)
+    else:
+        index_array = np.argpartition(a, k-1, axis=axis, order=None)


Negating like this won't work for unsigned types

Thanks for your reply, I will test it for unsigned types again.

Hi, I have fixed this bug with the follwing codes and updated test codes.

if largest: index_array = np.argpartition(a, axis_size-k, axis=axis) topk_indices = np.take(index_array, -np.arange(k)-1, axis=axis) else: index_array = np.argpartition(a, k-1, axis=axis) topk_indices = np.take(index_array, np.arange(k), axis=axis)

eric-wieser · 2021-05-28T09:43:40Z

Note that as a new bit of API, this should hit the mailing list.

My impression is that in its current form topk is not a good fit for numpy, as its API is quite different from that of sort/argsort and partition/argpartition. Maybe separate topk/argtopk functions would resolve that, or maybe the mailing list can suggest a package where this function would be a better fit.

The largest feature here feels like it should perhaps be a reverse argument to sort / argsort / partition / argpartition. IMO that would be suitable for includsion in numpy, and we might already have an issue about that somewhere else.

BvB93 · 2021-05-28T12:33:08Z

The largest feature here feels like it should perhaps be a reverse argument to sort / argsort / partition / argpartition. IMO that would be suitable for includsion in numpy, and we might already have an issue about that somewhere else.

Are there any cases wherein reversal cannot be easily achieved with, e.g., the likes of ar[::-1]?

bashtage · 2021-05-28T15:18:46Z

Not that I think this is a great fit, but topk is a bad name. Reading the mailing list thread I was wondering what would a TO PK function actually do? top_k or just top as in PySpark would be better names (the k feels redundant).

Ref: data-apis#629 Ref: numpy/numpy#19117 Ref: numpy/numpy#15128 Ref: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/F4P5UVTAKRJJ3OORI6UOWFSUEE5CNTSC/#PELUDW5ACUBHBNK5IVGWIWTQHBM2HXUP

Following previous discussion at numpy#15128. I made a small change to the interface in the previous discussion by changing the `mode` keyword into a `largest` bool flag. This follows API such as from [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html). Carrying from the previous discussion, a parameter might be useful is `sorted`. This is also implemented in `torch.topk`, and follows from previous work at numpy#19117. Co-authored-by: quarrying

eric-wieser reviewed May 27, 2021

View reviewed changes

quarrying force-pushed the main branch from 6a703b8 to d9ef740 Compare May 27, 2021 18:05

quarrying closed this May 28, 2021

quarrying reopened this May 28, 2021

quarrying force-pushed the main branch from d9ef740 to 3ecb4e4 Compare May 28, 2021 12:35

EHN: add numpy.topk

80c3862

quarrying force-pushed the main branch from 3ecb4e4 to 80c3862 Compare May 28, 2021 13:10

rgommers added the 01 - Enhancement label May 29, 2021

rkern mentioned this pull request Apr 12, 2022

Feature request: Allow np.argmax to output top K maximum values #15128

Open

ogrisel mentioned this pull request May 17, 2023

RFC: add topk and / or argpartition data-apis/array-api#629

Open

kgryte mentioned this pull request Dec 14, 2023

feat: add API specification for returning the k largest elements data-apis/array-api#722

Open

JuliaPoo mentioned this pull request Jun 12, 2024

WIP: top_k draft implementation #26666

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EHN: add numpy.topk #19117

EHN: add numpy.topk #19117

quarrying commented May 27, 2021 •

edited

Loading

eric-wieser May 27, 2021

quarrying May 27, 2021

quarrying May 28, 2021 •

edited

Loading

eric-wieser commented May 28, 2021

BvB93 commented May 28, 2021

bashtage commented May 28, 2021 •

edited by rgommers

Loading

EHN: add numpy.topk #19117

Are you sure you want to change the base?

EHN: add numpy.topk #19117

Conversation

quarrying commented May 27, 2021 • edited Loading

eric-wieser May 27, 2021

Choose a reason for hiding this comment

quarrying May 27, 2021

Choose a reason for hiding this comment

quarrying May 28, 2021 • edited Loading

Choose a reason for hiding this comment

eric-wieser commented May 28, 2021

BvB93 commented May 28, 2021

bashtage commented May 28, 2021 • edited by rgommers Loading

quarrying commented May 27, 2021 •

edited

Loading

quarrying May 28, 2021 •

edited

Loading

bashtage commented May 28, 2021 •

edited by rgommers

Loading