Skip to content

OPT: Optimize indexing using dynamic thread block sizes. #3111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 23, 2021

Conversation

umar456
Copy link
Member

@umar456 umar456 commented Mar 20, 2021

This optimization dynamically sets the block size based on the output array
dimension. Originally we had a block size of 32x8 threads per block. This
configuration was not ideal when indexing into a long array where you
had few columns and many rows. The current approach creates blocks of
256x1, 128x2, 64x4 and 32,8 to better accommodate smaller dimensions.

Description

This optimization dynamically sets the block size based on the output array
dimension. Originally we had a block size of 32x8 threads per block. This
configuration was not ideal when indexing into a long array where you
had few columns and many rows. The current approach creates blocks of
256x1, 128x2, 64x4 and 32,8 to better accommodate smaller dimensions.

Changes to Users

None

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass
  • [ ] Functions added to unified API
  • [ ] Functions documented

@umar456 umar456 added the perf label Mar 20, 2021
@umar456 umar456 requested a review from 9prady9 March 20, 2021 04:09
@9prady9
Copy link
Member

9prady9 commented Mar 20, 2021

What is the speedup range ?

@umar456
Copy link
Member Author

umar456 commented Mar 20, 2021

TEST(Index, blah) {
    array a   = randu(20000000, 10);
    array idx = seq(20000000);

    array b = a(idx);
    b.eval();
    af::sync();
}

Master:

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
    49.7       10,107,916          1  10,107,916.0  10,107,916  10,107,916  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    38.4        7,811,232          1   7,811,232.0   7,811,232   7,811,232  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
     8.1        1,645,554          1   1,645,554.0   1,645,554   1,645,554  KER9745534647381087054                                                                              
     3.9          791,546          1     791,546.0     791,546     791,546  void cuda::range<float>(cuda::Param<float>, int, int, int) 

This PR:

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                                                
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    60.6        7,809,308          1  7,809,308.0  7,809,308  7,809,308  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
    20.3        2,615,049          1  2,615,049.0  2,615,049  2,615,049  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    12.8        1,646,066          1  1,646,066.0  1,646,066  1,646,066  KER9745534647381087054                                                                              
     6.3          812,761          1    812,761.0    812,761    812,761  void cuda::range<float>(cuda::Param<float>, int, int, int) 

3.8x faster

This optimization dynamically sets the block size based on the output array
dimension. Originally we had a block size of 32x8 threads per block. This
configuration was not ideal when indexing into a long array where you
had few columns and many rows. The current approach creates blocks of
256x1, 128x2, 64x4 and 32x8 to better accommodate smaller dimensions.
@9prady9 9prady9 merged commit d56c3bc into arrayfire:master Mar 23, 2021
@9prady9 9prady9 deleted the index_opt branch March 23, 2021 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants