Move Lookup tables in CUDA backend to texture memory to reduce global constant memory usage #2791

Merged
merged 2 commits into from
Mar 14, 2020

Member

@9prady9 9prady9 commented Mar 12, 2020

Profiling shows negligible to no difference between the texture-based lookup table and the constant-memory lookup table. Hence, the lookup tables are moved to texture memory to reduce global constant memory usage.

@9prady9 9prady9 added this to the v3.7.1 milestone Mar 12, 2020
@9prady9 9prady9 requested a review from umar456 March 12, 2020 10:14
@9prady9 9prady9 force-pushed the lut_move branch 2 times, most recently from afc0ff8 to e0a0307 Compare March 13, 2020 16:11
class LookupTable1D {
public:
LookupTable1D() = delete;
LookupTable1D(const LookupTable1D& arg) = delete;
Member

It makes sense to have a copy constructor for this, right? You can copy a texture if you need to.

Member Author

Then we would also have to take care of how the texture object is copied; copying just the handle won't work. A move operation won't have such issues, though. We don't need either for these use cases.
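To illustrate the ownership point in the reply above, here is a minimal C++ sketch of a move-only resource wrapper. It is not the PR's `LookupTable1D`: the names `FakeTexture`, `createFakeTexture`, `destroyFakeTexture`, and `MoveOnlyLut` are hypothetical stand-ins for a GPU texture handle and its create/destroy calls, chosen so the example runs without the CUDA runtime. If a wrapper like this copied only the handle, two objects would release the same texture; a move transfers the handle and leaves the source empty, so exactly one release happens.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a GPU texture handle; 0 means "no texture".
using FakeTexture = unsigned long long;

// Number of fake textures created and not yet destroyed.
static int g_live_textures = 0;

// Hypothetical stand-ins for the real create/destroy calls.
FakeTexture createFakeTexture() {
    ++g_live_textures;
    return 1;  // any non-zero value marks a live handle in this sketch
}

void destroyFakeTexture(FakeTexture) { --g_live_textures; }

class MoveOnlyLut {
   public:
    MoveOnlyLut() : tex_(createFakeTexture()) {}

    // Copying only the handle would make two owners release one texture.
    MoveOnlyLut(const MoveOnlyLut&)            = delete;
    MoveOnlyLut& operator=(const MoveOnlyLut&) = delete;

    // Moving transfers ownership and empties the source.
    MoveOnlyLut(MoveOnlyLut&& other) noexcept
        : tex_(std::exchange(other.tex_, 0)) {}

    MoveOnlyLut& operator=(MoveOnlyLut&& other) noexcept {
        if (this != &other) {
            release();
            tex_ = std::exchange(other.tex_, 0);
        }
        return *this;
    }

    ~MoveOnlyLut() { release(); }

    FakeTexture get() const { return tex_; }

   private:
    // Destroys the handle only if this object still owns one.
    void release() {
        if (tex_ != 0) {
            destroyFakeTexture(tex_);
            tex_ = 0;
        }
    }

    FakeTexture tex_;
};

static_assert(!std::is_copy_constructible<MoveOnlyLut>::value,
              "copying is disabled");
static_assert(std::is_move_constructible<MoveOnlyLut>::value,
              "moving is allowed");
```

Supporting real copies would additionally require allocating a second texture and duplicating the underlying data, which is the extra work the reply refers to.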

9prady9 added 2 commits March 13, 2020 23:30
cuda::kernel::locate_features is the CUDA kernel that uses the fast
lookup table. Shared below is the performance of the kernel using constant
memory vs texture memory. There is negligible to no difference between the
two versions. Hence, shifted to texture memory LUT to reduce global
constant memory usage.

Performance using constant memory LUT
-------------------------------------

Time(%)    Time   Calls      Avg       Min       Max  Name
1.48%  101.09us      3  33.696us  32.385us  34.976us  void cuda::kernel::locate_features<float, int=9>
1.34%  91.713us      2  45.856us  45.792us  45.921us  void cuda::kernel::locate_features<double, int=9>
1.02%  69.505us      2  34.752us  34.400us  35.105us  void cuda::kernel::locate_features<unsigned int, int=9>
0.99%  67.456us      2  33.728us  32.768us  34.688us  void cuda::kernel::locate_features<int, int=9>
0.95%  65.186us      2  32.593us  31.201us  33.985us  void cuda::kernel::locate_features<short, int=9>
0.93%  63.874us      2  31.937us  30.817us  33.057us  void cuda::kernel::locate_features<unsigned short, int=9>

Performance using texture LUT
-----------------------------

Time(%)    Time   Calls      Avg       Min       Max  Name
1.45%  99.776us      3  33.258us  32.896us  33.504us  void cuda::kernel::locate_features<float, int=9>
1.33%  91.105us      2  45.552us  44.961us  46.144us  void cuda::kernel::locate_features<double, int=9>
1.02%  70.017us      2  35.008us  34.273us  35.744us  void cuda::kernel::locate_features<unsigned int, int=9>
0.97%  66.689us      2  33.344us  32.065us  34.624us  void cuda::kernel::locate_features<int, int=9>
0.95%  65.249us      2  32.624us  31.585us  33.664us  void cuda::kernel::locate_features<short, int=9>
0.95%  65.025us      2  32.512us  30.945us  34.080us  void cuda::kernel::locate_features<unsigned short, int=9>
cuda::kernel::extract_orb is the CUDA kernel that uses the orb
lookup table. Shared below is the performance of the kernel using constant
memory vs texture memory. There is negligible to no difference between the
two versions. Hence, shifted to texture memory LUT to reduce global
constant memory usage.

Performance using constant memory LUT
-------------------------------------

Time(%)    Time   Calls      Avg       Min       Max  Name
3.02%  292.26us     24  12.177us  11.360us  14.528us  void cuda::kernel::extract_orb<float>
2.16%  209.00us     16  13.062us  11.616us  16.033us  void cuda::kernel::extract_orb<double>

Performance using texture LUT
-----------------------------

Time(%)    Time   Calls      Avg       Min       Max  Name
2.84%  270.63us     24  11.276us  9.6970us  15.040us  void cuda::kernel::extract_orb<float>
2.20%  209.28us     16  13.080us  10.688us  16.960us  void cuda::kernel::extract_orb<double>
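The "negligible to no difference" claim can be checked directly against the Avg columns reported above. The following is a small C++ sketch that computes the relative change of the texture-LUT average against the constant-memory average for each kernel instantiation. The numbers are copied from the profiler tables in the two commit messages; the struct and function names (`AvgPair`, `relativeChangePercent`, `largestAbsChangePercent`) are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// One row per kernel instantiation: average time in microseconds
// with the constant-memory LUT and with the texture LUT.
struct AvgPair {
    const char* kernel;
    double constantAvgUs;
    double textureAvgUs;
};

// Avg values taken from the profiler tables in the commit messages.
const AvgPair kAverages[] = {
    {"locate_features<float>",          33.696, 33.258},
    {"locate_features<double>",         45.856, 45.552},
    {"locate_features<unsigned int>",   34.752, 35.008},
    {"locate_features<int>",            33.728, 33.344},
    {"locate_features<short>",          32.593, 32.624},
    {"locate_features<unsigned short>", 31.937, 32.512},
    {"extract_orb<float>",              12.177, 11.276},
    {"extract_orb<double>",             13.062, 13.080},
};
const std::size_t kNumAverages = sizeof(kAverages) / sizeof(kAverages[0]);

// Relative change in percent; negative means the texture LUT was faster.
double relativeChangePercent(const AvgPair& p) {
    return (p.textureAvgUs - p.constantAvgUs) / p.constantAvgUs * 100.0;
}

// Largest magnitude of relative change across the given rows.
double largestAbsChangePercent(const AvgPair* rows, std::size_t count) {
    double worst = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        const double change = std::fabs(relativeChangePercent(rows[i]));
        if (change > worst) { worst = change; }
    }
    return worst;
}
```

On these figures every locate_features instantiation changes by less than 2% in either direction, and the largest shift overall is extract_orb<float>, which averages about 7% faster with the texture LUT. Each row comes from a single profiler run with only 2 to 24 calls, so it is reasonable to assume, though the thread does not measure it, that differences of this size are within run-to-run variation.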
@9prady9 9prady9 merged commit 0d61c6f into arrayfire:master Mar 14, 2020
@9prady9 9prady9 deleted the lut_move branch March 14, 2020 04:54