Move Lookup tables in CUDA backend to texture memory to reduce global constant memory usage #2791
src/backend/cuda/texture.hpp
class LookupTable1D {
   public:
    LookupTable1D() = delete;
    LookupTable1D(const LookupTable1D& arg) = delete;
It makes sense to have a copy constructor for this, right? You can copy a texture if you need to.
Then we also have to take care of how the texture object itself is copied; simply copying the handle won't work. A move operation won't have that issue, though. We don't need either for these use cases.
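To make the copy-versus-move point concrete, here is a minimal host-only sketch of a move-only wrapper. It is not the code in this PR: `FakeTexture`, `createFakeTexture`, `destroyFakeTexture`, and `MoveOnlyLookupTable` are hypothetical stand-ins for a texture object handle and its create/destroy calls, chosen so the example compiles without the CUDA toolkit. Copying is deleted because two wrappers holding the same handle would both try to destroy it; a move only transfers ownership of the handle.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a texture object handle; 0 means "no texture".
typedef unsigned long long FakeTexture;

static int g_liveTextures   = 0;  // number of handles not yet destroyed
static FakeTexture g_nextId = 1;  // next handle value to hand out

inline FakeTexture createFakeTexture() { ++g_liveTextures; return g_nextId++; }
inline void destroyFakeTexture(FakeTexture) { --g_liveTextures; }
inline int liveTextureCount() { return g_liveTextures; }

// Move-only owner of one (fake) texture handle.
class MoveOnlyLookupTable {
   public:
    MoveOnlyLookupTable() = delete;
    explicit MoveOnlyLookupTable(int numElements)
        : mTexture(createFakeTexture()), mSize(numElements) {}

    // Copying would leave two owners of one handle (double destroy).
    MoveOnlyLookupTable(const MoveOnlyLookupTable&)            = delete;
    MoveOnlyLookupTable& operator=(const MoveOnlyLookupTable&) = delete;

    // Moving steals the handle and leaves the source empty.
    MoveOnlyLookupTable(MoveOnlyLookupTable&& other) noexcept
        : mTexture(other.mTexture), mSize(other.mSize) {
        other.mTexture = 0;
        other.mSize    = 0;
    }
    MoveOnlyLookupTable& operator=(MoveOnlyLookupTable&& other) noexcept {
        if (this != &other) {
            release();
            mTexture       = other.mTexture;
            mSize          = other.mSize;
            other.mTexture = 0;
            other.mSize    = 0;
        }
        return *this;
    }

    ~MoveOnlyLookupTable() { release(); }

    FakeTexture get() const noexcept { return mTexture; }
    int size() const noexcept { return mSize; }

   private:
    // Destroy the handle only if this object still owns one.
    void release() noexcept {
        if (mTexture != 0) {
            destroyFakeTexture(mTexture);
            mTexture = 0;
        }
    }

    FakeTexture mTexture;
    int mSize;
};

// Moves a table between two owners and returns the number of live
// handles after both owners have gone out of scope.
inline int demoMoveOnly() {
    {
        MoveOnlyLookupTable a(16);
        MoveOnlyLookupTable b(std::move(a));  // ownership transfers to b
        assert(a.get() == 0 && b.size() == 16);
        assert(liveTextureCount() == 1);  // still exactly one handle
    }
    return liveTextureCount();
}
```

With a real texture object, a copy constructor would need to create a second texture over the same (or a duplicated) buffer rather than duplicate the handle value, which is the extra work mentioned above.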
cuda::kernel::locate_features is the CUDA kernel that uses the FAST lookup table. The profiler output below compares the kernel using constant memory vs texture memory. There is negligible to no difference between the two versions, so the LUT was moved to texture memory to reduce global constant memory usage.

Performance using constant memory LUT
-------------------------------------
Time(%)  Time      Calls  Avg       Min       Max       Name
1.48%    101.09us  3      33.696us  32.385us  34.976us  void cuda::kernel::locate_features<float, int=9>
1.34%    91.713us  2      45.856us  45.792us  45.921us  void cuda::kernel::locate_features<double, int=9>
1.02%    69.505us  2      34.752us  34.400us  35.105us  void cuda::kernel::locate_features<unsigned int, int=9>
0.99%    67.456us  2      33.728us  32.768us  34.688us  void cuda::kernel::locate_features<int, int=9>
0.95%    65.186us  2      32.593us  31.201us  33.985us  void cuda::kernel::locate_features<short, int=9>
0.93%    63.874us  2      31.937us  30.817us  33.057us  void cuda::kernel::locate_features<unsigned short, int=9>

Performance using texture LUT
-----------------------------
Time(%)  Time      Calls  Avg       Min       Max       Name
1.45%    99.776us  3      33.258us  32.896us  33.504us  void cuda::kernel::locate_features<float, int=9>
1.33%    91.105us  2      45.552us  44.961us  46.144us  void cuda::kernel::locate_features<double, int=9>
1.02%    70.017us  2      35.008us  34.273us  35.744us  void cuda::kernel::locate_features<unsigned int, int=9>
0.97%    66.689us  2      33.344us  32.065us  34.624us  void cuda::kernel::locate_features<int, int=9>
0.95%    65.249us  2      32.624us  31.585us  33.664us  void cuda::kernel::locate_features<short, int=9>
0.95%    65.025us  2      32.512us  30.945us  34.080us  void cuda::kernel::locate_features<unsigned short, int=9>
cuda::kernel::extract_orb is the CUDA kernel that uses the ORB lookup table. The profiler output below compares the kernel using constant memory vs texture memory. There is negligible to no difference between the two versions, so the LUT was moved to texture memory to reduce global constant memory usage.

Performance using constant memory LUT
-------------------------------------
Time(%)  Time      Calls  Avg       Min       Max       Name
3.02%    292.26us  24     12.177us  11.360us  14.528us  void cuda::kernel::extract_orb<float>
2.16%    209.00us  16     13.062us  11.616us  16.033us  void cuda::kernel::extract_orb<double>

Performance using texture LUT
-----------------------------
Time(%)  Time      Calls  Avg       Min       Max       Name
2.84%    270.63us  24     11.276us  9.6970us  15.040us  void cuda::kernel::extract_orb<float>
2.20%    209.28us  16     13.080us  10.688us  16.960us  void cuda::kernel::extract_orb<double>
There is negligible to no performance difference between the texture-based look-up tables and the constant memory look-up tables. Hence, the look-up tables were moved to texture memory to reduce global constant memory usage.