CUDA Implementation of A Biologically Inspired Object Recognition System

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

CUDA implementation of a biologically inspired object recognition system

Sharat Chikkerur
Abstract The HMAX model is a biologically motivated architecture for computer vision whose components are in close agreement with existing physiological evidence. The model is capable of achieving close to human level performance on several rapid object recognition tasks. However, the model is computationally bound and has limited engineering applications in its current form. In this work, we wish to increase the computational eciency using a two-fold approach: (i) We present several ltering approximations that improve the algorithm itself and (ii) We make use of the inherent parallelism in the computation using SIMD capabilities of the CUDA platform. We compare the performance both in terms of accuracy and timing with the baseline matlab implementation.

Introduction

The HMAX model is a biologically motivated model that mimics the organization of the rst few stages for the human visual pathway. More specically, it tries to explain the purely feed forward mechanism for rapid object recognition that occurs in the rst 100-150ms of visual processing. The model is organized with alternating layers consisting of simple(S) units and complex(C) units (See Figure 1). S units have specic preferred input patterns (such as oriented lines) and perform simple pattern matching. Complex units spatially pool the outputs from the S units and select the maximum, thus providing spatial invariance. Complexity of the preferred input to the S units and the degree of location and scale invariance achieved at the C units increases as we move up the hierarchy. The model requires very few training examples and competes with the state-of-the-art system in object recognition task. A detailed description of the HMAX model can be found in [1]. While the HMAX model is surprisingly accurate and a biologically faithful, it is currently bounded by computation and is not very useful for real-life application. In this report, we study a specic instantiation of the model, with two S,C layer pairs [1] that simulate visual processing in area V1/V2-V4/PIT.

1.1

Computational Complexity

S1 layer: The S1 layer computation consists of performing normalized cross correlation between a bank of 64 gabor lters (16 scales and 4 directions and the image. This operation is non-linear and has a computational complexity of O(N 2 M 2 ), where M M is the size of the lter and N N is the size of the image. C1 layer: The C1 layer is computed by performing a spatial and scale-level max pooling operation on the S1 outputs.The computational cost of this operation is O(N 2 M ), since max operation can be performed in a separable fashion (x direction rst and then y direction). In the current instantiation, the C1 layer consists of 8 bands of pooling. S2 layer: During training, several hundred C1 patches of dierent sizes are randomly selected from the training images and serve as prototype features that represent a single object (class specic dictionary) or all objects (universal dictionary) [1]. Each S2 unit is designed to detect the presence of the prototype 1

Figure 1: HMAX Model Architecture. S1 layers consists of simple gabor lter at several scales and orientation. C1 pools the lter outputs spatially and across adjacent scales. S2 is tuned to conjunctions of orientations. C2 provides further spatial and scale invariance. The C2 outputs are directly fed to a classier

within the C1 unit responses. Each patch serves as a lter that is correlated with the 8 C1 bands to generate a tuning response map. Deriving the S2 unit outputs is the most computationally intensive part of the model, involving O(P N 2 M 2 ) computation. Here P represents the number of patches. C2 layer: The C2 layer provides the nal invariance by detecting the maximum response of the corresponding S2 units over all scales and positions. There are as many C2 units as the number of patches in the representation dictionary. The computational complexity of this step is O(N 2 M P ).

1.2

Parallelism

It can be seen from the previous section that while the algorithm is serial(feed-forward) in nature, each S stage requires computing responses of several lters on the same input. Each of the responses can be computed independently, making it embarrassingly parallel in nature. We attempt to improve the performance of the algorithm on two fronts. Firstly, we attempt at improving the theoretical eciency of the algorithm itself. Secondly, we attempt to improve the computational performance by using parallel execution. In embarrassingly parallel applications these provide independent speed-ups. 1.2.1 Algorithm speedup of the S1 layer

The S1 layer is obtained by performing pattern matching with short lines at several dierent scales and orientations. The oriented lines are modeled as anisotropic Gabor lters given by, G(x, y ) where, u v = exp = u2 + v 2 2 2 cos x y 2 (1) (2)

cos() sin()

sin() cos()

Here represents the eccentricity/anisotropy of the lter, is the orientation and is the wavelength of the lter. S1 unit response S 1(s, d) at scale s and direction d is computed using a normalized cross correlation 2

Figure 2: Bank of gabor lters used to compute S1 features. The receptive elds sizes and other parameters are outlined in [1].

operation S 1(x, y ) = (x, y, ) I (x, y ) G I 2 (x, y ) H (x, y ) (3)

is the zero mean,unit normal version of the lter and H (x, y ) a lter with all ones and the same size Here G as the Gabor lter. The responses at dierent scales are obtained by performing normalized convolution with lters of dierent sizes as shown in Figure 2. However, it is to be noted that in the next stage (C1), the ltered output are sub-sampled after the max operation, with the range of max-pooling being proportional to the lter size. Mathematically, the input to the next stage from a lter of scale s is (x/s, y/s) I (x, y ) s S 1s (x, y ) = G This can be simplied to (x, y ) (I (x, y ) s) S 1s (x, y ) G (4)

(5)

Thus, instead of convolving the image with increasingly larger lter (computational complexity of O(N 2 M 2 ), where M is the lter size), we reduce it to O(N 2 /s2 M 2 ) at scale S, saving considerable computational time.

1.3

Device Architecture

The NVIDIA CUDA capable family of graphics processor units allow us to access the highly parallel processors for general purpose (SIMD-type) computation. Here we contrast the GPU architecture with that of the CPU. Existing CPUs that have in the order of 10s of computing cores (PS3 has 8+1 core, Tirela chips have 64 compute cores). In contrast, CUDA enabled GPUs have in the order of 100s of compute cores (GTX 8800 has 96 compute cores). CPUs can spawn several 10s of threads each of which can share the same address space. In contrast, each processor in the GPU can spawn several 100s of threads. Each of the threads share a slower(but larger) global memory space and a fasther (but smaller) shared memory.

1.4

Execution Model

In a traditional multi-threaded programming model, multiple threads operate independently of each other. However, the NVIDIA GPU follows a slightly dierent programming model. The code to be executed in 3

the device is called a kernel. The GPU can execute only one kernel at a time. However, the code can be executed simultaneously on all the cores in paralle. Further, each kernel is divided into a grid of blocks. Each block in the grid has access to a fast (but small) amount of shared memory. Each block is further composed into an array of threads. Typically, each block invokes 100s of threads. The kernel has access to information that indicates which block and which thread is currently under execution. A common strategy for programming is to split the problem into blocks and execute each kernel block in parallel.

1.5

Memory Hierarchy

When programming for the CPU, most programming languages allow transparent access to memory abstracting the memory hierarchy. However, when programming for the GPU, the programmer has the ability to directly address any part of the memory hierarchy and in fact is encouraged to consider the hierarchy to optimize the running time. The GPU memory is divided into Register: Each core in the GPU has access to 128 registers that are local. This serves as the fastest read write location (one cycle) available to each thread. Shared: Threads in each block can access a small about ( 16 Kb) of shared memory that is several times faster than the global ( 1Gb) memory. A common strategy in GPU programs is to stage the data in shared memory, use each thread to operate on a separate part of the shared memory and then copy the results back into main memory. Global: The GPU has a large amount ( 1 Gb) of main memory that can be accessed by all grids, blocks and threads. However, reading and writing to this area is slow requiring several hundred cycles. This cost can be amortized if the computational cost is much greater than the communication cost. Constant: This memory is read only, but can still be access across all grids. Texture: Texture memory is optimized for spatially local memory access (common in graphics and image processing). It provides a much faster read-only access when compared to global memory. Further, memory regions bound to a texture can be accessed in normalized co-ordinates.

1.6

Implementation Strategy

In this section, we outline our strategy for implementing dierent layers within the model. 1.6.1 S1 Layer

Computing the S1 layer involves convolving a pyramid of images (scaled to a constant factor) with a lterbank (of size four in our case). We parallelize this computation by iterating over dierent sizes of images and assigning each output pixel to a thread. To obtain maximum eciency, the threads are organized as 16 16 blocks. Both the input and output memory are aligned for coalesced read and writes. Both the input image and the lter are stored in texture memory prior to callin gthe kernel. Further speedup was obtained by unrolling the convolution loop using templates.

1.7

C1 layer

Computing the C1 layer involves computing maximum in a local neighborhood across all the layers of the S1 (four in our case, corresponding to the four lters). We parallel this computation by iterating over the dierent image sizes and assigning each output pixel to a dierent thread in the kernel. The input and output memory are allocated using cudaMallocPitch for coalesced memory access.

size 100 128 256 512

MATLAB 2.73 3.35 15.88 70.03

GPU 0.11 0.12 0.18 0.45

Speedup 25.75 X 27.91 X 88.04 X 155.61 X

Table 1: The table shows the time required to compute C2 on images of dierent sizes using MATLAB and CUDA implementation. The matlab was run on a 3.0 GHz dual core machine with 2GB main memory. The C1 and S2 layers had depths of 4 and 200 respectively. The CUDA version of the algorithm still gives us a large improvement in speed.

1.8

S2 layer

The computation is similar to the S1 layer except for the fact that both the input and the lters are now three dimensional. We parallelize this computation by iterating over C1 inputs of dierent sizes and assigning each output pixel to a dierent thread. Further speedup was achieved by coalesced memory access and loop unrolling. It is to be noted that unlike the S1 layer, the S2 layer is computed by measure similarity of the C1 layer with several hundred patches/lters instead of just four as in the S1 layer. Thus, not all the blocks are executed in parallel by the GPU. However, the scheduling is handled by the CUDA framework and is transparent to the programmer.

1.9

C2 layer

Computing the C2 layer involves taking the global maximum across all sizes(bands) in each dimension of the S2 units. For simplicity, we compute the C2 layer entirely on the CPU, since the communication and initialization cost would outweigh the computational benet obtained by transferring it to the GPU.

Conclusion

The HMAX/CBCL hierarchical object recognition algorithm is limited in its application due to its computational complexity. In this report, we showed speed-ups due to changes in algorithm and implementation. We show that the GPU versions give considerable speedup. We discussed the GPU architecture, execution model and memory hierarchy and showed that while the programming model oers lesser abstraction, it gives us far more computational benet than a MATLAB implementation.

Code
Matlab version: http://code.google.com/p/cbcl-model-matlab/ Multi-threaded version: http://code.google.com/p/cbcl-label-server/ CUDA version: http://code.google.com/p/cbcl-model-cuda/

The code can be obtained from the following links:

References
[1] T. Serre, Wolf L., S. Bileschi, M. Reisenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. PAMI, 2007.

You might also like