
Implementation of Gaussian Blur through Parallel Computing

Sahil Asole, Ahan Kamat, Krish Shah, Sandip T. Shingade

Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai, India
{sbasole, kashah, akamat} b21@it.vjti.ac.in, {stshingade}@it.vjti.ac.in

Abstract.

– This project report presents the implementation and performance evaluation of a 2D convolution algorithm using the Gaussian blur filter on parallel computing systems. Gaussian blur, a widely used technique for noise reduction in image processing, demands high computational power. We compare the performance of multi-core CPUs and GPUs, employing the Google Colaboratory platform to analyze the efficiency and speedup achieved through parallel computing.
– This study leverages the parallel processing capabilities of multi-core CPUs and GPUs to enhance the performance of the Gaussian blur filter. By distributing the computational workload across multiple processors, the project aims to achieve significant reductions in processing time. The Google Colaboratory platform is used to run the experiments, providing access to powerful GPU resources and enabling efficient execution of parallel algorithms.
– The report details the methodology for implementing the Gaussian blur filter using Python libraries for both the CPU (pymp) and the GPU (PyCUDA). A comprehensive performance comparison is conducted, evaluating processing times for various image resolutions and kernel sizes. The results demonstrate a substantial speedup when using GPU parallelization compared to multi-core CPU implementations. The findings underscore the potential of parallel computing to meet the demands of modern image processing tasks, making it a viable solution for applications requiring high computational throughput.
– To maximize efficiency, optimization techniques such as shared memory utilization on the GPU, memory coalescing, and minimizing thread divergence are employed. These techniques reduce memory access latency and improve the overall throughput of the Gaussian blur operation.
– The parallel implementation shows a good speedup compared to the serial run, especially for large images. The performance grows with the number of processing units, which makes the approach well suited to real applications that require high-resolution image processing. Our experimental results illustrate the advantages of parallel programming for fast and effective Gaussian blur effects. The study by Chauhan (2018) showcases the potential of optimizing the Gaussian blur filter using the CUDA parallel framework, which our implementation strives to emulate.

Keywords: Gaussian Blur, OpenMP.

1 Introduction

Introduction to the Topic: Gaussian Blur is a fundamental image processing technique widely utilized in various applications such as photography, computer
graphics, and computer vision. Its primary purpose is to reduce image noise and
detail by smoothing the image, achieved through the convolution of the image
with a Gaussian function.
Problem Illustration: Despite its popularity, Gaussian blur is computationally expensive, especially for high-resolution images and large Gaussian kernel sizes. Older algorithms designed to run through the data serially are often unable to meet the performance requirements of many real-time applications, which results in slower processing times and ultimately a less efficient system.
Problem Identification: The Gaussian Blur operation, while essential for
noise reduction and image smoothing, is computationally intensive and time-
consuming, especially for high-resolution images. Traditional sequential process-
ing methods are inadequate for real-time applications, resulting in slow and
inefficient performance. As the size of images and the complexity of Gaussian
kernels increase, the computational demands escalate, exacerbating the perfor-
mance bottlenecks.
Need for Solution: To address the inefficiencies and performance bottle-
necks of traditional sequential Gaussian Blur implementations, it is essential to
leverage parallel computing techniques. By distributing the computational work-
load across multiple processors, parallel computing can significantly enhance pro-
cessing speed and efficiency, making Gaussian Blur feasible for high-resolution
images and real-time applications.
Review of Related Work: Prior work on optimizing image convolution on multicore CPUs and GPUs, including comparisons of CPU and GPU implementations, is reviewed in Section 2.4.

Paper Structure: This paper is organized as follows:

– Section 1: Introduction.
– Section 2: Preliminaries.
– Section 3: Experimental Setup.
– Section 4: Conclusion.

1.1 CPU Acceleration


OpenMP can significantly accelerate ray tracing computations on CPUs by lever-
aging the parallel processing capabilities of modern multicore processors. In ray
tracing, a substantial portion of the computational workload involves casting
rays and tracing their intersections with scene geometry, which can be highly
parallelizable. OpenMP provides directives and libraries that allow developers
to exploit this parallelism effectively.

1.2 GPU Acceleration


GPU acceleration plays a crucial role in achieving real-time ray tracing perfor-
mance, particularly in dynamic scenes where the computational demands are
high. By leveraging the parallel processing capabilities of modern GPUs, ray
tracing algorithms can be optimized for rapid intersection tests and shading
calculations. Techniques such as CUDA or OpenCL programming enable devel-
opers to harness the power of GPUs for accelerating ray tracing tasks, resulting
in faster rendering times and smoother interactive experiences.

1.3 Dynamic Scene Management


Managing dynamic scenes poses a unique challenge for real-time ray tracing sys-
tems. Traditional ray tracing approaches may struggle to handle objects that
change position, shape, or appearance over time. Dynamic scene management
techniques involve strategies for efficiently updating scene data structures, such
as BVHs, to account for moving objects and changing geometry. Adaptive algo-
rithms that dynamically adjust the granularity of BVH nodes or utilize spatial
partitioning schemes optimized for dynamic scenes can help maintain rendering
performance without sacrificing accuracy.

1.4 Temporal Coherence


Exploiting temporal coherence is essential for optimizing ray tracing performance
in dynamic scenes. Temporal coherence refers to the consistency of scene geom-
etry and shading properties across consecutive frames. By reusing information
from previous frames and minimizing redundant calculations, temporal coher-
ence techniques can reduce the computational overhead of ray tracing in dynamic
environments. Approaches such as motion prediction, frame interpolation, and
caching of intermediate results contribute to maintaining high frame rates and
visual consistency in real-time ray traced scenes.

1.5 Conclusion
In this report, we have outlined the relevant and essential approaches for addressing the challenges of implementing Gaussian blur in dynamic scenes, leveraging techniques such as bounding volume hierarchies, GPU acceleration, dynamic scene management, and temporal coherence.
2 Preliminaries
2.1 CPU and GPU
A multicore CPU is a single computing component with more than one independent core. OpenMP (Open Multi-Processing) and TBB (Threading Building Blocks) are widely used application programming interfaces (APIs) for making efficient use of multicore CPUs (Polesel et al., 2000). In this study, a general-purpose platform using the Python pymp tool is used to parallelize the algorithm. On the other hand, the GPU is a single instruction, multiple data (SIMD) architecture, which is suitable for applications where the same instruction runs in parallel on different data elements. In image convolution, image pixels are treated as separate data elements, which makes the GPU architecture well suited to parallelizing the application (Polesel et al., 2000). In this study, we use the CUDA platform with the GPU, since CUDA is the most popular platform used to increase GPU utilization. The main difference between the CPU and the GPU, as illustrated by Reddy et al. (2017), is the number of processing units: the CPU has fewer processing units with cache and control units, while the GPU has many more processing units, each with its own cache and control units. GPUs contain hundreds of cores, which enables much higher parallelism than CPUs.

2.2 NVIDIA, CUDA Architecture and Threads


The GPU follows the SIMD programming model. The GPU contains hundreds of processing cores, called Scalar Processors (SPs). A Streaming Multiprocessor (SM) is a group of eight SPs; together the SMs form the graphics card. SPs in the same SM execute the same instruction at the same time, so they execute in Single Instruction Multiple Thread (SIMT) fashion (Lad et al., 2012). Compute Unified Device Architecture (CUDA), developed by NVIDIA, is a parallel processing architecture which, with the help of the GPU, delivers a significant performance improvement. CUDA-enabled GPUs are widely used in many applications, such as image and video processing in chemistry and biology, fluid dynamics simulations, computerized tomography (CT), etc. (Bozkurt et al., 2015). CUDA is an extension of the C language for executing code on the GPU that creates parallelism without requiring the program architecture to be restructured for multithreading. It also supports memory scatter, bringing more flexibility to the GPU (Reddy et al., 2017). The CUDA API allows code to be executed by a large number of threads, where threads are grouped into blocks and blocks make up a grid. Blocks are serially assigned for execution on each SM (Lad et al., 2012).
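
To make the block and grid hierarchy concrete, the following sketch (not part of the original report; the kernel name fill_thread_ids and all sizes are illustrative assumptions) launches a minimal PyCUDA kernel in which every thread computes its global pixel coordinates from its block and thread indices:

```python
import numpy as np
import pycuda.autoinit                      # initializes the CUDA driver and creates a context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Each thread derives a global (x, y) pixel position from its block and thread indices.
mod = SourceModule("""
__global__ void fill_thread_ids(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column handled by this thread */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row handled by this thread    */
    if (x < width && y < height)
        out[y * width + x] = y * width + x;          /* store the global pixel index  */
}
""")
fill_thread_ids = mod.get_function("fill_thread_ids")

width, height = 64, 48
out = np.zeros((height, width), dtype=np.int32)

block = (16, 16, 1)                                  # 256 threads per block
grid = ((width + 15) // 16, (height + 15) // 16, 1)  # enough blocks to cover every pixel
fill_thread_ids(cuda.Out(out), np.int32(width), np.int32(height), block=block, grid=grid)
print(out[0, :8])                                    # first few indices computed on the GPU
```

A 16x16 block gives 256 threads per block, and the grid is rounded up so that every pixel is covered; the bounds check simply discards the excess threads.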

2.3 Gaussian Blur Filter


The Gaussian blur is a convolution technique used as a pre-processing stage in many computer vision algorithms for smoothing, blurring and eliminating noise in an image (Chauhan, 2018). Gaussian blur is a linear low-pass filter, where each pixel value is calculated using the Gaussian function (Novák et al., 2012). The 2-dimensional (2D) Gaussian function is the product of two 1-dimensional (1D) Gaussian functions, defined as shown in equation (1) (Novák et al., 2012):
G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \qquad (1)
where (x, y) are the pixel coordinates and σ is the standard deviation of the Gaussian distribution. A linear spatial filter works by moving the center of a filter mask from point to point; the output value at each pixel (x, y) is the sum of the products of the filter coefficients and the corresponding neighboring pixels covered by the mask (Putra et al., 2017). The Gaussian function produces a bell-shaped weighting curve, as shown in Figure 1, since each pixel's weight depends on its distance from the neighboring pixels. The filter kernel size is a factor that affects the performance and processing time of the convolution. In our study, we used odd kernel widths: 7x7, 13x13, 15x15 and 17x17.
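
As a concrete illustration of equation (1), the short Python sketch below (our own illustration rather than the report's code; the function name gaussian_kernel and the chosen sigma are assumptions) builds a normalized 2D Gaussian kernel of odd width such as 7x7, which can then be convolved with each color channel:

```python
import numpy as np

def gaussian_kernel(width: int, sigma: float) -> np.ndarray:
    """Build a normalized 2D Gaussian kernel of odd size `width` using equation (1)."""
    assert width % 2 == 1, "kernel width must be odd (e.g. 7, 13, 15, 17)"
    half = width // 2
    # Coordinate grids centred on the kernel middle.
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    # Normalize so the weights sum to 1 and the blur preserves overall brightness.
    return kernel / kernel.sum()

kernel = gaussian_kernel(7, sigma=1.5)
print(kernel.shape, kernel.sum())   # (7, 7) 1.0
```

Normalizing the kernel keeps the overall image brightness unchanged after the blur.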

Fig. 1: The 2D Gaussian function.

2.4 Related Work


Optimizing image convolution is one of the important topics in image processing and is being widely explored and developed. The effect of optimizing Gaussian blur by running the filter on multicore CPU systems, and its improvement over a single CPU, was explored by Novák et al. (2012). The effect of running the Gaussian blur filter using CUDA has also been explored. In previous studies, the focus was on getting the best performance from multicore CPUs or from the GPU compared to sequential code. Samet et al. (2015) presented a comparison between the speedup of real-time applications on the CPU and the GPU using the C++ language and Open Multi-Processing (OpenMP). Reddy et al. (2017) compared the performance of the CPU and the GPU for an image edge detection algorithm using the C language and CUDA on an NVIDIA GeForce GTX 970. In our study, we explore the performance improvement between two different parallel systems, a multicore CPU and a GPU, using the Python parallel libraries pymp and PyCUDA respectively. We use an Intel Xeon CPU and an NVIDIA Tesla P100 GPU.

3 Experimental Setup
In our experiment, we use Google Colaboratory, or "Colab", to run our code. Google Colaboratory is a free, online, cloud-based Jupyter notebook environment that provides access to GPU resources. To run our code on Google Colab, we used the Python programming language. The Python pymp library is used for the multiprocessor code; this package brings OpenMP-like functionality to Python, combining the good qualities of OpenMP, such as minimal code changes and high efficiency, with the Python philosophy of code clarity and ease of use. The Python PyCUDA library is used to call CUDA functions from our Python code; PyCUDA gives easy, Pythonic access to NVIDIA's CUDA parallel computation API.
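
As a minimal sketch of how pymp distributes work (our own illustration rather than the report's actual code; the image size, the per-row placeholder operation and the worker count are assumptions), an image-row loop can be parallelized with an OpenMP-like parallel region as follows:

```python
import numpy as np
import pymp

height, width = 512, 512
image = np.random.rand(height, width).astype(np.float32)

# Shared array visible to every worker, so all rows end up in one result buffer.
result = pymp.shared.array((height, width), dtype='float32')

with pymp.Parallel(2) as p:                # two workers, matching the two-thread CPU runs
    # p.range splits the row indices among the workers.
    for y in p.range(height):
        result[y, :] = image[y, :] * 0.5   # placeholder per-row work (e.g. one convolution row)
```

Because pymp forks worker processes, the output must live in a pymp.shared array so that all workers write into the same buffer.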

3.1 Architecture of the Used GPU


The NVIDIA Tesla P100 GPU accelerator is one of the most advanced data center accelerators, powered by the NVIDIA Pascal™ architecture and designed to boost throughput and save money for HPC and hyperscale data centers. The Tesla P100 for PCIe enables a single node to replace half a rack of commodity CPU nodes by delivering lightning-fast performance in a broad range of HPC applications (Nvidia Corporation, 2016). CUDA toolkit 10.1 is used.

3.2 Architecture of the Used CPU


Intel Xeon is a high-performance version of Intel desktop processors intended for use in servers and high-end workstations. The Xeon family spans multiple generations of microprocessor cores. Two CPUs are available, with two threads per core and one core per socket (http://www.cpu-world.com/CPUs/Xeon/).

3.3 Experimental Results


Three images are used in our experiment, with sizes of 256 x 256, 1920 x 1200 and 3840 x 2160 pixels. We ran our Gaussian filter algorithm on the images using different kernel sizes and calculated the average processing time over 10 runs. First, the image is split into its red, green and blue color channels. In the second step, each channel is convolved with the Gaussian filter. Using the PyCUDA library in the Python code, a CUDA C function is used to run the convolution on the GPU: the data is transferred from host to device and, after the convolution operation, is brought back to the host. Finally, the channels are merged to produce the blurred output image. The processing time measured is the time taken by the convolution function of the filter on the CPU using two threads and on the GPU, as shown in Table 1 and Table 2 respectively. The processing time and speedup are shown in Figures 2 and 3 respectively. Figures 4 and 5 show the original image and the blurred image of resolution 256 x 256 using a 7 x 7 filter kernel. Compared with the results of Bozkurt et al. (2015), our speedup on the GPU is more than three times higher than the one they report.
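
The host-side flow described above can be sketched as follows. This is a simplified illustration rather than the report's actual code: the kernel name gaussian_blur, the helper blur_channel, the clamped border handling and the stand-in kernel weights are all our own assumptions.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void gaussian_blur(const float *in, float *out, const float *kernel,
                              int width, int height, int ksize)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int half = ksize / 2;
    float acc = 0.0f;
    /* Weighted sum of the neighbourhood covered by the kernel (edges are clamped). */
    for (int ky = -half; ky <= half; ++ky)
        for (int kx = -half; kx <= half; ++kx) {
            int ix = min(max(x + kx, 0), width - 1);
            int iy = min(max(y + ky, 0), height - 1);
            acc += in[iy * width + ix] * kernel[(ky + half) * ksize + (kx + half)];
        }
    out[y * width + x] = acc;
}
""")
gaussian_blur = mod.get_function("gaussian_blur")

def blur_channel(channel, kernel):
    """Blur one color channel on the GPU; data moves host -> device -> host."""
    height, width = channel.shape
    ksize = kernel.shape[0]
    out = np.empty_like(channel)
    block = (16, 16, 1)
    grid = ((width + 15) // 16, (height + 15) // 16, 1)
    gaussian_blur(cuda.In(channel), cuda.Out(out), cuda.In(kernel.astype(np.float32)),
                  np.int32(width), np.int32(height), np.int32(ksize),
                  block=block, grid=grid)
    return out

# Example: split an RGB image into channels, blur each, then merge the results.
rgb = np.random.rand(256, 256, 3).astype(np.float32)
kernel = np.ones((7, 7), dtype=np.float32) / 49.0     # stand-in for a 7x7 Gaussian kernel
blurred = np.dstack([blur_channel(np.ascontiguousarray(rgb[:, :, c]), kernel)
                     for c in range(3)])
```

cuda.In and cuda.Out handle the host-to-device and device-to-host transfers around the kernel launch, mirroring the data movement described above.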

Table 1: CPU processing time.

Table 2: GPU processing time.

Fig. 2: Processing time.

Fig. 3: Speedup.

Fig. 4: Original image.

Fig. 5: Blurred image using a 7x7 kernel.

4 Conclusion
In this paper, a 2D convolution with the Gaussian blur filter was implemented and evaluated on two parallel systems: a multicore Intel Xeon CPU using the Python pymp library and an NVIDIA Tesla P100 GPU using PyCUDA, both running on the Google Colaboratory platform. Processing times were measured for images of 256 x 256, 1920 x 1200 and 3840 x 2160 resolution and for kernel sizes of 7x7, 13x13, 15x15 and 17x17. The results show a substantial speedup of the GPU implementation over the multicore CPU implementation, especially for large images, and a GPU speedup more than three times higher than that reported by Bozkurt et al. (2015). These findings confirm that parallel computing, and GPU acceleration in particular, is a viable solution for image processing applications that require high computational throughput.

5 References
– Andrea Polesel et al., 2000. Image Enhancement via Adaptive Unsharp Masking. IEEE Transactions on Image Processing, Vol. 9, No. 3.
– Ritesh Reddy et al., 2017. Digital Image Processing through Parallel Computing in Single-Core and Multi-Core Systems using MATLAB. IEEE International Conference on Recent Trends in Electronics, Information and Communication Technology (RTEICT), India.
– Jan Novák et al. GPU Computing: Image Convolution. Karlsruhe Institute of Technology.
– Shrenik Lad et al. Hybrid Multi-Core Algorithms for Regular Image Filtering Applications. International Institute of Information Technology, Hyderabad, India.
– Ferhat Bozkurt et al., 2015. Effective Gaussian Blurring Process on Graphics Processing Unit with CUDA. International Journal of Machine Learning and Computing, Vol. 5, No. 1, Singapore.
– Munesh Singh Chauhan, 2018. Optimizing Gaussian Blur Filter using CUDA Parallel Framework. Information Technology Department, College of Applied Sciences, Ibri, Sultanate of Oman.
– B. N. Manjunatha Reddy et al., 2017. Performance Analysis of GPU V/S CPU for Image Processing Applications. International Journal for Research in Applied Science and Engineering Technology (IJRASET), India.
– Ridho Dwisyah Putra et al., 2017. A Review of Image Enhancement Methods. International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 12, Number 23, pp. 13596-13603, India.
– Nvidia Corporation, 2016. NVIDIA® TESLA® P100 GPU Accelerator. NVIDIA data sheet.
