
Designing an Obstacle Detection and Alerting System for Visually Impaired People on Sidewalks

Sude Pehlivan and Mazlum Unay
Department of Biomedical Engineering
Izmir Kâtip Celebi University
Izmir, Turkey
sudepehlivan35@gmail.com, unaymazlum@gmail.com

Aydin Akan
Department of Biomedical Engineering
Izmir Kâtip Celebi University
Izmir, Turkey
aydin.akan@ikc.edu.tr

Abstract— Being visually impaired is one of the most challenging experiences in life, and many people face this situation. As computer vision and machine learning have grown as fields, designing an assistive system has become simpler for both researchers and engineers. Obstacle detection systems have two fundamental elements: a software part and a hardware part. In recent years, such systems have been reshaped around deep learning algorithms, open-source libraries, and programming platforms. Among the many deep learning algorithms and open-source libraries available, this project programs the system in Python and uses a Tensorflow model to run a classifier on a tiny computer called the Raspberry Pi. We used a model called ssdlite_mobilenet_v2_coco to detect 9 different objects that may be present on sidewalks. Since hearing is the most important sense for visually impaired people and must not be blocked, we propose a combination of auditory and tactile alerts. A speech synthesizer called eSpeak alerts the user through headphones by reading the name of the detected object. At the same time, 3 vibration sensors are placed at 3 positions: right, middle, and left. When an obstacle is detected inside one of the pre-defined bounding boxes, the corresponding vibration sensor is activated together with the name of the object.

Keywords — Tensorflow; Raspberry Pi; obstacle detection

I. INTRODUCTION

According to the tenth revision of the World Health Organization (WHO) classification, low vision is defined as visual acuity between 6/18 and 3/60, or an angle of vision of less than 20 degrees. Blindness is the situation of having a visual acuity of less than 3/60, or an angle of vision of less than 10 degrees. It is important to understand that visual impairment encompasses both blindness and low vision [1]. Recently, as computer vision and bioengineering have become more capable, researchers from both fields have come closer to finding a solution for visual impairment. A considerable number of studies, both cell-based and electronic-based, aim to help visually impaired people carry out daily activities independently. Among electronic-based approaches, obstacle avoidance systems are one of the most studied areas; they are based on recognizing an obstacle and alerting the user [2].

An image is, in its most basic definition, a two-dimensional matrix in which each coordinate (r, c) holds a piece of specific information called intensity [3]. To represent this physical context on a computer screen, it has to be converted to digital data. This digital data can be used to extract features of specific objects in order to detect and classify them, using feature descriptors such as the Histogram of Oriented Gradients (HOG), Speeded Up Robust Features (SURF), and the Scale Invariant Feature Transform (SIFT). According to a survey published in 2016, there are 20 obstacle detection aids based on computer vision, and most of them use these basic feature extractors [4].
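For illustration, the following is a minimal sketch of extracting one of these descriptors, HOG, with OpenCV; the image path and window size are illustrative assumptions, not part of the original system.

    # Minimal sketch: extracting HOG features with OpenCV (illustrative input).
    import cv2

    img = cv2.imread("sidewalk.jpg", cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (64, 128))   # the default HOG window size
    hog = cv2.HOGDescriptor()          # default descriptor parameters
    features = hog.compute(img)        # 3780 values for a 64x128 window
    print(features.size)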
On the other hand, a more efficient way of classifying objects is Deep Learning (DL), a subset of Machine Learning (ML). The most common DL algorithm is the Convolutional Neural Network (CNN), which uses multiple layers of neural units. Three layer types are typical: convolutional layers, which create feature maps; pooling layers, which reduce the dimensions of the feature maps; and fully connected layers, which transform the feature maps into a one-dimensional output vector. The output of each layer is fed as input to the next layer and passed through an activation function [5].
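The three layer types above can be sketched in a few lines of tf.keras; the layer sizes and input shape here are illustrative assumptions, not the architecture used in this project.

    # Minimal sketch of the three CNN layer types (illustrative sizes).
    import tensorflow as tf

    model = tf.keras.Sequential([
        # convolutional layer: produces feature maps
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu",
                               input_shape=(128, 128, 3)),
        # pooling layer: reduces the spatial dimensions
        tf.keras.layers.MaxPooling2D((2, 2)),
        # fully connected layers: map flattened feature maps to class scores
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(9, activation="softmax"),  # e.g. 9 sidewalk classes
    ])
    model.summary()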
defined as low vision. On the other hand, blindness is the opposite direction of the gradient of the cost function [5].
situation of having a visual acuity less than 3/60 or having an In order to make tasks easier, there are DL libraries that
angle of vision less than 10 degrees. It is important to consist of several special functions such as Torch, Theano and
understand that visual impairment encompasses both blindness Tensorflow [6].
and low vision [1]. Recently, since computer vision and
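The weight update described above can be made concrete with a toy example; the linear model and data below are illustrative, not from the paper.

    # Minimal sketch of gradient descent: step the weights against the
    # gradient of a mean squared error cost (toy linear model).
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = 2.0 * x + 1.0                  # labeled training data
    w, b, lr = 0.0, 0.0, 0.05          # weights and learning rate

    for _ in range(500):
        err = (w * x + b) - y
        grad_w = 2 * np.mean(err * x)  # gradient of the cost w.r.t. w
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w               # move opposite to the gradient
        b -= lr * grad_b

    print(round(w, 2), round(b, 2))    # approaches 2.0 and 1.0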
To make such tasks easier, there are DL libraries that provide many specialized functions, such as Torch, Theano, and Tensorflow [6].

II. MATERIALS

A. Raspberry Pi

The Raspberry Pi is an economical and tiny computer that can perform anything from a simple task such as blinking an LED to a complicated one such as image processing. In this project, a Raspberry Pi 3B+ is used. It has a 32-bit Advanced RISC Machines (ARM) processor, a 3.5 mm analog audio jack, a 5 V micro Universal Serial Bus (USB) port to power the board, a Camera Serial Interface (CSI) connector for the camera module, General Purpose Input Output (GPIO) pins, and an SD card slot for storage, since the Raspberry Pi has no on-board storage [8].



B. Pi Camera

A USB camera or a Pi camera can be used to take images or record video. In this project, a standard 5 Megapixel (Mpix) camera module was used, capable of 1080p at 30 Frames per Second (FPS) and 720p at 60 FPS [9].
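One common way to drive this module from Python is the picamera package; the sketch below is illustrative (the resolution matches the module's 1080p mode, and the output path is an assumption), not the paper's own capture code.

    # Minimal capture sketch with the picamera package.
    from picamera import PiCamera
    from time import sleep

    camera = PiCamera()
    camera.resolution = (1920, 1080)
    camera.framerate = 30
    camera.start_preview()
    sleep(2)                            # let the exposure settle
    camera.capture("/home/pi/frame.jpg")
    camera.stop_preview()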
C. Tactile Output and Audio Output

In this project, 3 standard vibration motors were used to alert the user through the tactile sense. Each vibrator was assigned to a certain range of locations: left, right, or middle.

To announce the name of the detected class, a speech synthesizer called eSpeak was used. It performs formant synthesis and uses these formants to create a speaking voice, which is unfortunately robotic rather than natural [10].
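A sketch of how the eSpeak engine can be driven from Python, assuming the espeak command-line tool is installed on the Pi; the speed flag is an illustrative choice.

    # Minimal sketch: speaking a detected class name through eSpeak.
    import subprocess

    def speak(text):
        # -s sets the speed in words per minute; the blocking call
        # keeps successive alerts in order
        subprocess.call(["espeak", "-s", "150", text])

    speak("person")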

D. Python and OpenCV

Python is a programming language with many open-source libraries, and it is supported by the Raspberry Pi.

To perform image and video processing, Python needs OpenCV, an open-source library that enables tasks such as face detection. Its most important dependency is NumPy, which handles numeric tasks and array-related operations in scientific computing. Although they are not dependencies, the SciPy and Matplotlib packages must also be installed to carry out specific tasks such as signal processing and plotting [11].
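As a brief illustration of such a task, the face detection mentioned above can be done with one of OpenCV's bundled Haar cascades; the input image is an illustrative assumption.

    # Minimal sketch: face detection with a bundled OpenCV Haar cascade.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")                 # illustrative input
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(len(faces), "face(s) found")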
E. Tensorflow

Google developed Tensorflow as a library for machine learning and deep learning applications. The system uses tensors, which are multidimensional arrays, as its basic elements and performs its tasks on them [12].

Tensorflow uses dataflow graphs to define ML algorithms. A Tensorflow model has three basic elements [6][13]:

• Operations are the nodes of a dataflow graph; they perform tasks on tensors.
• Tensors are multidimensional arrays represented as the edges of a dataflow graph. Data is transferred from one node to another through the edges. An edge can be a normal edge that carries tensors between nodes, or a special edge that controls the relationship between two nodes.
• Sessions are the environments in which dataflow graphs are executed. To run the code, a session must be created with tf.Session and then executed with the sess.run command.
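These three elements fit in a few lines of the TensorFlow 1.x API, which the tf.Session and sess.run workflow above corresponds to; the values are illustrative.

    # Minimal sketch: operations, tensors and a session in TensorFlow 1.x.
    import tensorflow as tf

    a = tf.constant([1.0, 2.0])    # tensors: multidimensional arrays
    b = tf.constant([3.0, 4.0])
    c = tf.add(a, b)               # operation: a node in the dataflow graph

    with tf.Session() as sess:     # session: environment that runs the graph
        print(sess.run(c))         # [4. 6.]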
F. A Tensorflow Model: ssdlite_mobilenet_v2_coco

In this project, the ssdlite_mobilenet_v2_coco_2018_05_09 model was used. It is a pre-trained Tensorflow model that combines the Single Shot Multibox Detector (SSD) as the detection model with the MobileNet Deep Neural Network (DNN) as the feature extractor, trained on the Common Objects in Context (COCO) dataset [14].
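A hedged sketch of loading this pre-trained model, following the usual TensorFlow 1.x Object Detection API pattern; the .pb file name reflects the model's standard download layout and is an assumption, not a detail stated in the paper.

    # Sketch: loading the frozen inference graph of the pre-trained model.
    import tensorflow as tf

    PATH = "ssdlite_mobilenet_v2_coco_2018_05_09/frozen_inference_graph.pb"

    detection_graph = tf.Graph()
    with detection_graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(PATH, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")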
SSD creates bounding boxes and scores for detected objects using convolutional networks. At the end of the base network, multi-scale convolutional feature layers are appended. Training of the system is performed against ground truth boxes [15].

Depthwise separable convolution is the basic idea behind MobileNet. A standard convolution is the sum of products of two matrices as one slides over the other, across all input channels at once. In depthwise convolution, by contrast, each kernel is convolved with only one channel of the N-channel input; afterward, a pointwise convolution with kernels of size 1x1xN combines the channels [16].
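The factorization described above can be sketched with tf.keras layers; the input shape and filter counts are illustrative assumptions.

    # Sketch: standard convolution versus depthwise + pointwise convolution.
    import tensorflow as tf

    inputs = tf.keras.Input(shape=(32, 32, 8))   # N = 8 input channels

    standard = tf.keras.layers.Conv2D(16, (3, 3))(inputs)
    # one 3x3 kernel per input channel:
    depthwise = tf.keras.layers.DepthwiseConv2D((3, 3))(inputs)
    # 1x1xN kernels combining the channels:
    pointwise = tf.keras.layers.Conv2D(16, (1, 1))(depthwise)

    # standard: 3*3*8*16 weights; depthwise separable: 3*3*8 + 1*1*8*16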
The model covers 80 classes, and in this project 9 of them were used to perform object detection on a sidewalk; see Table 1.

Table 1. Names and ID Numbers of the Pre-defined Objects

    Name of the Pre-defined Object     ID Number
    Person                             1
    Bicycle                            2
    Car                                3
    Traffic Light                      10
    Fire Hydrant                       11
    Stop Sign                          13
    Bench                              15
    Cat                                17
    Dog                                18
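Table 1 translates directly into a lookup table from COCO class ID to name; the sketch below shows how detections could be filtered to these classes.

    # The pre-defined classes of Table 1 as a COCO ID -> name mapping.
    SIDEWALK_CLASSES = {
        1: "person", 2: "bicycle", 3: "car", 10: "traffic light",
        11: "fire hydrant", 13: "stop sign", 15: "bench",
        17: "cat", 18: "dog",
    }

    def is_relevant(class_id):
        return class_id in SIDEWALK_CLASSES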
III. METHOD

First, the Raspbian operating system and Tensorflow were installed along with several packages and tools, and the ssdlite_mobilenet model was downloaded.

In the hardware setup, a keyboard and a mouse were connected to the USB ports of the Raspberry Pi, and the Pi camera was connected to the CSI port. Three vibration sensors were connected to GPIO pins 4, 16, and 18, along with speakers connected to the analog audio output and to a USB port for power. Finally, the Raspberry Pi was connected to power via micro USB. See Figure 1.

Figure 1. Hardware Setup
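A hedged sketch of this wiring in Python with the RPi.GPIO package; BCM pin numbering and the direction-to-pin assignment are assumptions, since the paper names the pins but not the mapping.

    # Sketch: vibration motors on GPIO 4, 16 and 18 (mapping assumed).
    import RPi.GPIO as GPIO
    import time

    PINS = {"left": 4, "middle": 16, "right": 18}

    GPIO.setmode(GPIO.BCM)
    for pin in PINS.values():
        GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)

    def vibrate(direction, seconds=0.5):
        GPIO.output(PINS[direction], GPIO.HIGH)  # pulse the matching motor
        time.sleep(seconds)
        GPIO.output(PINS[direction], GPIO.LOW)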
In the software part, the necessary packages were imported and the working directory was appended to the Python path. After the GPIO settings and camera arrangements, the label map was loaded to obtain the categories and their indexes. To run the computational graph, a session was created in which images were defined as the input tensor, and boxes, scores, classes, and the number of detected objects were defined as the output tensors. Three regions were drawn on the screen as left, right, and middle, and detection was performed by running the session. Boxes, scores, and detected classes were displayed on the screen whenever the score of a detected object was above a 0.6 threshold. When a detected object had a pre-defined index, the center of the object was calculated, and if the center fell inside one of the pre-defined regions, the related GPIO pin was activated to give tactile output. Detected objects were printed on the screen, and their names were read by the eSpeak engine as auditory output through the speakers.
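A condensed sketch of this loop, stitching together the pieces sketched earlier (the loaded detection_graph, SIDEWALK_CLASSES, and the vibrate() and speak() helpers). The tensor names follow the standard Object Detection API convention, the region thresholds are illustrative, and get_frame() is a hypothetical camera-read helper, not a function from the paper.

    # Sketch of the detection loop: run the session, threshold at 0.6,
    # locate the object's center, and trigger tactile and audio alerts.
    import numpy as np
    import tensorflow as tf

    with detection_graph.as_default(), tf.Session() as sess:
        image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
        boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
        scores = detection_graph.get_tensor_by_name("detection_scores:0")
        classes = detection_graph.get_tensor_by_name("detection_classes:0")

        while True:
            frame = get_frame()                   # hypothetical camera read
            out = sess.run([boxes, scores, classes],
                           feed_dict={image_tensor: np.expand_dims(frame, 0)})
            for box, score, cls in zip(out[0][0], out[1][0], out[2][0]):
                if score > 0.6 and int(cls) in SIDEWALK_CLASSES:
                    cx = (box[1] + box[3]) / 2    # normalized x center
                    region = ("left" if cx < 0.33 else
                              "right" if cx > 0.66 else "middle")
                    vibrate(region)
                    speak(SIDEWALK_CLASSES[int(cls)])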
IV. SUMMARY AND CONCLUSIONS

A. Summary

This project aimed to design an obstacle detection and alerting system for visually impaired people on sidewalks, in order to provide them guidance. In the hardware setup, a Raspberry Pi was used along with three vibration sensors to alert the user about the direction of an obstacle, and a speaker to announce the name of the object. Python was chosen as the programming language because of its speed, its compatibility with the Raspberry Pi, and its package availability. The Tensorflow package was used to perform object detection and classification with a pre-trained model.
B. Results

In this study, the pre-defined objects were detected, and the results were presented both visually and audibly in order to check the correctness of the auditory findings together with the tactile output. The auditory results mostly matched the visual results, with a delay on the screen due to the FPS limitations of the camera. Both auditory and tactile outputs were used because a visually impaired person, when asked whether he preferred an auditory or a tactile output, requested a combination of the two.

When an object was recognized, its center was calculated and represented as a pink circle. Pre-drawn boxes in red, yellow, and blue on the screen represented the direction of the object as left, middle, and right. When the center of an object fell inside one of these boxes, the related vibration sensor was activated. Each class had a different bounding box color, and the name and detection score of each object were written on top of its bounding box. Since the project was designed in a laboratory, the class "chair" was used to demonstrate the results. See Figure 2.

Figure 2. Screenshot of the Results

Since a pre-trained model was used, Tensorboard was not available to visualize the accuracy of the model. Instead, the general performance of comparable models is given in Figure 3. SSD w/MobileNet has the lowest mAP value; however, it also has a lower processing time than the others. SSDLite w/MobileNet was used in this project because it offers the same mAP as SSD w/MobileNet with an even lower processing time [17][18].

Figure 3. Mean average precisions (mAPs) and speeds of different model and feature extractor combinations [18]
C. Limitations and Future Work

In the future, the system can be improved by addressing some current limitations of the hardware setup, such as camera constraints, Graphics Processing Unit (GPU) speed, and the choice of the board. In this project, the Pi camera was a standard 5 Mpix module; it could be replaced with a newer 8 Mpix model to increase the resolution, which can affect the precision of the detection. Although the FPS of the camera was set to 1, the FPS value written on the screen fluctuated between 0.2 and 2, which affected the detection precision and the response time of the system to movement.

The model used in this project was pre-trained on the COCO database and combines two networks, SSD and MobileNet. Due to the GPU limitations of the Raspberry Pi, the lite variant of the model was chosen, which is compatible with the Central Processing Unit (CPU).

In future work, training this model on different and larger databases may increase its precision and accuracy. A different Tensorflow model could also be chosen to perform the object detection. Furthermore, if a board with a faster GPU is chosen, it will become possible to train the model directly, eliminating the need for a pre-trained model. Once the GPU limitation is removed, it will no longer be necessary to use only Tensorflow models; different ones such as You Only Look Once (YOLO), Tesla, or RetinaNet could be used to improve the detection rate.

This study can serve as the basis for a prototype with a suitable design that provides comfort and an improved quality of life for visually impaired people around the world. The design can be placed on a belt, with the three vibration sensors located at the right, left, and middle. Instead of speakers, headphones can be used, both to give users a better understanding of the spoken words and to eliminate noise pollution for the surrounding environment. The camera module can be placed and stabilized on the chest, with the Raspberry Pi on the belt. Further, the design can be improved by locating the camera module in glasses and the vibration sensors on gloves. It would also be useful to add a Global System for Mobile Communications (GSM) module to inform a family member in case of an accident, and a Global Positioning System (GPS) to navigate the user.

In conclusion, this project was carried out to help visually impaired people feel more self-confident and live their lives fully, without concern. Furthermore, it can be a precursor for future projects with boards that have higher GPU speed, cameras with higher resolution and frame rate, larger databases, and more comfortable, improved designs.

REFERENCES

[1] World Health Organization. (2007). Global Initiative for the Elimination of Avoidable Blindness: Action Plan 2006-2011.
[2] Dakopoulos, D., & Bourbakis, N. G. (2010). Wearable obstacle avoidance electronic travel aids for blind: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1), 25-35.
[3] Gonzalez, R. C., & Wintz, P. (1977). Digital Image Processing. Reading, MA: Addison-Wesley.
[4] Kaur, P., & Kaur, S. Aids for visually impaired persons for obstacle acknowledgment: A study and proposed new framework. International Journal of Computer Applications, 975, 8887.
[5] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
[6] Goldsborough, P. (2016). A tour of TensorFlow. arXiv preprint arXiv:1610.01178.
[7] Bahrampour, S., Ramakrishnan, N., Schott, L., & Shah, M. (2016). Comparative study of Caffe, Neon, Theano, and Torch for deep learning.
[8] Richardson, M., & Wallace, S. (2012). Getting Started with Raspberry Pi. O'Reilly Media.
[9] Nguyen, H. Q., Loan, T. T. K., Mao, B. D., & Huh, E. N. (2015, July). Low cost real-time system monitoring using Raspberry Pi. In 2015 Seventh International Conference on Ubiquitous and Future Networks (pp. 857-859). IEEE.
[10] Kaur, R., & Sharma, D. (2016). An improved system for converting text into speech for Punjabi language using eSpeak. International Research Journal of Engineering and Technology (IRJET), 3(04), 500-504.
[11] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., ... & Yu, T. (2014). scikit-image: Image processing in Python. PeerJ, 2, e453.
[12] Zaccone, G. (2016). Getting Started with TensorFlow. Packt Publishing Ltd.
[13] Girija, S. S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Software available from tensorflow.org.
[14] Jokela, J. (2018). Person counter using real-time object detection and a small neural network.
[15] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.
[16] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[17] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., ... & Murphy, K. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310-7311).
[18] Fleury, D., & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy.
