Third Eye Smart Aid For Visually Impaired

Third Eye Smart Aid for Visually Impaired
Sahana V (  sahana26@gmail.com )
JSS Academy of Technical Education, Bengaluru
Shashidhar R
JSS Science and Technology University
Bindu S. N.
Chandana A. N.
Nishrutha C. G
Research Article
Keywords: Assistive technology, Mobile Application, Paper money, Visually impaired, Speech-to-Text
conversion, Text-to-Speech conversion
Posted Date: June 28th, 2023
DOI: https://doi.org/10.21203/rs.3.rs-3086323/v1
License:   This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License
Page 1/25
Abstract
Smartphones are less likely to be considered as assistive technology for visual impairment among a
large majority of health care providers, excluding vision rehabilitation professionals, and the public who
are not familiar with accessible features and apps. They face significant challenges in terms of
accessibility and inclusion in the smartphone environment. The world of money has completely changed,
but some things remain the same. The quickly expanding utilization of CREDIT/ DEBIT cards and all the
other different forms of money transfer mechanisms, including electronic payment, has certainly made
its mark in today’s world. However, paper money is still broadly used in today’s world for ordinary
exchanges because of its convenience. Due to similarity of the texture of the paper, and the different
sizes that exist between different categories, visually impaired people face problems with monetary
transactions because of their inability to recognize paper currency. The inability to read printed
documents is a disadvantage for blind people. Speech based applications can help improve support for
visually impaired people. The proposed system aims to provide assistive technology for currency
recognition, text recognition to help visually impaired people.
1 Introduction
According to the estimates from the World Health Organization (WHO), around 39 million people across
the globe are suffering from complete blindness and around 246 million have low vision i.e., severe or
moderate visual impairment [1]. Among these around 90 percent of the world’s visually impaired people
live in developing countries [2]. These people face a lot of challenges and troubles due to inaccessible
infrastructure and other social challenges. They face difficulty in using smartphones to perform basic
functionalities like messaging or calling, navigation problems, recognizing different denominations which
further lead to inaccessibility in getting involved in day-to-day chores. Therefore, visually impaired people
thus need an assistive tool to help them cope with these difficulties and simplify them to an extent.
The identification and validation of cash denominations is one of the major problems for people who are
blind. Due to the similarities in sizes and colors of the contemporary Indian banknotes, this challenge is
particularly difficult to solve [3]. Numerous automatic and semi-automated currency identification
solutions are suggested in the literature to help visually impaired recognize different currency
denominations. People who are blind or visually challenged rely on voice-based smartphone features like
Talkback, Siri, and Google Assistant to complete routine activities. However, the deployment and
adaptability of mobile-based solutions with specialized machine/deep learning models for recognition
are difficult. The existing models are comparatively bulkier, and integration with low-end mobile
smartphones is impractical, thus causing deploy ability issues.
Additionally, those who are legally blind or have severe low vision problems are unable to adjust the
camera view or other settings to achieve the best results. A reliable and consistent recognition system is
therefore required. Deep learning-based networks have gotten more sophisticated recently in an effort to
boost performance. The complexity of the networks increases exponentially as the depth increases since
Page 2/25
more parameters are needed. For devices with limited resources, such as mobile phones and edge-based
devices, the bulkier models are inappropriate. The suggested model uses MobileNetV2 pretrained with the
ImageNet database as the foundation model. It is a lightweight neural network. To aid an effective
training and evaluation of the proposed model, we have gathered a large-scale Indian Currency Dataset,
ICD. The ICD dataset is a combination of the existing datasets [4], [5], [6], [7], [8]. It consists of images with
folded and partial note images, with varied illumination and background conditions.
Text recognition is another issue that the visually impaired experience [9]. Many initiatives to help blind
persons read have been developed in this era of technological advancement. Through an optical
mechanism, optical character recognition (OCR) technology automatically recognizes the character. OCR
technology turns printed text into machine-encoded text from scanned documents, newspapers, and
magazines [10]. In the proposed system, a technology that will let blind people read in real-time is
developed. Text to speech (TTS) synthesizer and OCR technology have been employed. Artificial human
voice is produced using OCR-based speech synthesis.
We have built an android application named “Third Eye: Blind app”, which provides assistance in currency
and text recognition customized for visually impaired people. Our android application utilizes the
compressed trained model for real-time recognition of the underlying currency denomination and OCR
technology for text recognition.
As of today, many people suffer from visual impairment. In modern society, visually impaired people need
helpful tools for them to be able to perform routine tasks such as operating digital devices. In today’s
technologically advanced world, machine-learning algorithms along with a proper deployment, the
platform has become the go-to solution for almost everything. The proposed system is a low-cost and
easy-to-use application to help visually impaired people.
The contributions of this work are:
Diverse Dataset. One of the largest (around 10k images) and most diversified dataset of Indian
Paper Currency images, representing real scenarios and cases.
Quantitative Analysis. A thorough quantitative analysis of the proposed network on multiple publicly
available datasets has been performed.
Qualitative Analysis. A qualitative analysis has been performed to investigate the transparency and
intuition behind the proposed network predictions.
Extensive Experiments. Extensive experiments have been executed on two different models to
demonstrate the predictions of Indian Currency denominations.
OCR for documents and non-documents. OCR-based text recognition for documents such as books,
newspapers to non-documents such as image documents, grocery items, packaged foods wrappers
etc.
Page 3/25
This paper is organized as follows. Proposed methods and experiments are discussed in section 2. In
section 3, results are discussed along with our proposed android application “Third Eye”. Conclusion and
Future scope are discussed in Section 4.
2 Methods
This section describes the proposed system architecture and experimental setup for currency recognition
and text recognition, and its components in detail.
2.1 Proposed Architecture for Currency Recognition

The proposed deep neural network architecture formulates the problem of currency denomination
recognition as a multi-class classification problem. Given an image of Indian paper currency as input, our
goal is to robustly recognize the denomination even with the partial or folded view.
Transfer learning
Transfer learning is a machine learning technique that learns to complete a new task by using the
knowledge from a model that has previously been used to do other tasks [11]. With this approach, a basic
model that has already been trained on a sizable dataset is used to carry out tasks that need the
utilization of different datasets. Transfer learning is popular because it enables accurate categorization
to be performed with a minimal dataset [12]. This is due to the fact that it is highly challenging for a deep
learning model that was trained entirely from scratch using a tiny dataset to achieve high accuracy since
the model does not receive sufficient information about the fluctuations in the data [13]. As a result, it is
unable to learn the crucial data points [14]. In transfer learning, the pre-trained model acquires the
knowledge from the extensive dataset, which is represented as the weights in the network. These weights
are then applied to a separate network with a different dataset and task. As a result, we "transfer" the first
network's learned features to the second network rather than training the second network from scratch,
typically using a small dataset. The initial layers of a deep convolutional network are typically employed
in computer vision to learn the basic properties of an image, such as its lines and shapes. In its final
layers, the network learns about the unique attributes required for a given task, such as picture
categorization. As a result, in transfer learning, the base network or model's weights are typically kept
constant (frozen) [15]. The final layer of the network, which is often a fully connected layer, is then where
the learning process is completed so that the network can learn the appropriate features to categorize the
new dataset. In order to further enhance the network's performance, numerous studies also recommend
building additional layers on top of the basic model, which we referred to as the head model.
MobileNetV2
The proposed method utilizes MobileNetV2 as the base model [16]. MobileNetV2 was developed from
MobileNetV1 [17] by adding inverted residual with linear bottleneck modules as shown in Fig. 1.
MobileNet architecture is based on depthwise separable convolution.
Page 4/25
The typical 2D convolution process all of the input channels directly to produce one output channel by
convolving in the depth dimension (channel) as well. With depthwise convolution, the input image and
the filter are divided into several channels, and each input channel is then convolved with the associated
filter channel. These output channels are then piled back up after the filtered output channels have been
created. In separable depthwise convolution, the stacked output channels are then combined into one
channel by filtering them using a 1×1 convolution, also known as pointwise convolution. The depthwise
separable convolution yields the same results as the conventional convolution, but it is more effective
because it uses less parameters. MobileNetV1, which generates output with a size of 7×7×1280 pixels,
comprises 28 convolutional layers if the depthwise and pointwise convolution are counted as
independent layers. Both MobileNetV1 and MobileNetV2 accept photos as input that are 224x224x3
pixels in size. As a result, the dataset's input photos are shrunk and cropped to 224×224 pixels. After the
initial convolution layer with 32 filters, MobileNetV2 adds 19 inverted residual bottleneck layers, followed
by a pointwise convolution that results in output with a size of 7×7×1280 pixels. Residual block uses a
skip-connection to join the beginning and end of a convolutional block in order to transmit data to the
deeper layer of the network. The convolutional block's beginning and finish often have more channels
than the layers in between in a normal residual block. In contrast, the connected layers in the inverted
residual block used in MobileNetV2 have fewer channels than the layers in between, which results in
significantly fewer parameters than in the conventional residual block.
Figure 2 illustrates how MobileNetV1 and MobileNetV2 layers differ from one another. On MobileNetV2,
numerous convolutional blocks have a skip-connection, however in MobileNetV1, the skip-connection is
absent [18].
2.2 Proposed Method for Text Recognition

The proposed techniques involved in text recognition are described below. Given an image as input, our
goal is to recognize the text present within the image.
Optical Character Recognition (OCR)
A process called OCR converts printed texts into digital image files [19]. OCR can be used in internet
connected mobile device applications that extract text captured using the device's camera. OCR API is
used to extract the text from the image file captured and provided by the device. The OCR API returns the
extracted text, along with information about the location of the detected text in the original image back to
the device app for further processing such as text-to-speech or display [20]. Mobile Vision Text API
provides an easy way to integrate OCR on almost all Android devices [21].
Mobile Text Vision API
The ML Kit Text Recognition API can recognize text in any Latin-based character set.
Text structure
Page 5/25
Text is divided into blocks, lines, elements, and symbols by the text recognizer.
A block is a continuous group of text lines, such as a column or paragraph.

A line is a group of words that are all on the same axis.
In most Latin languages, an element is a contiguous group of alphanumeric characters on the same
axis (a "word"), but in other languages it is a word.
In most Latin languages, a symbol is a single alphanumeric character on a single axis, although in
others it is a character.
Examples are shown in descending order graphically below in Fig. 3. A text block is the first highlighted
block in cyan. Text lines make up the second group of highlighted blocks in blue. Words are the third
group of highlighted blocks, which are dark blue.
2.3 Experimental Setup for Currency Recognition

This section describes the experimental setup, the dataset and comparative approaches used for
evaluation for Indian currency recognition.
Dataset
A detailed description of the five Indian paper currency datasets used in the evaluation is included here,
along with details about how the proposed network will be implemented. Below are brief descriptions of
each dataset:
1. IEEE Dataset: This image dataset [4] consists of 2900 images (1900 images of Indian banknotes and
1000 images of Thai banknotes). The dataset consists of 10 classes namely 10 New, 10 Old, 20, 50 New,
50 Old, 100 New, 100 Old, 200, 500, 2000 of Indian banknotes. The dataset has been split according to
the 80:10:10 ratio for training, validation and testing sets and divided them into 1763 images for training,
196 for validation, and 185 for testing as the dataset is smaller.
2. Datasets from Kaggle: Dataset [5] consists of 995 images overall, split over training and validation
categories. The training directory consists of 804 files and the validation directory consists of 191 files.
Dataset [6] consists of 4002 files split over training, validation, and test categories. The training directory
consists of 3566 files, validation directory consists of 345 files and the testing directory consists of 91
files. Both the datasets are constructed over 8 classes: 10, 20, 50, 100, 200, 500, 2000 and Background.
3. Other Datasets: We have also performed some preliminary analysis on other publicly available
datasets [7], and [8]. The split for training, validation and test sets is 60:20:20.
After certain dataset processing such as removal of non-existing notes, corrupt images, and misclassified
images, we were finally able to obtain 10108 training images and 2192 validation images distributed over
8 categories: 10, 20, 50, 100, 200, 500, 2000 and Background.
Page 6/25
Data Augmentation
We have used augmentation techniques such as random rotation up to 45 degrees (clockwise) and
random zoom up to 0.1. The images were augmented with a 10% shift in width and height. Additionally, a
sheer augmentation with 20-degree counter-clockwise direction and horizontal flip have been used in the
data augmentation process.
Evaluation Metrics
The following metrics are used to evaluate the given multi-class classification problem: accuracy, average
accuracy, and weighted average accuracy. The percentage of accurate predictions made for each class is
known as accuracy. The average of class-wise forecasts is average accuracy. The imbalanced
distribution of images across all currency denomination classes is countered by the weighted average
accuracy, which is calculated by counting the number of images in each category and then averaging
them.
Comparative Approaches
This section briefly discusses the approach used for the comparative analysis and evaluation. We
compared the proposed network with EfficientNetB5. Slight variations in accuracy were obtained by
varying the training hyperparameters. So, for fair assessment, we have trained the models with uniform
benchmark configurations.
Training
We have examined several publicly accessible datasets of Indian paper currency using the proposed
MobileNetV2 model and EfficientNetB5 model in order to thoroughly analyze the proposed framework.
The success across this diverse range of datasets demonstrates the model's actual generalizability. We
have presented the quantitative analysis of MobileNetV2's performance on the proposed ICD, along with
the datasets we combined.
Both the models are pre-trained on ImageNet [22]. We trained the models using consistent benchmark
setups to ensure fair assessment. The source image is scaled down to 224 × 224. Both models are
trained using 16-batch sizes across 20 epochs. Categorical cross-entropy is used as the loss function. All
models appear to perform better overall when the learning rate is set to 0.001. In the models under
consideration, we experimented with stochastic gradient descent (SGD) and adam as optimizers. But we
discovered that the adam optimizer yields superior outcomes. The values for the hyperparameters depth
multiplier, alpha, and dropout are 1, 1, and 0.001, respectively. The training and testing are carried out
using Nvidia Tesla T4 GPUs.
The categorical cross entropy is given by the Eq. (1)
i=n
LC E = −∑ ti loglogp fornclasses
i,
i=1
Page 7/25
(1)
Where ti is the truth label and pi is the Softmax probability for the ith class.
Adam optimizer shown in Eq. (3) involves a combination of two gradient descent methodologies.
w t+1 = w t − αmt
(2)
where,
δL
mt = βmt−1 + (1 − β) [ ]
δw t
(3)
mt = aggregate of gradients at time t [current] (initially, mt = 0)
mt−1 = aggregate of gradients at time t-1 [previous]
Wt = weights at time t
Wt+1 = weights at time t + 1
αt = learning rate at time t
∂L = derivative of Loss Function
∂Wt = derivative of weights at time t
β = Moving average parameter (const, 0.9)
1. IEEE Dataset: The following classes 10New, 10Old, 20, 50New, 50Old, 100New, 100Old, 200, 500, 2000
of the dataset were clubbed and categorized under the following classes: 10, 20, 50, 100, 200, 500, and
2000. The accuracy achieved was 96.72% with EfficientNetB5 and 83.34% with MobileNetV2. Figure 4
shows the accuracy graphs plotted on IEEE dataset.
2. Datasets from Kaggle: Datasets [5] and [6] both are distinguished into 8 categories: 10, 20, 50, 100,
200, 500, 2000 and Background. MobileNetV2 scores the average accuracy of 74.75% while the
EfficientNetB5 model scores 91.04%. Figure 5 shows the accuracy graphs plotted on the combined
Kaggle datasets.
3. Other Datasets: The average accuracy of dataset [7] with SGD optimizer was observed to be 57% on
EfficientNetB5 model and 92.24% with Adam. Therefore, for further compilation, Adam optimizer has
been used. Dataset [8] with classes 10, 20, 50, 100, and 1000, consisting of older currencies was sorted
Page 8/25
and only the classes 10, 20, 50 and 100 were tested on MobileNetV2 and EfficientNetB5. The accuracies
achieved are 95.48% and 81.34% respectively and are as shown in below graphs Fig. 6.
4. ICD dataset: We classified our dataset into the following categories: 10, 20, 50, 100, 200, 500 and 2000.
Adam optimizer was used and the model was trained with batch size 16, epochs 20, and learning rate
0.0001. The accuracy achieved was 83.98% with MobileNetV2 and 86.01% with EfficientNetB5.
MobileNetV2 and EfficientNetB5 were tested on 6 datasets including the proposed dataset, as shown in
Table 1.
Table 1
Performance of the proposed dataset
Dataset MobileNetV2 EfficientNetB5
[4] 83.34 96.72
[5] + [6] 74.75 91.04
[7] + [8] 95.48 91.34
Proposed ICD dataset 83.98 86.01
Analyzing the Models
The MobileNetV2 model scores 83.98 while the efficientNetB5 model scores 86.01.
The MobileNetV2 model requires fewer trainable parameters – 663560 compared to the
EfficientNetB5 model which requires 131800056.
The MobileNetV2 model requires lower execution time (560ms) compared to the EfficientNetB5
model.
The MobileNetV2 model is so lightweight (2.6 MB) compared to the EfficientNetB5 model, hence will
be easier to deploy it on the android.
Hence, we can conclude the MobileNetV2 model is appropriate for working with the currency dataset and
for its deployment on android. Figure 8 shows the features which led to our selection of MobileNetV2 over
EfficientNetB5.
2.4 Experimental Setup for Text Recognition

For text detection and recognition, our proposed system makes use of the Google Mobile Vision Text API
[20]. The Android library is used with the text-to-speech engine. The design of our proposed Android-
based live text reader is described in the sections below.
Following are the steps involved in the implementation process:
Page 9/25
In the MainActivity, the application checks for camera-permission.
On receiving the permission, a TextRecognizer object is created.
CameraSource object is implemented to start the camera.
A processor is set to the TextRecognizer to detect if any text is available on the camera screen.
The values of TextBlock are used to create StringBuilder objects and are added to the textView, which
will be updated every time there is text in the camera view.
TextRecognizer: This object processes images and determines what text appears within them. Once it’s
initialized, it can be used to detect text in all types of images.
CameraSource: This is a camera manager pre-configured for Vision processing. Here we will set the
resolution to 1280*1024 and turn auto-focus on, because it will help in recognizing smaller text much
faster. Also, we set the cameraSource to use the rear camera by default.
Detector.Processor < TextBlock>: For TextRecognizer to read text straight from the camera, we have to
implement a Detector Processor, which will handle detections as often as they become available.
Following segmentation, text is converted to speech using a text-to-speech engine. The text to speech
engine is run using built-in Android libraries. The speech-out service is handled by the phone's built-in
feature. The proposed system's flowchart is shown in Fig. 9.
3 Results
This section presents the systematic analysis and detailed observations pertaining to the quantitative
experiments.
3.1 Currency Recognition

After experimenting with five different datasets, we combined them along with certain preprocessing to
form the final dataset. This final dataset was used with the train, test, and validation sets divided in the
ratio of 70:20:10. The 70% dataset was used for the training of the model with the activation function
‘relu’. We have used a softmax activation function in the last layer for this multi-class classification
problem. The optimizer used is Adam and the loss function is binary cross entropy. As a result, after
training the model up to 100 epochs, we observed an accuracy of 94.35%.
The accuracy graphs for 20, 32, and 100 epochs are as shown below in Fig. 10. Figure 11 depicts the bar
chart for the same.
The precision, recall and f1-score obtained for each epoch is shown in Table 2.
Page 10/25
Table 2
Validation and test report for 20, 32 and 100 epochs.
Hyperparameter Validation Report Test Report
precision recall F1 score precision recall F1 score
20 epochs 0.69 0.73 0.69 0.79 0.77 0.77
32 epochs 0.75 0.77 0.74 0.86 0.84 0.84
100 epochs 0.83 0.87 0.85 0.88 0.91 0.86
Confusion matrix is shown in Fig. 12 where 1580 were correctly classified out of 2192 images during
validation, while on the test set 106 images were classified correctly out of 125 images.
3.2 Application
This section briefly illustrates the utility of the proposed MobileNetV2 model in the real-world scenario of
currency recognition and OCR text recognition via our proposed android app ‘Third Eye: A Blind aid’.
In order to help BVIPs recognize Indian banknotes, Third Eye utilizes a deep learning algorithm to identify
the underlying currency denomination. Our Android app has text recognition built in using OCR. Third Eye
offers an intuitive user interface that was created with the blind in mind and put to the test. To use this
feature, the user needs to hold the currency in front of the smartphone's back camera. The software then
notifies the user of the currency denomination via an auditory alert. With the frozen trained deep learning
model in .pb format, Third Eye recognizes the denomination of the currency. The user must hold the
camera in front of printed or handwritten text for text recognition to work. The app reads out the text for
the user.
4 Conclusion
This paper focuses on the issue of visually impaired people having trouble reading Indian currency and
text, and it offers an automated end-to-end solution. We have used a bigger dataset that includes images
with various backgrounds, conditions, lighting, and orientations. The proposed lightweight network
(MobileNetV2) makes use of depth-wise separable convolution techniques and controlled multi-dilation
techniques. The generalization and prediction capabilities of the proposed framework have been
thoroughly evaluated using publicly accessible datasets. In terms of parameters (3.6M) and accuracy,
MobileNetV2 is more straightforward and effective. Mobile Vision API is used to implement OCR. An
android application ‘Third Eye’ is presented to recognize Indian currency denominations and text for
Visually impaired.
The proposed framework is suitable for a mobile-compatible environment offering a trade-off between
memory, speed, and high accuracy. In the future, the proposed framework can be further improvised by
examining fine-grained detectors for capturing the other discriminative clues and motifs present in the
Page 11/25
currency images; by considering more feature-rich learning and light weight models; and by training on
more diverse and larger datasets. The framework can be also extended for global currency recognition,
for serial numbers recognition, for detecting fake currency and recognizing text of different languages.
Declarations
Ethical Approval
This declaration is “not applicable”.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that
could have appeared to influence the work reported in this paper.
Authors' contributions
All authors contributed to the study conception and design. Material preparation, data collection and
analysis were performed by Sahana V, Shashidhar R., and Bindu S.N. The first draft of the manuscript
was written by Chandana A. N., Nishrutha C.G. and all authors commented on previous versions of the
manuscript. All authors read and approved the final manuscript.”
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this
manuscript.
Availability of data and materials
Data “available on request”.
References
1. World Health Organization: WHO. (2022). Blindness and vision impairment. www.who.int.
https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment
2. Haileamlak, A. (2022b). The Burden of Visual Impairment and Efforts to Curve it Down. PubMed,
32(5), 874.
3. Singh, Mandhatya, et al. "IPCRF: An End-to-End Indian Paper Currency Recognition Framework for
Blind and Visually Impaired People." IEEE Access 10 (2022): 90726-90744.
4. Vidula Meshram, Pornpat Thamkrongart, Kailas Patil, Prawit Chumchu, Shripad Bhatlawande, July 8,
2020, "Dataset of Indian and Thai Banknotes", IEEE Dataport, doi: https://dx.doi.org/10.21227/cjb5-
n039.
Page 12/25
5. Indian Currency Note Images dataset 2020. (2020, September 21). Kaggle.
https://www.kaggle.com/datasets/vishalmane109/indian-currency-note-images-dataset-2020.
6. Indian currency notes. (2019, December 6). Kaggle.
https://www.kaggle.com/datasets/shobhit18th/indian-currency-notes
7. Anilsathyan. (n.d.). indian-currency-classification/data at master · anilsathyan7/indian-currency-
classification. GitHub. https://github.com/anilsathyan7/indian-currency-
classification/tree/master/datacurrency_dataset.zip. (n.d.). Google Docs.
8. https://drive.google.com/file/d/0B7Am6nOVeP7N1lQOUVUYlBuc0E/edit?
resourcekey=0U_ItA0WqcWQljjphvAxKGw
9. S. A. Edupuganti, V. Durga Koganti, C. S. Lakshmi, R. Naveen Kumar and R. Paruchuri, "Text and
Speech Recognition for Visually Impaired People using Google Vision," 2021 2nd International
Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 2021, pp. 1325-1330,
doi: 10.1109/ICOSEC51865.2021.9591829.
10. IBM. https://www.ibm.com/cloud/blog/optical-character-recognition
11. Brownlee, J. (2019). A Gentle Introduction to Transfer Learning for Deep Learning.
MachineLearningMastery.com. https://machinelearningmastery.com/transfer-learning-for-deep-
learning
12. Marcelino, P. (2022, May 6). Transfer learning from pre-trained models - Towards Data Science.
Medium. https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751
13. Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures,
challenges, applications, future directions. J Big Data8, 53 (2021). https://doi.org/10.1186/s40537-
021-00444-8
14. Abid, A. (2021, December 31). Fixing Your Machine Learning Model’s Failure Points. Medium.
https://towardsdatascience.com/fixing-your-machine-learning-models-failure-points-e3ec0a047895
15. Calton, Landon, and Zhangping Wei. "Using artificial neural network models to assess hurricane
damage through transfer learning." Applied Sciences 12.3 (2022): 1466.
16. Sandler, M., A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen. (2018) “MobileNetV2: Inverted
residuals and linear bottlenecks.” Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition: 4510–4520, doi: 10.1109/CVPR.2018.00474.
17. Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision
applications." arXiv preprint arXiv:1704.04861 (2017).
18. Indraswari, Rarasmaya, Rika Rokhana, and Wiwiet Herulambang. "Melanoma image classification
based on MobileNetV2 network." Procedia computer science 197 (2022): 198-207.
19. Gupt, M. (2019). Optical Character Recognition on Android – OCR. Truiton.
https://www.truiton.com/2016/11/optical-character-recognition-android-ocr/
20. S. A. Edupuganti, V. Durga Koganti, C. S. Lakshmi, R. Naveen Kumar and R. Paruchuri, "Text and
Speech Recognition for Visually Impaired People using Google Vision," 2021 2nd International
Page 13/25
Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 2021, pp. 1325-1330,
doi: 10.1109/ICOSEC51865.2021.9591829.
21. Text recognition v2. (n.d.). Google Developers. https://developers.google.com/ml-kit/vision/text-
recognition/v2
22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet: A large-scale hierarchical image
database,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255
Figures
Figure 1
Bottleneck residual block
Page 14/25
Figure 2
MobileNetV2Convolutional Blocks
Figure 3
Text Segmentation
Page 15/25
Figure 4
Accuracy graphs of EfficientNetB5 and MobileNetV2 models for IEEE dataset.
Page 16/25
Figure 5
Accuracy graphs of EfficientNetB5 and MobileNetV2 models for Kaggle datasets.
Page 17/25
Figure 6
Accuracy graphs of EfficientNetB5 and MobileNetV2 models for github datasets.
Figure 7
Page 18/25
Accuracy graphs of EfficientNetB5 and MobileNetV2 models for the final dataset.
Figure 8
Comparison of MobileNetV2 and EfficientNetB5 with respect to epoch running time, average size of the
model, and average trainable parameters.
Page 19/25
Figure 9
Flowchart of Text Reader API.
Page 20/25
Figure 10
Accuracy graphs for 20, 32 and 100 epochs.
Page 21/25
Figure 11
Training and validation accuracy for MobileNetV2 model.
Page 22/25
Figure 12
Confusion matrix for validation and test sets.
Figure 13
Third-Eye Application User-interface.
Page 23/25
Figure 14
Currency Recognizer
Page 24/25
Figure 15
Text Recognizer
Page 25/25

Third Eye Smart Aid For Visually Impaired

Uploaded by

Copyright:

Available Formats

Third Eye Smart Aid For Visually Impaired

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Third Eye Smart Aid For Visually Impaired

Uploaded by

Copyright:

Available Formats

Third Eye Smart Aid for Visually Impaired

Posted Date: June 28th, 2023

The contributions of this work are:

2.1 Proposed Architecture for Currency Recognition

2.2 Proposed Method for Text Recognition

Optical Character Recognition (OCR)

Mobile Text Vision API

A block is a continuous group of text lines, such as a column or paragraph.

2.3 Experimental Setup for Currency Recognition

The categorical cross entropy is given by the Eq. (1)

mt = aggregate of gradients at time t [current] (initially, mt = 0)

mt−1 = aggregate of gradients at time t-1 [previous]

Wt+1 = weights at time t + 1

αt = learning rate at time t

∂L = derivative of Loss Function

∂Wt = derivative of weights at time t

β = Moving average parameter (const, 0.9)

[4] 83.34 96.72

[5] + [6] 74.75 91.04

[7] + [8] 95.48 91.34

Proposed ICD dataset 83.98 86.01

Analyzing the Models

2.4 Experimental Setup for Text Recognition

Following are the steps involved in the implementation process:

3.1 Currency Recognition

precision recall F1 score precision recall F1 score

20 epochs 0.69 0.73 0.69 0.79 0.77 0.77

32 epochs 0.75 0.77 0.74 0.86 0.84 0.84

100 epochs 0.83 0.87 0.85 0.88 0.91 0.86

This declaration is “not applicable”.

Availability of data and materials

Data “available on request”.

Bottleneck residual block

Accuracy graphs of EfficientNetB5 and MobileNetV2 models for IEEE dataset.

Accuracy graphs of EfficientNetB5 and MobileNetV2 models for Kaggle datasets.

Accuracy graphs of EfficientNetB5 and MobileNetV2 models for github datasets.

Flowchart of Text Reader API.

Accuracy graphs for 20, 32 and 100 epochs.

Training and validation accuracy for MobileNetV2 model.

Confusion matrix for validation and test sets.

Third-Eye Application User-interface.

You might also like