
Smart Innovation, Systems and Technologies 281

Janmenjoy Nayak · H. S. Behera · Bighnaraj Naik · S. Vimal · Danilo Pelusi
Editors

Computational Intelligence in Data Mining
Proceedings of ICCIDM 2021
Smart Innovation, Systems and Technologies

Volume 281

Series Editors
Robert J. Howlett, Bournemouth University and KES International,
Shoreham-by-Sea, UK
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The Smart Innovation, Systems and Technologies book series encompasses the topics
of knowledge, intelligence, innovation and sustainability. The aim of the series is to
provide a platform for the publication of books on all aspects of single and
multi-disciplinary research on these themes, in order to make the latest results
available in a readily accessible form. Volumes on interdisciplinary research combining
two or more of these areas are particularly sought.
The series covers systems and paradigms that employ knowledge and intelligence
in a broad sense. Its scope is systems having embedded knowledge and intelligence,
which may be applied to the solution of world problems in industry, the environment
and the community. It also focusses on the knowledge-transfer methodologies and
innovation strategies employed to make this happen effectively. The combination
of intelligent systems tools and a broad range of applications introduces a need
for a synergy of disciplines from science, technology, business and the humanities.
The series will include conference proceedings, edited collections, monographs,
handbooks, reference books, and other relevant types of book in areas of science and
technology where smart systems and technologies can offer innovative solutions.
High quality content is an essential feature for all book proposals accepted for the
series. It is expected that editors of all accepted volumes will ensure that contributions
are subjected to an appropriate level of reviewing process and adhere to KES quality
principles.
Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH,
Japanese Science and Technology Agency (JST), SCImago, DBLP.
All books published in the series are submitted for consideration in Web of Science.

More information about this series at https://link.springer.com/bookseries/8767


Janmenjoy Nayak · H. S. Behera · Bighnaraj Naik ·
S. Vimal · Danilo Pelusi
Editors

Computational Intelligence
in Data Mining
Proceedings of ICCIDM 2021
Editors

Janmenjoy Nayak
Department of Computer Science
Maharaja Sriram Chandra BhanjaDeo (MSCB) University
Baripada, Odisha, India

H. S. Behera
Department of Information Technology
Veer Surendra Sai University of Technology
Sambalpur, Odisha, India

Bighnaraj Naik
Department of Computer Application
Veer Surendra Sai University of Technology
Sambalpur, Odisha, India

S. Vimal
Department of AI & DS
Ramco Institute of Technology
Rajapalayam, Tamil Nadu, India

Danilo Pelusi
Faculty of Communication Sciences
University of Teramo
Teramo, Italy

Smart Innovation, Systems and Technologies
ISSN 2190-3018  ISSN 2190-3026 (electronic)
ISBN 978-981-16-9446-2  ISBN 978-981-16-9447-9 (eBook)
https://doi.org/10.1007/978-981-16-9447-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
ICCIDM Committee

Chief Patron

Dr. K. Someswara Rao, Chairman, AITAM

Patrons

Sri L. L. Naidu, Secretary
Sri K. Madhu Kumar, Director
Smt. V. Sudha Priya, Vice Chairman
Sri T. Naga Seshu, Director
Prof. V. V. Nageswara Rao, Director
Dr. K. Ravi Kumar, Joint Secretary
Sri V. Naga Sanketh, Director
Sri T. Naga Raju, Treasurer
Dr. K. B. Madhu Sahu, Director, R&D
Dr. A. S. Srinivasa Rao, Principal

Honorary Advisory Chair

Prof. V. E. Balas, Aurel Vlaicu University, Romania


Honorary General Chairs

Prof. Lakhmi C. Jain, University of Canberra, Australia
Prof. Ajith Abraham, Director, MIR LAB, USA
Prof. Margarita Favorskaya, Reshetnev Siberian State University of Science and
Technology, Russia

General Chairs

Dr. B. K. Panigrahi, Department of EEE, Indian Institute of Technology (IIT), Delhi, India
Dr. H. S. Behera, Department of IT, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha, India

Convenors

Dr. Ch. Ramesh, Department of CSE, AITAM, AP
Dr. Janmenjoy Nayak, Department of Computer Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University, Baripada, Odisha, India

Program Chairs

Dr. A. K. Das, IIEST, Shibpur, WB
Dr. D. P. Mohapatra, NIT Rourkela, Odisha, India
Dr. Danilo Pelusi, University of Teramo, Italy
Dr. J. C. Bansal, South Asian University, New Delhi, India
Dr. Bighnaraj Naik, VSSUT, Burla, Odisha, India
Dr. S. Vimal, Ramco Institute of Technology, India

International Advisory Committee

Dr. Weiping Ding, Deputy Dean of School of Information Science and Technology,
Nantong University
Dr. Florin Popentiu-Vladicescu, University Politehnica of Bucharest, Romania
Dr. Vincenzo Piuri, University of Milan
Dr. Shaikh A. Fattah, Bangladesh

Dr. Claude Delpha, Université Paris Saclay
Dr. Mukesh Prasad, University of Technology Sydney
Dr. Uttam Ghosh, Vanderbilt University
Dr. Charlie (Seungmin) Rho, Chung-Ang University
Dr. Naveen Chilamkurti, La Trobe University
Dr. Ahmed A. Elngar, Beni-Suef University
Dr. Dac-Nhuong Le, Haiphong University
Dr. Irfan Mehmood, University of Bradford
Dr. Alireza Souri, Islamic Azad University, Iran
Dr. Qin Xin, University of the Faroe Islands
Dr. Mamoun Alazab, Charles Darwin University
Dr. Muhammad Bilal, Hankuk University of Foreign Studies
Dr. Xiaochun Cheng, Middlesex University
Dr. Raffaele Mascella, University of Teramo
Dr. Korhan Cengiz, Kadir Has University
Dr. Noor Zaman, Taylor’s University
Dr. Yong Deng, Institute of Fundamental and Frontier Science Chengdu
Dr. Luca Tallini, University of Teramo
Dr. Subramaniam Ganesan, Oakland University
Dr. Khan Muhammad, Sejong University
Dr. Ramani Kannan, Universiti Teknologi PETRONAS

National Advisory Committee

Dr. D. Vishnu Murthy, Dean, FS, AITAM, AP
Dr. M. Jayamanmadha Rao, Dean, SA, AITAM, AP
Dr. Chappa. Ramesh, Dean, IQAC, AITAM, AP
Dr. D. T. V. Dharmajeerao, Dean, A&P, AITAM, AP
Dr. D. Azad, Dean, R&D, AITAM, AP
Dr. D. Vijaya Kumar, Dean, Academics, AITAM, AP
Dr. D. Yugandhar, Associate Dean, ACH, AITAM, AP
Dr. U. D. Prasan, HOD, CSE, AITAM, AP
Dr. D. Sreeramulu, HOD, ME, AITAM, AP
Dr. Y. Ramesh, HOD, IT, AITAM, AP
Dr. K. Kiran Kumar, HOD, EEE, AITAM, AP
Dr. P. Dinakar, HOD, CE, AITAM, AP
Dr. B. Rama Rao, HOD, ECE, AITAM, AP
Dr. Koppala Venugopal, HOD, MBA, AITAM, AP
Dr. R. Santhi Kumar, HOD, BS&H, AITAM, AP
Prof. S. Panda, VSSUT, Burla
Prof. R. P. Panda, VSSUT, Burla
Prof. U. Maulik, Jadavpur University
Prof. P. P. Choudhury, ISI, Kolkata

Prof. J. K. Mondal, Kalyani University
Prof. S. Bhattacharjee, NIT, Surat
Prof. D. Ramesh, Professor, Department of CSE, JNTU, Jagityal
Prof. Kenji Suzuki, Tokyo Institute of Technology, Japan
Prof. P. K. Dash, Director, R&D, SOA University, BBSR
Prof. B. B. Biswal, Director, NIT Meghalaya
Prof. Ashish Ghosh, ISI, Kolkata
Prof. Saroj K. Meher, ISI, Bangalore
Prof. G. Panda, IIT, BBSR
Prof. S. G. Sanjeevi, NIT, Warangal
Prof. D. K. Pratihar, IIT, Kharagpur
Prof. P. K. Hota, VSSUT, Burla
Prof. Arnab Bhattacharya, IIT, Kanpur
Prof. S. P. Maity, IIEST, Shibpur
Prof. J. V. R. Murthy, JNTU Kakinada
Prof. D. V. L. N. Somayajulu, NIT, Warangal
Prof. G. K. Nayak, IIIT, BBSR
Prof. N. B. Venkateswarlu, GVP(W), Visakhapatnam
Prof. D. P. Mohapatra, NIT, Rkl
Prof. M. R. Patra, BU, Berhampur
Prof. S. Dehuri, FMU, Balasore
Prof. R. Rajeswara Rao, Vice Principal, JNTU, Vizianagaram
Dr. K. Jagadeesh, IIIT, Kanchipuram, India
Prof. Pascal Lorenz, University of Haute Alsace, France

International Technical Committee Members

Dr. Adel M. Alimi, REGIM-Lab., ENIS, University of Sfax, Tunisia
Dr. Rabindra Kumar Sahu, VSSUT, Burla
Dr. Mazhar Rathore, Hammad Bin Khalifa University, Doha, Qatar
Dr. Yurui Ming, UTS, Sydney, Australia
Dr. Fuyuan Xiao, Southwest University, China
Dr. Soumya Ranjan Nayak, Amity University, Noida, India
Dr. M. R. Kabat, Department of CSE, VSSUT, India
Dr. Ram Sarkar, Jadavpur University, WB, India
Dr. B. Acharya, NIT, Raipur, India
Dr. Ahmad Taher Azar, Faculty of Computers and Information, Benha University,
Egypt
Dr. A. R. Routray, Fakir Mohan University, Balasore, Odisha, India
Dr. Santanu Phadikar, Maulana Abul Kalam Azad University of Technology, West
Bengal
Dr. A. K. Turuk, Department of CSE, NIT, RKL, India
Dr. Swati Vijay Shinde, Pimpri Chinchwad College of Engineering, Pune

Dr. Sarat Ch. Nayak, GITAM University, Visakhapatnam, AP, India
Dr. Manohar Mishra, SOA University, Bhubaneswar
Dr. G. T. Chandra Sekhar, Sri Sivani College of Engineering, Srikakulam, Andhra
Pradesh
Dr. Ananya Barui, Center of Healthcare Science and Technology, IIEST, Shibpur,
India
Dr. Zehong Cao, University of Tasmania, Australia
Dr. M. Balaji, JNTUK, Kakinada
Dr. Santosh Ku. Sahoo, CVR College of Engineering, Hyderabad, India
Dr. Motahar Reza, Gitam University, Hyderabad, India
Dr. Sunanda Das, JAIN University, Bangalore, India
Dr. Varun G. Menon, SCMS School of Engineering and Technology, Kerala, India
Dr. S. K. Hafizul Islam, IIIT, Kalyani
Dr. Shrikant V. Sonekar, J D College of Engineering and Management, Nagpur
Dr. Soumendra Das, AITAM, Tekkali
Dr. B. B. Chaudhury, Indira Gandhi Institute for Technology, Dhenkanal, Odisha

Organizing Chairs

Mr. K. Eswara Rao, Department of CSE, AITAM, AP
Mr. Ch. Ravi Kishore, Department of CSE, AITAM, AP

Publicity Chairs

Dr. T. Ravi Kumar, Department of CSE, AITAM, AP
Dr. M. V. B. Chandra Sekhar, Department of CSE, AITAM, AP
Dr. K. Yogeswara Rao, Department of CSE, AITAM, AP
Sri L. V. Satyanarayana, Department of CSE, AITAM, AP

Publication Chairs

Dr. Y. Ramesh, Department of CSE, AITAM, AP
Dr. K. Prasada Rao, Department of CSE, AITAM, AP
Mr. T. Chalapatha Rao, Department of CSE, AITAM, AP
Mrs. H. Swapnarekha, Department of IT, AITAM, AP
Mr. P. Sasibhusana Rao, Department of CSE, AITAM, AP

Sponsorship Chairs

Dr. B. Rajesh, Department of MBA, AITAM, AP
Dr. T. P. R. Vital, Department of CSE, AITAM
Dr. Santosh, Department of ECE, AITAM, AP
Dr. B. Ramesh Naidu, Department of IT, AITAM, AP

Registration Chairs

Dr. G. Nageswara Rao, Department of IT, AITAM, AP
Dr. Ch. Dhanunjaya Rao, Department of CSE, AITAM, AP
Mrs. N. Preethi, Department of CSE, AITAM, AP
Smt. K. Sangeetha, Department of CSE, AITAM, AP

Web Chair

Sri P. Suresh Kumar, Department of CSE, AITAM, AP

Organizing Committee Members

Sri R. Srinivas, Department of CSE, AITAM, AP
Sri S. Vishnu Murthy, Department of CSE, AITAM, AP
Sri Pramod Kumar Sahu, Department of CSE, AITAM, AP
Sri T. Prabhakara Rao, Department of CSE, AITAM, AP
Sri. B. Vijay, Department of CSE, AITAM, AP
Sri B. Ramesh, Department of CSE, AITAM, AP
Sri G. Stalin Babu, Department of CSE, AITAM, AP
Sri P. Ram Kishore, Department of CSE, AITAM, AP
Sri G. Vijay Kumar, Department of CSE, AITAM, AP
Sri G. S. Pavan Kumar, Department of CSE, AITAM, AP
Sri D. Sreenu Babu, Department of CSE, AITAM, AP
Mrs. K. B. Anusha, Department of CSE, AITAM, AP
Sri M. Yugandhar, Department of CSE, AITAM, AP

Reviewer Committee

Anindya Banerjee, Capgemini Technology Services India
Arun Agarwal, ITER, Siksha ‘O’ Anusandhan Deemed to be University
Md. Kowsher, Stevens Institute of Technology, USA
Dr. Rakesh Dani, Graphic Era Deemed to be University
Sushreeta Tripathy, ITER, Siksha ‘O’ Anusandhan Deemed to be University
Luca Tallini, University of Teramo, Italy
Dr. P. Sirish Kumar, Aditya Institute of Technology and Management, AP
Dr. Sagaya Aurelia, CHRIST (Deemed to be) University, Bangalore
Sunil Sharma, Pacific University, Udaipur
Jayaraman Valadi, Flame University, Pune
Abhijit Barman, National Institute of Technology Silchar
Dr. C. Annal Deva Priya Darshini, Madras Christian College, Chennai
Prof. Nitin N. Pise, MIT-WPU, Pune
Dr. R. V. S. Lalitha, Aditya College of Engineering and Technology, Surampalem
Sourav Das, Future Institute of Technology, Kolkata
Dr. A. V. Krishna Prasad, MVSR Engineering College, Hyderabad
Dr. K. Prabu, Sudharsan College of Arts and Science, Pudukkottai Tamil Nadu
Dr. Shrikant V. Sonekar, J D College of Engineering and Management, Nagpur
Maharashtra
Dr. Shaik Altaf Hussain Basha, Krishna Chaitanya Institute of Technology and
Sciences (KITS), Markapur
Subhadip Mukherjee, Kharagpur College, Kharagpur, West Bengal
Sangeetha Rajesh, Somaiya Vidyavihar University, Maharashtra
Dr. Kavita Arora, MRIIRS, Faridabad
Dr. P. Kauser Ahmed, VIT, Vellore, Tamil Nadu
Dr. Vijay Anant Athavale, Panipat Institute of Engineering and Technology, Haryana
M. Swarna Sudha, Ramco Institute of Technology, Tamil Nadu
Mr. K. Vignesh Saravanan, Ramco Institute of Technology, Tamil Nadu
Dr. M. E. Paramasivam, Sona College of Technology, Tamil Nadu
Prof. Suja A. Alex, St. Xavier’s Catholic College of Engineering, Tamil Nadu
Dr. V. E. Jayanthi, PSNA College of Engineering and Technology, Dindigul, India
S. Prathap Singh, St. Joseph’s Institute of Technology, Chennai
Dr. G. R. S. Murthy, Gayatri Vidya Parishad College for Degree and PG Courses
(A), Visakhapatnam
Naila Aziza HOUACINE, USTHB—LRIA Algiers
Dr. Sachin Lakra, Manav Rachna University, Faridabad, Haryana, India
Dr. Bhavana Narain, MATS University, Raipur (CG)
S. Valarmathi, Avinashilingam University, Coimbatore—641043, Tamil Nadu
Dr. Garima Mathur, Poornima University, Jaipur
Dr. S. Sumathi, Sri Ramakrishna Institute of Technology, Coimbatore
Dr. P. M. K. Prasad, GVP College of Engineering for Women Visakhapatnam
Dr. M. Nandhini, Government Arts College, Udumalpet, Tamil Nadu

Prachi Sharma, Sagar Institute of Research and Technology, Bhopal
Dr. M. Umadevi, Sri Adi Chunchanagiri Women’s College, Cumbum
Dr. Nagendra Panini Challa, Shri Vishnu Engineering College for Women,
Bhimavaram
Dr. Mahmood Ali Mirza, Krishna University, Machilipatnam, Andhra Pradesh
Dr. R. G. Vidhya, HKBK College Of Engineering and Technology, Bangalore
Dulari Bhatt, SAL College of Engineering—382721
Prashant Jani, Institute of Computer Technology, Ganpat University, India
Madhuri Chopade, Gandhinagar Institute of Technology
Dr. Arti Jain, Jaypee Institute of Information Technology (JIIT), Noida
Rahul A. Vaghela, Gandhinagar Institute of Technology
Dr. Rambabu Pemula, Raghu Engineering College, Visakhapatnam
Dr. S. N. Thirumala Rao, Narasaraopeta Engineering College
Poly Rani Ghosh, Jatiya Kabi Kazi Nazrul Islam University, Trishal, Mymensingh,
Bangladesh
Dr. D. T. Mane, JSPM’s Rajarshi Shahu College of Engineering, Pune, Maharashtra
Jitendra Singh Tamang, Sikkim Manipal Institute of Technology, Majhitar, Rangpo
Dr. Nilu Singh, Koneru Lakshmaiah Education Foundation (KLEF), Vijayawada
Dr. Akansha Kumar, Principal Data Scientist (Jio Platforms Ltd., Hyderabad)
Dr. K. Selvani Deepthi, ANITS Engineering College, Visakhapatnam
Dr. B. S. Shylaja, Dr. Ambedkar Institute of Technology, Outer Ring Road Near
Gnana Bharathi, Bengaluru
Dr. T. Satyanarayana Murthy, Bapatla Engineering College, AP
Prof. Swati Kadlag, Symbiosis International University Institute of Technology, Pune
H. Swapnarekha, Aditya Institute of Technology and Management (AITAM), Tekkali
Satish Kumar Patnala, ANITS Engineering College, Sangivalasa, Visakhapatnam
P. Suresh Kumar, Aditya Institute of Technology and Management, Tekkali, AP
D. Karun Kumar Reddy, Dr. Lankapalli Bullayya College of Engineering, Visakhapatnam, Andhra Pradesh—530013
Dr. Sarat Ch. Nayak, GITAM University, Visakhapatnam, India
Dr. Soumya Ranjan Nayak, Amity University, Noida
Dr. Bighnaraj Naik, Veer Surendra Sai University of Technology, Burla, Odisha
Mr. Byomokesha Dash, Aditya Institute of Technology and Management (AITAM),
Tekkali, AP, India
Subhashree Mohapatra, ITER, Bhubaneswar
Dr. S. Siva Kumar, KL University, Andhra Pradesh
Dr. Kalyan Kumar Jena, Parala Maharaja Engineering College, BPUT, Odisha
Dr. Sunanda Das, JAIN University, Bangalore, India
Dr. G. T. Chandrasekhar, Sri Sivani College of Engineering, Srikakulam, AP
Dr. O. K. Sikha, Amrita School of Engineering, Coimbatore
Dr. Himansu Das, KIIT University, Bhubaneswar
Dr. Ch. Dhanunjay Rao, Aditya Institute of Technology and Management, Tekkali
Preface

The twenty-first century has witnessed the emergence of groundbreaking
advancements in intelligent computing technologies that have revolutionized our way
of life. ICCIDM 2021 is an annual conference with unique and valuable dimensions,
focusing on various aspects of advancements in computational intelligence.
ICCIDM 2021 is a mega event that brings together researchers, academicians, and
practitioners from around the globe to disseminate ideas, problems, and solutions
relating to data mining and related domains. Participants and readers will be able to
enhance their skills and ideas for the development of effective intelligent solutions.
Furthermore, ICCIDM 2021 promotes new research dimensions and advancements in the
theme of computational intelligence for the analysis and exploration of large amounts
of data for effective pattern recognition.
After five successful editions of ICCIDM, the 6th International Conference on
Computational Intelligence in Data Mining (ICCIDM 2021) is organized by the
Department of Computer Science and Engineering, Aditya Institute of Technology
and Management (AITAM), Tekkali, AP, India. This year, we especially encouraged
papers on computational intelligence in data mining that highlight modern
applications and research challenges of the post-pandemic scenario. The covered
subjects relate to intelligent learning, data mining, big data analytics, artificial
intelligence, natural language processing, deep learning applications, the Internet of
Things, etc. This proceedings addresses new applications and proposes methodologies
and a roadmap for achieving the vision of intelligent computing for computational
intelligence in data mining. The proceedings captures the current state-of-the-art
research in the learning technology field and will significantly impact the research
community in the longer term. In consideration of the participants’ requests, the
conference committee made the difficult decision to convert this year’s face-to-face
meeting to a virtual conference. We appreciate the realm of possibilities offered by
online conferences; however, we miss the opportunity to bring academicians and
researchers together face-to-face due to the COVID-19 pandemic and the resulting
stay-at-home orders.


In this sixth edition, two new themes, deep learning and big data, receive specific
focus with a wider range of applications. Moreover, a separate theme of proposals was
invited on applications of computational intelligence in text and video recognition,
sentiment analysis, advanced image processing, and COVID data analysis. The articles
are subdivided into four tracks: advanced computational intelligence techniques and
their applications; computation with modeling; nature-inspired computation; and
neural network and data mining applications. This edition of ICCIDM contains
good-quality articles based on the major and minor thematic areas of the conference,
each specialized in its respective domain. This volume collects a wide range of
articles on applications of computational intelligence to multi-sensor data fusion
for occupancy detection, sentiment analysis, automated facial mask detection and
social distancing during the COVID-19 pandemic, detection of insider threats,
advanced persistent threat detection, diagnosing plant diseases, multimodal MRI
analysis, Alzheimer disease classification, robot motion using gradient generalized
artificial potential fields with obstacles, analysis of a kidney disease dataset,
deepfakes for video conferencing, face recognition, the impact of UV-C treatment on
fruits and vegetables, modeling and forecasting stock closing prices, software
reliability prediction, disaster event detection, breast cancer prediction, customer
segmentation, solar radiation prediction, QCM sensor-based alcohol classification,
breast cancer mammography identification, etc.
For the sake of a well-prepared conference, all accepted papers were double-blind
peer-reviewed by subject experts. The papers went through a strict reviewing process
performed by qualified national and international reviewers, with the requirement of
containing highly informative and insightful contributions of research quality. At
least two subject-expert reviewers assessed each submission to ensure a better
selection of papers. The editors had a great time working in collaboration with the
international advisory, program, and technical committee members. We are very pleased
to report that the quality of this year’s submissions turned out to be very high.
This edition of ICCIDM 2021 provides an interactive forum for presentation and
discussion of computational intelligence research and related fields. In addition,
this volume of proceedings offers readers a selection of refereed papers that
highlight the role of intelligent computing methods and algorithms in the data
sciences. In this respect, on behalf of the Organizing Committee, we would like to
thank the reviewers, the technical committee members, the international and national
advisory board members, and the organizers for their valuable efforts during this
recurring pandemic challenge. We are indebted to their great effort and
professionalism. Furthermore, we wish to express our heartfelt gratitude to the
conference keynote speakers. The support and cooperation of the Springer technical
team in the timely production of this volume are also deeply acknowledged. Finally,
we wish to thank all the authors of submitted technical papers, demos, and
exhibitions for contributing to these great proceedings.

Editors
Baripada, India Janmenjoy Nayak
Sambalpur, India H. S. Behera
Sambalpur, India Bighnaraj Naik
Rajapalayam, India S. Vimal
Teramo, Italy Danilo Pelusi
Acknowledgements

After five successful editions of ICCIDM, the sixth edition focused on more quality
aspects of computational intelligence-based research and developments in various
disciplines of data mining. Though the COVID-19 pandemic has affected a wide variety
of researchers, interest in research across the disciplines of computational
intelligence for solving useful data mining problems has remained the same. In fact,
many novel solutions for combating COVID-19 have emerged along different dimensions.
This edition attracted several researchers and academicians across the globe to
choose this venue for submitting their research, adding to the reputation of the
ICCIDM conference for research findings and knowledge sharing among national and
international experts. The ICCIDM team is truly thankful to all the prospective
authors whose valuable research findings made this event exceptional.
We have been fortunate to work with brilliant international and national advisory,
technical, and program committee members. It was our great pleasure to work with such
eminent members of the program and technical committees, whose suggestions made it
possible to filter good-quality articles out of all submitted papers. We would like
to convey our sincere thanks and obligation to the benevolent reviewers for sparing
their precious time, putting in the effort to review the papers within the stipulated
time, and providing important insights on the presentation, content, and quality of
the proceedings.
We are especially thankful to the organizing team of ICCIDM for their enormous
support in every form to make this international event a success. The efforts of this
young yet experienced and dynamic organizing committee have led all the way to mark
the success of this year’s event. Moreover, the inputs and efforts of the program
chair(s) are highly appreciated for enhancing the quality of each accepted article.
We are highly thankful to the management of Aditya Institute of Technology and
Management (AITAM), Tekkali, Srikakulam, AP, India, especially Prof. V. V. Nageswara
Rao (Director), the Principal, the Director of R&D, the Deans and Associate Deans,
and the Heads of Departments, for their constant support and motivation in making the
conference successful. The editors would also like to thank the Springer editorial
members for their constant support in publishing the proceedings on time. The ICCIDM
conference and proceedings owe acknowledgments to a huge congregation of people.
Contents

Multi-Sensor Data Fusion for Occupancy Detection Using
Dempster–Shafer Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Sunanta Sarkar, Amit Ghosh, and Sankhadeep Chatterjee
Sentiment Analysis: A Recent Survey with Applications
and a Proposed Ensemble Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Srishti Jain and Vishal Gupta
An Automated System for Facial Mask Detection and Social
Distancing during COVID-19 Pandemic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Rutuja Rashinkar, Swapnil Rokade, Pragati Janjal, Gauri Pawar,
and Swati Shinde
Detection of Insider Threats Using Deep Learning: A Review . . . . . . . . . . 41
P. Lavanya and V. S. Shankar Sriram
An Incisive Analysis of Advanced Persistent Threat Detection
Using Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
M. K. Vishnu Priya and V. S. Shankar Sriram
Intelligent Computing Systems for Diagnosing Plant Diseases . . . . . . . . . . 75
Maitreya Sawai, Sameer More, Prasanna Nagardhane,
Subodh Pandhare, and Manjiri Ranjanikar
Multimodal MRI Analysis for Segmentation of Intra-tumoral
Regions of High-Grade Glioma Using VNet and WNet Based Deep
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Sonal Gore, Prajakta Bhosale, Ashley George, Ashwin Mohan,
Prajakta Joshi, and Anuradha Thakare
Early Onset Alzheimer Disease Classification Using Convolution
Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Happy Ramani and Rupal A. Kapdi


A Study on Evaluating the Performance of Robot Motion Using
Gradient Generalized Artificial Potential Fields with Obstacles . . . . . . . . 113
Syed Muzamil Basha, Syed Thouheed Ahmed, and Naif K. Al-Shammari
Exploratory Analysis of Kidney Disease Data Set—A Comparative
Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Aniket Muley and Sagar Joshi
Deepfakes for Video Conferencing Using General Adversarial
Networks (GANs) and Multilingual Voice Cloning . . . . . . . . . . . . . . . . . . . . 137
Jayesh Shelar, Dipak Ghatole, Mayank Pachpande,
Dhanashree Bhandari, and S. V. Shinde
Topic Evolution Model for Interactive Information Search . . . . . . . . . . . . 149
Harshal Adhav and Vikram Singh
A Novel Automated Human Face Recognition and Temperature
Detection System Using Deep Neural Networks—FRTDS . . . . . . . . . . . . . 165
Varalatchoumy M and Pranav Durai
A Novel BFS and CCDS-Based Efficient Sleep Scheduling
Algorithm for WSN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
B. Srinivasa Rao
Face Recognition: A Review and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Amit Verma, Aarti Goyal, Nitish Kumar, and Hitesh Tekchandani
COVID-19 Time Series Prediction and Lockdown Effectiveness . . . . . . . . 211
Rajdeep Biswas and Soumi Dutta
Performance Evaluation of Electrogastrogram (EGG) Signal
Compression for Telemedicine Using Various Wavelet Transform . . . . . . 225
M. Gokul, M. Sameera Fathimal, S. Jothiraj, and Pradeep Murugesan
The Impact of UV-C Treatment on Fruits and Vegetables
for Quality and Shelf Life Improvement Using Internet of Things . . . . . . 235
N. Sneha and Bhagya M. Patil
Modeling and Forecasting Stock Closing Prices with Hybrid
Functional Link Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Subhranginee Das, Sarat Chandra Nayak, and Biswajit Sahoo
Whale Optimization Algorithm Based Optimal Power Flow
to Reduce Generation Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
T. Papi Naidu, B. Venkateswararao, and G. Balasubramanian
An Artificial Electric Field Algorithm and Artificial Neural
Network-Based Hybrid Model for Software Reliability Prediction . . . . . . 271
Ajit Kumar Behera, Mrutyunjaya Panda, Sarat Chandra Nayak,
and Ch. Sanjeev Kumar Dash

Disaster Event Detection from Text: A Survey . . . . . . . . . . . . . . . . . . . . . . . . 281
Anchal Gupta, Monika Rani, and Sakshi Kaushal
Context-Adaptive Content-Based Filtering Recommender System
Based on Weighted Implicit Rating Approach . . . . . . . . . . . . . . . . . . . . . . . . 295
K. Navin and M. B. Mukesh Krishnan
A Deep Learning-Based Classifier for Remote Sensing Images . . . . . . . . . 309
Soumya Ranjan Sahu and Sucheta Panda
Performance Evaluation of Machine Learning Algorithms
to Predict Breast Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
S. Siva Sunayna, S. N. Thirumala Rao, and M. Sireesha
Topology Dependent Ant Colony-Based Routing Scheme
for Software-Defined Networking in Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
B. S. Shylaja, S. R. Deepu, and R. Bhaskar
On Computational Complexity of Transfer Learning Approaches
in Facial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Alexandra-Ștefania Moloiu, Grigore Albeanu,
and Florin Popențiu-Vlădicescu
Adaptive Classifier Using Extreme Learning Machine
for Classifying Twitter Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
M. Arun Manicka Raja, S. Swamynathan, and T. Sumitha
Decision Making on Covid-19 Containment Zones’ Lockdown Exit
Process Using Fuzzy Soft Set Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
R. K. Mohanty, B. K. Tripathy, and Sudam Ch. Parida
Deep Learning on Landslides: An Examination of the Potential
Commitment an Expectation of Danger Evaluation in Sloping
Situations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
J. Aruna Jasmine and C. Heltin Genitha
The Good, The Bad, and The Missing: A Comprehensive Study
on the Rise of Machine Learning for Binary Code Analysis . . . . . . . . . . . . 397
S. Priyanga, Roopak Suresh, Sandeep Romana, and V. S. Shankar Sriram
Ensemble Machine Learning Approach to Detect Various Attacks
in a Distributed Network of Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Aparna Pramanik and Asit Kumar Das
Predictive Analytics of Engineering and Technology Admissions . . . . . . . 419
Sachin Bhoite, Punam Nikam, and Ajit More
Investigating the Impact of COVID-19 on Important Economic
Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
Debanjan Banerjee, Arijit Ghosal, and Imon Mukherjee
Classification of Tumorous and Non-tumorous Brain MRI Images
Based on a Deep-Convolution Neural Network Model . . . . . . . . . . . . . . . . . 445
Debkumar Chowdhury, Sanjukta Mishra, Sonu Kumar,
Shiwam Kumar Prasad, Sourav Kumar Mandal, Gourab Biswas,
Devesh Sharma, Vishal Lohia, and Kartik Sau
Social Distance Monitoring and Face Mask Detection Using Deep
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
K. Yagna Sai Surya, T. Geetha Rani, and B. K. Tripathy
An Effective VM Consolidation Mechanism by Using
the Hybridization of PSO and Cuckoo Search Algorithms . . . . . . . . . . . . . 477
Sudheer Mangalampalli, Pokkuluri Kiran Sree,
S. S. S. N. Usha Devi N, and Ramesh Babu Mallela
Customer Segmentation via Data Mining Techniques:
State-of-the-Art Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Saumendra Das and Janmenjoy Nayak
Solar Radiation Prediction Using Artificial Neural Network:
A Comprehensive Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
Bireswar Paul and Hrituparna Paul
A Concise Review on Automatic Text Summarization . . . . . . . . . . . . . . . . . 523
Dishank Jani, Nehal Patel, Hemant Yadav, Sanket Suthar,
and Sandip Patel
Identification of Heart Failure in Early Stages Using
SMOTE-Integrated AdaBoost Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
B. Kameswara Rao, U. D. Prasan, Mokka Jagannadha Rao,
Rajyalaxmi Pedada, and Pemmada Suresh Kumar
A Comparative Study of Different Forecasting Models for Energy
Demand Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Tanvir Islam, Saber Elsayed, Daryl Essam, and Ruhul Sarker
Sentimental Analysis of Streaming COVID-19 Twitter Data
on Spark-Based Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
S. P. Preethi and Radha Senthilkumar
Efficient Approximate Multipliers for Neural Network Applications . . . . 577
Zainab Aizaz, Kavita Khare, and Aizaz Tirmizi
Explainable AI (XAI) for Social Good: Leveraging AutoML
to Assess and Analyze Vital Potable Water Quality Indicators . . . . . . . . . 591
Prakriti Dwivedi, Akbar Ali Khan, Sareeta Mudge, and Garima Sharma
Prediction of Dynamic Virtual Machine (VM) Provisioning
in Cloud Computing Using Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
Biswajit Padhi, Motahar Reza, Indrajeet Gupta, Poorna Sai Nagendra,
and Sarath S. Kumar
Explainability of Deep Learning-Based System in Health Care . . . . . . . . . 619
Shakti Kinger and Vrushali Kulkarni
A Hybrid MSVM COVID-19 Image Classification Enhanced
with Swarm Feature Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
Bhupinder Singh and Ritu Agarwal
QCM Sensor-Based Alcohol Classification Using Ensembled
Stacking Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Pemmada Suresh Kumar, Rajyalaxmi Pedada, Janmenjoy Nayak,
H. S. Behera, G. M. Sai Pratyusha, and Vanaja Velugula
A Novel Image Falsification Detection Using Vision Transformer
(Vi-T) Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Manikyala Rao Tankala and Ch. Srinivasa Rao
Design of Intelligent Framework for Intrusion Detection Platform
for Internet of Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Ch. Ravi Kishore, D. Chandrasekhar Rao, Janmenjoy Nayak,
and H. S. Behera
Autonomous Vehicles: A Survey on Sensor Fusion, Lane Detection
and Drivable Area Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
Tejas Morkar, Suyash Sonawane, Aditya Mahajan, and Swati Shinde
Identification of Malicious Access in IoT Network from Connection
Traces by Using Light Gradient Boosting Machine . . . . . . . . . . . . . . . . . . . 711
Etuari Oram, Bighnaraj Naik, and Manas Ranjan Senapati
Big Data in Education: Present and Future . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Janmenjoy Nayak, H. Swapnarekha, Ashanta Ranjan Routray,
Soumya Ranjan Nayak, and H. S. Behera
Breast Cancer Mammography Identification with Deep
Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
Pandit Byomakesha Dash, H. S. Behera, and Manas Ranjan Senapati
CatBoosting Approach for Anomaly Detection in IoT-Based Smart
Home Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
Dukka Karun Kumar Reddy and H. S. Behera
About the Editors

Dr. Janmenjoy Nayak is working as Assistant Professor, Department of Computer
Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University, Baripada,
Odisha, India. Being two times Gold Medalist in Computer Science in his career, he
has been awarded with INSPIRE Research Fellowship from DST, Govt. of India (both
as JRF and SRF level) and Best Researcher Award from Jawaharlal Nehru Univer-
sity of Technology, Kakinada, AP, for the AY: 2018–2019 and many other awards
from national and international academic agencies. He has edited 14 books and eight
special issues on the applications of computational intelligence, soft computing,
and pattern recognition, published by reputed international publishers. He has
published more than 170 refereed articles in various chapters, conferences, and
reputed international peer-reviewed journals of Elsevier, Inderscience, Springer, IEEE,
etc. He is Senior Member of IEEE and Life Member of some of the reputed soci-
eties like CSI India, IAENG (Hong Kong), etc. He has successfully conducted and is
being associated with 14 international reputed series conferences like ICCIDM, HIS,
ARIAM, CIPR, SCDA, etc. His area of interest includes data mining, nature-inspired
algorithms, and applied artificial intelligence.

Dr. H. S. Behera is working as Associate Professor in the Department of Information
Technology, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha,
India. He has received M.Tech. in Computer Science and Engineering from National
Institute of Technology (NIT), Rourkela, Odisha, India, and Doctor of Philosophy in
Engineering (Ph.D.) from Biju Pattnaik University of Technology (BPUT), Rourkela,
Government of Odisha, India, respectively. His research and development experience
includes over 19 years in academia spanning different technical institutes in India.
His research interest includes data mining, soft computing, evolutionary computa-
tion, machine intelligence, and distributed system. He has authored/co-authored over
150+ journal/conferences papers and chapters. He has edited 11 books and serves
as Associate Editor/Member of the editorial/reviewer board of various international
journals and also guest edited 08 special issues on various topics of Inderscience and
IGI Global Journals. He has produced eight Ph.D.s in the area of data mining and
time series forecasting using soft computing techniques.


Dr. Bighnaraj Naik is Assistant Professor in the Department of Computer Applications,
Veer Surendra Sai University of Technology, Burla, Odisha, India. He received
his doctoral degree from the Department of Computer Science Engineering and Infor-
mation Technology, Veer Surendra Sai University of Technology, Burla, Odisha,
India, Master’s degree from SOA University, Bhubaneswar, Odisha, India, and Bach-
elor’s degree from National Institute of Science and Technology, Berhampur, Odisha,
India. He has published more than 120 research papers in various reputed peer-
reviewed international conferences, refereed journals and chapters. He has more than
five years of teaching experience in the field of computer science and information
technology. His area of interest includes data mining, soft computing, etc.

Dr. S. Vimal is working in the Department of Computer Science and Engineering,
Ramco Institute of Technology, Tamil Nadu, India. He has around fourteen years
of teaching experience and is an EMC-certified Data Science Associate and a
CCNA-certified professional. He holds a Ph.D. in Information and Communication
Engineering from Anna University, Chennai, and he received his Master's degree from Anna
University, Coimbatore. He organized various funded workshops and seminars. He
has wide publications in the highly impact journals in the area of data analytics,
networking, and security issues and published four chapters. He has hosted two
special sessions for IEEE sponsored conference in Osaka, Japan, and Thailand. His
areas of interest include game modeling, artificial intelligence, machine learning, and
big data analytics. He is Senior Member in IEEE, ACM, and ISTE. He has hosted
21 special issues in IEEE, Elsevier, and Springer journals. He has served as Guest
Editor for SCI journals and edited three books in reputed International publishers.

Dr. Danilo Pelusi has received the Ph.D. degree in Computational Astrophysics
from the University of Teramo, Italy. Presently, he holds the position of Associate
Professor at the Faculty of Communication Sciences, University of Teramo.
Associate Editor of IEEE Transactions on Emerging Topics in Computational Intel-
ligence, IEEE Access, International Journal of Machine Learning and Cybernetics
(Springer) and Array (Elsevier), he served as Guest Editor for Elsevier, Springer
and Inderscience journals, as Program Member of many conferences and as Edito-
rial Board Member of many journals. Reviewer of reputed journals such as IEEE
Transactions on Fuzzy Systems and IEEE Transactions on Neural Networks and
Machine Learning, his research interests include intelligent computing, communi-
cation system, fuzzy logic, neural networks, information theory, and evolutionary
algorithms.
Multi-Sensor Data Fusion for Occupancy
Detection Using Dempster–Shafer
Theory

Sunanta Sarkar, Amit Ghosh, and Sankhadeep Chatterjee

Abstract A data fusion technique has been proposed in this paper for detecting the
presence or absence of individuals in a room. It uses data from a series of sensors,
mainly temperature and humidity sensors, to detect the presence of a person. From
the perspective of evidence theory, the data collected from every sensor can be
viewed as a piece of evidence. A Dempster–Shafer (D-S) theory-based data fusion
model is established for modeling and consolidating the pieces of evidence and
hence generating an overall estimate of the temperature and humidity level of a
room. Testing has been carried out with a dataset that has two classes. At first,
detection is performed using several well-known classifiers: logistic regression
achieves an accuracy of 94%, K-nearest neighbors 93%, support vector machines
94%, and the decision tree and random forest classifiers 92% and 93%, respectively.
A subset of the data is used to create class membership probabilities for every
attribute during training, and hence a mass function is created. Finally, the D-S rule
is applied, and the outcomes suggest that the data fusion method gives a higher
accuracy level than the classifiers alone.

Keywords Multi-sensor data fusion · Dempster–Shafer evidence theory ·
Dempster's rule of combination

1 Introduction

Occupancy of a room is one of the most useful pieces of information in smart homes
and in places where security is of utmost importance, such as bank locker rooms
and jewelry stores. In the current COVID-19 situation, large gatherings are not
allowed, a restriction that can be monitored with the proposed system. Human
presence inside a room can be identified using temperature and humidity sensors [1].
Several temperature and humidity sensors were placed in a room, and

S. Sarkar · A. Ghosh · S. Chatterjee (B)
Department of Computer Science and Engineering, University of Engineering and Management,
Kolkata, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 1
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_1

the corresponding readings were taken both in the absence and presence of humans.
Then the data was processed using five types of classifiers, namely logistic regression,
K-nearest neighbors, support vector machines, decision tree classifier and random
forest classifier, and their accuracy level was noted.
Multi-sensor data fusion is a technique for combining evidence from a series of
different sensors in order to form a unified picture [2]. It is a significant
technique for producing more consistent, accurate and efficient information than that
provided by an individual sensor. The sheer number of applications of multi-sensor
data fusion shows that it is in tremendous demand, particularly in the fields of
artificial intelligence, autonomous vehicle control, target recognition, image fusion
and gear fault diagnosis in strong noise [3–8]. Within our work, multi-sensor data
fusion has been applied to occupancy detection in a room.
It has been found that different degrees of uncertainty and inconsistency are
associated with different sensors [9]. Inconsistency in sensor data is often the
consequence of inherent limits on the precision with which the data is obtained, or
of limitations such as the accuracy and age of the sensors. Even when sensor
readings are consistent and precise, there is still a possibility of imperfection and
uncertainty, for example, if one or more sensors are suspected of being defective.
This uncertainty may result in a contradictory conclusion. Since information derived
from the sources can sometimes be inconsistent and uncertain, a fusion mechanism
is used to reduce the inconsistency and imprecision.
Dempster–Shafer evidence theory and Bayesian theory are the two most used
statistical theories for evidence combination [10]. Though Bayesian theory is
unanimously agreed to be correct, many researchers criticize it for its
"subjectiveness", and it struggles to represent complete ignorance. D-S evidence
theory can be viewed as an augmentation of traditional probabilistic theory that
deals with data uncertainty without requiring prior probabilities; Bayesian theory
is a special case of D-S theory. Within this paper, data from different temperature
and humidity sensors was first classified using only the classifiers, then a data
fusion mechanism using D-S theory was applied, and the results were compared. It
was observed that the accuracy level was higher after applying the data fusion
mechanism.
Section 2 depicts the proposed method where preliminary concepts of all the five
classifiers, namely logistic regression, K-nearest neighbors, support vector machines,
decision tree classifier and random forest classifier, that have been used to perform
classification have been described along with the method used for multi-sensor data
fusion. Preliminary concept of Dempster–Shafer evidence theory has been described
briefly in Sect. 3. Section 4 deals with the results and discussions where the effective-
ness of the proposed data fusion method has been demonstrated. Section 5 provides
the conclusion.

2 Proposed Method

In the current work, five different classifiers were used to perform classification before
performing data fusion, namely logistic regression, K-nearest neighbors, support
vector machines, decision tree classifier and random forest classifier.
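The five classifiers can be reproduced with scikit-learn. The sketch below is illustrative only: since the sensor readings used in this paper are not publicly available, it trains on a synthetic six-attribute stand-in generated with `make_classification` (an assumption, not the paper's data).

```python
# Hypothetical sketch: training the five classifiers compared in this paper on a
# synthetic stand-in for the sensor dataset (the real readings are not public).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Six numeric attributes, mirroring Temperature 1-3 and Humidity 1-3.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)  # train on the 70% split, evaluate on the 30% split
    print(f"{name}: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

On the real 2000-item dataset with its 70–30% split, the paper reports accuracies of 92–94% before fusion.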

2.1 Logistic Regression

A classifier named logistic regression has been used in the multi-sensor data
fusion study for occupancy detection. It is one of the simplest machine learning
algorithms and is easy to implement. As logistic regression yields well-calibrated
probabilities along with classification results, it turns out to be efficient for the
information obtained from the sensors. Logistic regression does not need a large
number of computational resources and is highly interpretable. It is easy to
regularize and does not need any tuning. Logistic regression works better when we
eliminate attributes that are irrelevant to the output variable and attributes that
are highly correlated with each other.

2.2 K-Nearest Neighbors

Classifier named KNN has also been utilized earlier for the study of multi-sensor
data fusion technique. The aim of utilizing this classifier is to show how much the
accuracy and precision metric change after applying proposed data fusion method to
this classifier. This classifier is a lot quicker in comparison with other classifiers. It
has also been found that KNN works better than many classifiers under the study of
occupancy detection.

2.3 Support Vector Machines

A classifier named SVM has been used to analyze the data for classification
and regression analysis. SVM has been applied both before and after the multi-
sensor data fusion technique in the occupancy detection study, with the aim of
showing whether there is any change in the accuracy or precision metrics.
SVM works admirably when there is a clear separation between classes. It
is relatively memory efficient, and overfitting is less of a problem with SVM.

2.4 Decision Tree Classifier

A hierarchical classifier named decision tree has also been used in the multi-
sensor data fusion study for occupancy detection. It is a flowchart-like structure in
which each inner node denotes a test on a feature, each leaf node denotes a class
label, and branches denote conjunctions of features that lead to those class labels.
The paths from root to leaf encode classification rules. No feature scaling is needed
for decision trees, and the training time is likewise low.
Decision trees are easy to explain; they follow a similar procedure to the one
people follow when making choices.

2.5 Random Forest Classifier

Another classifier named random forest has been used in the occupancy detection
study. It builds multiple decision trees on the provided data, collects the prediction
from each of these trees, and then picks the optimal result through voting. It is an
ensemble method and has been found to work in a more precise and optimal manner
than a single decision tree.
It can automatically deal with missing values, and it can detect and handle
outliers naturally.

2.6 Proposed Methodology

Figure 1 depicts the proposed method, built on a series of DHT11 temperature
and humidity sensors used for occupancy detection. The data collected from the
sensors is fused into a unified picture using the D-S combination rule, and the
hypothesis with the highest mass value is selected. If the hypothesis contains a
singleton class, that class is returned; otherwise, the feature selection value (FSV)
is calculated and the attribute with the smallest FSV is selected. The differences
of the selected attribute are then calculated, and finally the class with the smallest
absolute difference is found and returned.
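The decision step of Fig. 1 can be sketched as follows, assuming (hypothetically; the paper does not fix a data structure) that the fused masses are kept as a dict mapping frozensets of class labels to mass values:

```python
# Minimal sketch of the Fig. 1 decision step over fused D-S masses.
def decide(fused_masses):
    """Return a singleton class, or the ambiguous hypothesis for the FSV step."""
    best = max(fused_masses, key=fused_masses.get)  # hypothesis with highest mass
    if len(best) == 1:
        return next(iter(best))  # singleton hypothesis: classify directly
    return best                  # non-singleton: defer to the FSV tie-break

masses = {frozenset({"Occupied"}): 0.7,
          frozenset({"Occupied", "Not Occupied"}): 0.3}
print(decide(masses))  # prints Occupied
```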

3 Dempster–Shafer Evidence Theory

The Dempster–Shafer evidence theory is derived from the theory proposed by
Dempster [11] and later modified by Shafer [12]. It is a

[Fig. 1 flowchart: Sensor 1 … Sensor n → D-S combination → select the hypothesis
with the highest mass value → if the hypothesis contains a singleton class, return
that class; otherwise calculate the FSV of the classes, select the attribute with
smallest FSV, calculate the differences of the selected attributes, find the class
with smallest absolute difference, and return that class]

Fig. 1 Multi-sensor data fusion model based on Dempster–Shafer evidence theory

generalization of classical probability theory. It can deal with uncertain data more
readily. Some mathematical terminologies used in D-S theory are as follows.

3.1 Frame of Discernment

Let ω be the frame of discernment. To construct it, a finite set of mutually
exclusive propositions and hypotheses is identified; in our case these concern
temperature and humidity. The power set of ω consists of all subsets of ω and is
denoted 2^ω [13].

3.2 Mass Function

A mass value between 0 and 1 is assigned to each subset of the power set:

f : 2^ω → [0, 1]

f(∅) = 0 and Σ_{B⊆ω} f(B) = 1

where f(B) represents the proportion of available evidence supporting the claim
that the actual state belongs to B itself and not to any particular subset of B [13].
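As an illustration, a mass function over the two-class frame of this paper can be represented as a dict keyed by frozensets; the two conditions above are then simple checks (the mass values here are invented for the example):

```python
from itertools import chain, combinations

# Sketch: a mass function over the power set of the frame omega, plus a check
# of the two conditions f(empty set) = 0 and sum over all subsets = 1.
def power_set(omega):
    s = list(omega)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

omega = {"Occupied", "Not Occupied"}
f = {frozenset(): 0.0,
     frozenset({"Occupied"}): 0.6,
     frozenset({"Not Occupied"}): 0.3,
     frozenset(omega): 0.1}

assert f[frozenset()] == 0.0                 # mass on the empty set is zero
assert abs(sum(f.values()) - 1.0) < 1e-9     # masses sum to one
print(len(power_set(omega)))  # prints 4 (the 2^2 subsets of a two-class frame)
```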

3.3 Degree of Belief and Plausibility Degree

The lower and upper bounds of a probability interval can be computed from two
measures derived from the mass: the belief degree (bl) and the plausibility degree
(pla) [13].
To evaluate the belief degree of B, bl(B), all the non-empty subsets of B are
found, their mass values are computed, and the values are summed as shown in
Eq. (1):

bl(B) = Σ_{C⊆B} f(C)    (1)

To evaluate the plausibility degree of B, pla(B), all sets that intersect B are
found, their masses are computed, and the values are summed as shown in Eq. (2):

pla(B) = Σ_{C∩B≠∅} f(C)    (2)

The relationship between the plausibility degree and the belief degree is shown
in Eq. (3):

pla(B) = 1 − bl(¬B)    (3)
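Equations (1) and (2) translate directly into code under the same dict-of-frozensets representation (the frame and mass values below are arbitrary illustrations):

```python
# Sketch of Eqs. (1)-(2): belief sums mass over non-empty subsets of B,
# plausibility over all sets intersecting B.
def belief(f, B):
    return sum(m for C, m in f.items() if C and C <= B)

def plausibility(f, B):
    return sum(m for C, m in f.items() if C & B)

f = {frozenset({"a"}): 0.5, frozenset({"b"}): 0.25, frozenset({"a", "b"}): 0.25}
B = frozenset({"a"})
print(belief(f, B), plausibility(f, B))  # prints 0.5 0.75
```

Note that Eq. (3) holds here: pla({a}) = 0.75 = 1 − bl({b}) = 1 − 0.25.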



3.4 Dempster’s Rule of Combination

Information has been collected from two independent and non-identical sensors
over the same frame of discernment ω. Let f_1 and f_2 be the two mass functions
established from them. The resultant mass function f is given by Eqs. (4) and (5):

f(B) = (f_1 ⊕ f_2)(B)    (4)

and

(f_1 ⊕ f_2)(B) = (1 / (1 − Q)) Σ_{C∩D=B≠∅} f_1(C) f_2(D)    (5)

where Q measures the conflict between the two mass functions and is evaluated
using the formula in Eq. (6):

Q = Σ_{C∩D=∅} f_1(C) f_2(D),    Q ≠ 1    (6)
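Dempster's rule itself is a short function under the same dict-of-frozensets representation; the first sensor's 0.9/0.1 assignment follows the rule used later in Sect. 3.5, while the second sensor's 0.8/0.2 split is an invented illustration:

```python
from itertools import product

# Sketch of Eqs. (4)-(6): Dempster's rule of combination for two mass functions.
def combine(f1, f2):
    # Conflict mass Q, Eq. (6): products of masses on disjoint hypotheses.
    Q = sum(m1 * m2 for (C, m1), (D, m2) in product(f1.items(), f2.items())
            if not (C & D))
    if Q == 1:
        raise ValueError("total conflict: combination undefined")
    fused = {}
    for (C, m1), (D, m2) in product(f1.items(), f2.items()):
        B = C & D
        if B:  # Eq. (5): normalize by 1 - Q and accumulate on the intersection
            fused[B] = fused.get(B, 0.0) + m1 * m2 / (1 - Q)
    return fused

omega = frozenset({"Occupied", "Not Occupied"})
f1 = {frozenset({"Occupied"}): 0.9, omega: 0.1}  # sensor 1 evidence
f2 = {frozenset({"Occupied"}): 0.8, omega: 0.2}  # sensor 2 evidence
print(combine(f1, f2))  # mass concentrates on {"Occupied"} (about 0.98)
```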

3.5 Temperature and Humidity Evaluation

Consider a series of temperature and humidity sensors used for occupancy
detection. The dataset collected from the sensors was split into training and
testing data.
Suppose ω = {d_1, d_2, …, d_t} is the frame of discernment, where t is the number
of classes, and assume there are n sensors. First, the power set 2^ω is computed.
For each attribute of the training data, the minimum and maximum values of each
class are evaluated, and the overlap of these class ranges gives the boundary
information. This procedure is used to group the data items; however, in some
cases an attribute does not point to a singleton class. If the value of a data item
is greater than or equal to the minimum of class 1 and less than the minimum of
class 2, then class 1 is the possible class for that data item. If its value is greater
than or equal to the minimum of class 2, then there are two possible classes for
that item, so it can be assigned to either class 1 or class 2.
According to the Dempster–Shafer theory, the temperature and humidity classes
are evaluated by mass values. Let f_n(d_1), f_n(d_2), …, f_n(d_t) be the mass
function values of the classes (d_1, d_2, …, d_t), respectively, where each d covers
the two parameters temperature and humidity.
If a data item belongs to a single class, then

f_n(d_1) = 0.9, f_n(ω) = 0.1

i.e., the uncertainty f_n(ω) is 0.1.
If there are two possible classes for a data value,

f_n(d_1 ∪ d_2) = 0.9, f_n(ω) = 0.1

If there are three possible classes for a data value,

f_n(ω) = 1

i.e., the uncertainty f_n(ω) is 1.
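The mass-assignment rule just described (0.9 to the hypothesis suggested by the range check, 0.1 to ω, and full uncertainty when every class remains possible) can be sketched as:

```python
# Sketch of the Sect. 3.5 mass-assignment rule for one sensor reading.
def mass_from_possible_classes(possible, omega):
    possible = frozenset(possible)
    if possible == frozenset(omega):      # every class possible: full ignorance
        return {frozenset(omega): 1.0}
    return {possible: 0.9, frozenset(omega): 0.1}

omega = {"d1", "d2", "d3"}
print(mass_from_possible_classes({"d1"}, omega))        # singleton evidence
print(mass_from_possible_classes({"d1", "d2"}, omega))  # two possible classes
print(mass_from_possible_classes(omega, omega))         # complete ignorance
```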


The mass values are combined using Dempster's rule of combination (DRC),
producing the fused mass values. The hypothesis with the highest combined mass
is selected and used for classifying the data items. If the hypothesis cannot be
resolved to a singleton class, a further step is performed: standard deviations are
evaluated for the non-singleton classes of every attribute and for their union. The
feature selection value is then evaluated using the formula in Eq. (7):

FSV = (sd(l_1) × sd(l_2) × ··· × sd(l_t)) / sd(l_1 ∪ l_2 ∪ ··· ∪ l_t)    (7)

where t, a natural number, is the number of classes.
For each class, the absolute difference d between the value of the attribute with
the smallest FSV and that class's mean value of the attribute is evaluated using
Eq. (8), as shown in Fig. 1:

d_i = |a_i − ā_i|,    i = 1, 2, …, t    (8)

The data item with the smallest d value is classified.
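Equations (7) and (8) can be sketched as follows, assuming per-class lists of attribute values and assuming the second term in Eq. (8) is the class mean of the attribute (the temperature values below are invented):

```python
from statistics import mean, pstdev

# Sketch of Eq. (7): FSV is the product of per-class standard deviations of an
# attribute divided by the standard deviation of their union.
def fsv(class_values):
    union = [v for vals in class_values for v in vals]
    prod = 1.0
    for vals in class_values:
        prod *= pstdev(vals)
    return prod / pstdev(union)

# Sketch of Eq. (8): assign the item to the class with the smallest absolute
# difference between the attribute value and the class mean.
def nearest_class(x, class_values):
    diffs = [abs(x - mean(vals)) for vals in class_values]
    return diffs.index(min(diffs))

temp = [[20.0, 21.0, 20.5], [26.0, 27.0, 26.5]]  # per-class attribute values
print(fsv(temp))
print(nearest_class(26.2, temp))  # prints 1 (closer to the second class mean)
```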

4 Results and Discussion

A series of sensors have been used for occupancy detection. The sensor used in our
project is DHT11. The DHT11 is a commonly used temperature and humidity sensor.
A series of DHT11 sensors was connected to an Arduino and placed in a room.
Data from the sensors was collected for a week. Temperature and humidity values
for all the sensors were observed.
The dataset has 2000 data items with two classes. It has the following numeric
attributes: Temperature 1, Temperature 2, Temperature 3, Humidity 1, Humidity 2
and Humidity 3. The classes are Not Occupied and Occupied. Each class contains

Table 1 Results obtained after using logistic regression

Class          Precision   Recall   F1-score
Not occupied   0.97        0.91     0.94
Occupied       0.91        0.97     0.94

Table 2 Results obtained after using KNN

Class          Precision   Recall   F1-score
Not occupied   0.94        0.92     0.93
Occupied       0.92        0.95     0.93

Table 3 Results obtained after using SVM

Class          Precision   Recall   F1-score
Not occupied   0.98        0.91     0.94
Occupied       0.91        0.98     0.95

Table 4 Results obtained after using decision tree

Class          Precision   Recall   F1-score
Not occupied   0.94        0.90     0.92
Occupied       0.90        0.94     0.92

1000 instances. The data has been apportioned into a training set and a test set
with a 70–30% split.
The current section reports the experimental results obtained after the experiments
have been carried out following the procedure depicted in Sect. 2.6. Tables 1, 2, 3, 4
and 5 depict the performance of classes (Not Occupied and Occupied) after applying
various classification algorithms. The performance shown is based upon three
metric values: precision, recall and F1-score.
In Tables 6 and 7, the experimental outcomes of the proposed models have been
shown. The results suggest that the performance of logistic regression before applying
data fusion has an accuracy of 94%, precision 91%, F1-score 94% and recall 97%,
while after applying data fusion, the accuracy increases to 99%, precision 97%, F1-
score 98% and recall 99%. After applying data fusion, the accuracy of KNN increases
from 93 to 97%, precision from 92 to 96%, recall from 95 to 97% and F1-score from
93 to 97%. After applying data fusion, the accuracy of SVM increases from 94 to
97%, precision from 91 to 96%, recall from 98 to 99% and F1-score from 95 to
97%. After applying data fusion, the accuracy of the decision tree increases from 92
to 98%, precision from 90 to 97%, recall from 94 to 98% and F1-score from 92

Table 5 Results obtained after using random forest

Class          Precision   Recall   F1-score
Not occupied   0.95        0.90     0.92
Occupied       0.91        0.95     0.93

Table 6 Performance of various classifiers under the study before applying proposed data fusion
technique
Classifier            Accuracy   Precision   F1-score   Recall
Logistic regression   0.94       0.91        0.94       0.97
KNN                   0.93       0.92        0.93       0.95
SVM                   0.94       0.91        0.95       0.98
Decision tree         0.92       0.90        0.92       0.94
Random forest         0.93       0.91        0.93       0.95

Table 7 Performance of various classifiers under the study after applying proposed data fusion
technique
Classifier            Accuracy   Precision   F1-score   Recall
Logistic regression   0.99       0.97        0.98       0.99
KNN                   0.97       0.96        0.97       0.97
SVM                   0.97       0.96        0.97       0.99
Decision tree         0.98       0.97        0.98       0.98
Random forest         0.99       0.97        0.98       0.99

to 98%. After applying data fusion, the accuracy of random forest increases from
93 to 99%, precision from 91 to 97%, recall from 95 to 99% and F1-score from
93 to 98%. Hence, it is evident that before applying data fusion, logistic regression
and SVM have the highest performance, while after applying data fusion, logistic
regression and random forest have the highest performance. The results also suggest
that Dempster–Shafer-based data fusion method has increased the performance of
all the models to a greater extent.
Figure 2 shows the accuracy of each classification algorithm on the collected
data before and after applying the proposed data fusion algorithm; the same
comparison has been made for precision, recall and F1-score. It reveals that data
fusion improves the performance of all the classifiers under the current study.

5 Conclusion

Multi-sensor data fusion technique has been applied for occupancy detection within
this paper. We have applied Dempster–Shafer evidence theory for multi-sensor data
fusion. Within our work, a technique for computing mass function using temperature
and humidity as parameters has been proposed. The dataset was classified into two
possible classes using a two-step approach based on the class boundary values in
the training data for every feature. Furthermore, the data items that were not
assigned to a singleton class were classified further using standard
[Fig. 2: five bar-chart panels (accuracy axis 0.85–1.0), one per classifier
(logistic regression, KNN, SVM, decision tree, random forest), each comparing
the measure before and after data fusion]

Fig. 2 Comparison of performance measure of logistic regression, SVM, KNN, decision tree and
random forest before and after data fusion

deviation measures for choosing an attribute for ultimate classification. The decision
rule to ascertain the class based on the fusion mass values has been proposed. Our
experiment has demonstrated that the proposed data fusion method enhances the
performance of classifiers under our study to a greater extent.
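The fusion step itself is Dempster's rule of combination. A minimal sketch over the frame {occupied, empty} follows; the sensor mass values here are made-up numbers for illustration, whereas in the paper the masses come from the proposed temperature- and humidity-based mass function.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.
    Mass functions map frozensets of hypotheses to belief mass."""
    fused, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb          # mass that lands on the empty set
    # Normalize by 1 - K to redistribute the conflicting mass
    return {h: v / (1.0 - conflict) for h, v in fused.items()}

O, E = frozenset({"occupied"}), frozenset({"empty"})
THETA = O | E                            # full frame of discernment

# Illustrative masses from two sensors (invented numbers)
m_temp = {O: 0.6, E: 0.3, THETA: 0.1}
m_hum = {O: 0.7, E: 0.2, THETA: 0.1}

fused = combine(m_temp, m_hum)
```

The decision rule then simply picks the singleton hypothesis with the largest fused mass.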
12 S. Sarkar et al.

References

1. J. Han, A. Shah, M. Luk, A. Perrig, Don’t sweat your privacy using humidity to detect human
presence (2007)
2. B. Khaleghi, A. Khamis, F.O. Karray, S.N. Razavib, Multisensor data fusion: a review of the
state-of-the-art. Inf. Fusion 14(1), 28–44 (2013)
3. G. Fortino, S. Galzarano, R. Gravina, W. Li, A framework for collaborative computing and
multi-sensor data fusion in body sensor networks. Inf. Fusion 22, 50–70 (2015)
4. A. Ardeshir Goshtasby, S. Nikolov, Image fusion: advances in the state of the art. Inf. Fusion
8(2), 114–118 (2007)
5. M. Panicker, T. Mitha, K. Oak, A.M. Deshpande, C. Ganguly, Multisensor data fusion for an
autonomous ground vehicle, in 2016 Conference on Advances in Signal Processing (CASP)
(Pune, India, 2016), pp. 507–512
6. G. Dong, G. Kuang, Target recognition via information aggregation through dempster–shafer’s
evidence theory. IEEE Geosci. Remote Sens. Lett. 12(6), 1247–1251 (2015)
7. F. Xiao, Multi-sensor data fusion based on the belief divergence measure of evidences and the
belief entropy. Inf. Fusion 46, 23–32 (2019)
8. G. Cheng, X.-H. Chen, X.-L. Shan, H.-G. Liu, C.-F. Zhou, A new method of gear fault diagnosis
in strong noise based on multi-sensor information fusion. J. Vib. Control 22(6), 1504–1515
(2016)
9. A. Ranganathan, J. Al-Muhtadi, R.H. Campbell, Reasoning about uncertain contexts in
pervasive computing environments. IEEE Pervasive Comput. 3(2), 62–70 (2004)
10. J. Zhou, L. Liu, J. Guo, L. Sun, Multisensor data fusion for water quality evaluation using
dempster-shafer theory (2013)
11. A.P. Dempster, Upper and lower probabilities induced by a multivalued mapping. Ann. Math.
Statist. 38, 325–339 (1967)
12. G. Shafer, A Mathematical Theory of Evidence (Princeton University Press, Princeton and
London, 1976)
13. Q. Chen, A. Whitbrook, U. Aickelin, C. Roadknight, Data classification using dempster-shafer
theory (2014)
Sentiment Analysis: A Recent Survey
with Applications and a Proposed
Ensemble Algorithm

Srishti Jain and Vishal Gupta

Abstract With the dramatic increase in the amount of data on e-commerce sites,
it becomes difficult to analyze sentiment manually. Sentiment analysis automatically
analyzes user sentiment in terms of positive, negative, or neutral using
natural language processing. This paper presents a survey on sentiment analysis
with its various approaches and algorithms in detail based on supervised and unsu-
pervised learning. Further, this paper analyzes the pros and cons of some current
studies with their approaches and compares their results on various datasets. Also,
this paper discusses the uses of sentiment analysis in real life. Finally, this paper
proposes a new hybrid algorithm carrying various features for sentiment analysis
using AdaBoost and majority voting ensemble techniques.

Keywords Sentiment analysis (SA) · Supervised learning · Unsupervised
learning · Pros and cons · Applications · Ensemble methods

1 Introduction

In the contemporary world, manually examining user perspectives is extremely
difficult, since we all struggle with data overload. Sentiment analysis has therefore
become one of the most vital subjects in today's research area for examining user
perspectives. Sentiments are the reviews given by users regarding any product, event,
celebrity, etc. On various social media websites such as Facebook, Twitter, WhatsApp,
microblogs, and web forums, millions of netizens share their opinions.

S. Jain (B) · V. Gupta


Department of CSE, University Institute of Engineering and Technology (UIET), Panjab
University, Sector-25, Chandigarh, India
V. Gupta
e-mail: vishal@pu.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 13
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_2

The primary goal of sentiment analysis is to solve a classification problem that deter-
mines whether a review is positive, negative, or neutral. Sentiment analysis can be
done on many levels: word level, sentence level, document level, and aspect level
(feature level) [1]. Many researchers are experimenting with different natural
language processing approaches and algorithms to better understand sentiment
analysis based on machine learning. Machine learning algorithms are further
categorized into supervised and unsupervised learning approaches. Therefore, this
paper presents an overview of recent research based on machine learning models.
This paper also proposes an ensemble model which trains multiple classifiers using
ensemble methods such as AdaBoost and maximum voting.
The structure of this survey paper is as follows. Section 2 discusses senti-
ment analysis and its methods, along with a description of the findings. Section 3
examines sentiment analysis applications, Sect. 4 presents the proposed
methodology, and Sect. 5 gives illustrative examples. Finally, Sect. 6 presents
the conclusion.

2 Sentiment Analysis and Its Approaches

Sentiment analysis determines the views of users on social media based
on machine learning techniques, which are further divided into the approaches
described below:

2.1 Supervised-Based Learning Approach

Supervised learning is the most popular branch of machine learning, in which we
train the machine using labeled data. The goal of a supervised learning model is to
predict the correct label for newly presented data. The following are some of the
algorithms for supervised sentiment-based learning:

2.1.1 Naïve Bayes (NB)

The NB classifier is a probabilistic classifier based on Bayes' theorem with strong
independence assumptions between features. Because it is highly scalable, fast, and
simple to implement, it is widely used in text classification. Seref and Bostanci [2]
used Hadoop tools to perform sentiment analysis on datasets of various sizes,
classifying reviews as positive, negative, or neutral with NB and complement Naive
Bayes (CNB), and found that CNB provides the best overall accuracy. Abbas et al. [3] used multinomial
Naïve Bayes (MNB) and the term frequency-inverse document frequency (TF-IDF)
approach to classify movie reviews in text into positive and negative categories in
their paper. They used the TF-IDF method to concentrate on both the frequency and
uniqueness of words. To enhance the efficiency of CNB, Cahya et al. [4] used a
feature weighting method based on a genetic algorithm (GA).
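As an illustration of the MNB classification described above, here is a minimal from-scratch sketch with Laplace (add-one) smoothing, which Table 1 notes as solving the zero-probability problem. The toy labeled reviews are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs):
    """docs: list of (tokens, label). Returns class priors and
    Laplace-smoothed per-class term likelihoods."""
    class_tokens = defaultdict(list)
    for tokens, label in docs:
        class_tokens[label].extend(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: sum(1 for _, l in docs if l == c) / len(docs)
              for c in class_tokens}
    likelihood = {}
    for c, tokens in class_tokens.items():
        counts = Counter(tokens)
        total = len(tokens) + len(vocab)        # add-one (Laplace) smoothing
        likelihood[c] = {t: (counts[t] + 1) / total for t in vocab}
    return priors, likelihood

def predict(tokens, priors, likelihood):
    """Pick the class with the highest log posterior."""
    scores = {c: math.log(priors[c]) +
                 sum(math.log(likelihood[c].get(t, 1e-9)) for t in tokens)
              for c in priors}
    return max(scores, key=scores.get)

# Toy labeled reviews (invented for illustration)
train = [(["great", "movie", "loved"], "pos"),
         (["bad", "boring", "movie"], "neg")]
priors, likelihood = train_mnb(train)
```

Thanks to smoothing, a word seen only in one class still gets a small nonzero likelihood in the other, so no single unseen word zeroes out a posterior.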

2.1.2 Support Vector Machine (SVM)

SVM is a text classification method based on supervised machine learning.
Classification is performed by finding the hyperplane that maximizes the margin
between the separated classes. It is capable of solving both linear and nonlinear
problems; the kernel trick is used to transform a non-separable problem into a
separable one. Linear, polynomial, and radial basis function (RBF) are three common
kernel functions. An improved RBF kernel of SVM (SVM-IRBF) was used by the authors
Gopi et al. [5]. Kumar and Subba [6] proposed a TF-IDF vectorizer and SVM-based
sentiment analysis at the document level to analyze the polarity of documents in a
text data corpus.
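The three kernel functions mentioned above have standard closed forms and can be written out directly; the degree, constant c, and gamma values below are arbitrary defaults, not values from the cited studies.

```python
import math

def linear(x, y):
    """Linear kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=3, c=1.0):
    """Polynomial kernel: (x . y + c)^degree."""
    return (linear(x, y) + c) ** degree

def rbf(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

The kernel trick consists of replacing dot products in the SVM optimization with one of these functions, which implicitly maps the data into a higher-dimensional space where it may become linearly separable.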

2.1.3 K-Nearest Neighbors (KNN)

KNN is a popular supervised machine learning approach, where K represents a
value from 1 to n used to find the nearest neighbors. There are various distance
measures for KNN, such as Euclidean distance, Hamming distance, Manhattan
distance, Minkowski distance, and cosine distance. The fused-KNN algorithm of
Jiang et al. [7] was used to improve the performance of the standard KNN algorithm.
They used an improved neighbor sample selection strategy to correlate the attributes
of comments based on attribute-correlated distance and pick the k samples with the
smallest attribute-correlated distance as neighbors.
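A minimal sketch of KNN classification using cosine distance, one of the measures listed above, over bag-of-words vectors; the toy documents are invented for illustration.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def knn_predict(query, labelled, k=3):
    """labelled: list of (Counter, label); return the majority label
    among the k nearest documents."""
    nearest = sorted(labelled, key=lambda s: cosine_distance(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy labeled documents (invented for illustration)
train = [(Counter("good good film".split()), "pos"),
         (Counter("great film".split()), "pos"),
         (Counter("bad boring film".split()), "neg")]
```

Cosine distance is a natural choice for text because it compares word-frequency direction rather than raw document length.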

2.1.4 Maximum Entropy (ME)

The ME classifier is a probabilistic classifier that falls under the exponential model
category. ME modeling calculates the probability distribution with the highest entropy
consistent with the available evidence. Xie et al. [8] suggest a novel maximum entropy
model based on probabilistic latent semantic analysis (PLSA). The above results for
supervised learning approaches are summarized in Table 1.
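Schematically, an ME classifier scores each class with an exponential of weighted features and normalizes over classes (a softmax form). The sketch below uses invented feature weights purely for illustration.

```python
import math

def maxent_prob(features, weights):
    """P(c | x) = exp(sum_i lambda_{c,i} f_i(x)) / Z  -- softmax over classes.
    weights: {class: {feature: lambda}}."""
    scores = {c: math.exp(sum(w.get(f, 0.0) for f in features))
              for c, w in weights.items()}
    z = sum(scores.values())                 # normalization constant Z
    return {c: s / z for c, s in scores.items()}

# Invented feature weights for two sentiment classes
weights = {"pos": {"love": 2.0, "bad": -1.0},
           "neg": {"love": -1.0, "bad": 2.0}}
probs = maxent_prob(["love"], weights)
```

Training an ME model means fitting the lambda weights so the expected feature values match the training data; the prediction step shown here is just this normalized exponential scoring.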

2.2 Unsupervised-Based Learning Approach

It is a machine learning technique in which we train the machine using unlabeled data.
The main aim is to group unsorted data based on similarities and patterns between
them without the need for prior training. Clustering is one of the types of unsuper-
vised machine learning. The following are the types of unsupervised clustering-based
learning algorithms:

2.2.1 K-means Clustering

K-means clustering is a partitional clustering method that divides the dataset into
a predefined number of k clusters. Orkphol and Yang [15] improved K-means
clustering for a high-dimensional dataset with high sparseness. They used TF-IDF for
selecting appropriate features and the singular value decomposition (SVD) technique
to minimize the dataset’s high dimensionality. They suggested a new approach for
finding the best initial state of centroids called artificial bee colony (ABC). Rehioui
and Idrissi [16] proposed a new algorithm merging two clustering algorithms: K-
means and density-based clustering (DENCLUE) and its variants to analyze the
sentiments of tweets.
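One iteration of the standard K-means procedure (assign each point to its nearest centroid, then recompute each centroid as its cluster mean) can be sketched as follows; this is the generic Lloyd step, not the improved variants of [15, 16].

```python
def kmeans_step(points, centroids):
    """One K-means iteration: assign points to the nearest centroid by
    squared Euclidean distance, then recompute centroids as cluster means."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
        clusters[nearest].append(p)
    new_centroids = [tuple(sum(dim) / len(dim) for dim in zip(*pts))
                     if pts else centroids[i]
                     for i, pts in clusters.items()]
    return clusters, new_centroids
```

Repeating this step until the centroids stop moving yields the final clustering; the improvements surveyed above target the initial centroid choice (ABC) or combine K-means with density-based clustering.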

2.2.2 Hierarchical Clustering

Hierarchical clustering creates a cluster hierarchy and a dendrogram. It includes


agglomerative and divisive clustering. Single linkage, average linkage, complete
linkage, and centroid linkage are the four linkage methods used to measure the
distance between two clusters [17]. Suresh et al. [18] introduced the fuzzy hybrid
hierarchical clustering (FHHC) model, which combines bottom-up and top-down
approaches with fuzzy logic to deal with uncertainty. Using the Mamdani rule system,
Vashishtha and Susan [19] suggested a fuzzy rule-based unsupervised approach for
sentiment analysis of Twitter datasets. The main contribution of this paper is that it
can be used with any lexicon, such as SentiWordNet, AFINN, and VADER, as well as
with any two-class and three-class datasets.
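The four linkage methods mentioned above differ only in how the distance between two clusters is computed; they can be written out directly as follows.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def single_linkage(c1, c2, dist=euclidean):
    """Distance between the closest pair of points across the clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist=euclidean):
    """Distance between the farthest pair of points across the clusters."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist=euclidean):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_linkage(c1, c2, dist=euclidean):
    """Distance between the cluster centroids."""
    mean = lambda c: tuple(sum(d) / len(d) for d in zip(*c))
    return dist(mean(c1), mean(c2))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage distance, which is what produces the dendrogram.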

Table 1 Summary of results related to supervised learning approaches

Author (year) [Ref.] | Model used | Dataset | Pros | Cons
Seref and Bostanci (2018) [2] | NB, CNB | Amazon movie review [9], 20 newsgroup dataset [10] | Easy to implement | Assumption of independent features
Abbas et al. (2019) [3] | MNB | Movie review dataset | Laplace smoothing solves the problem of zero probability | Independent feature assumptions; does not perform well on unbalanced data
Cahya et al. (2019) [4] | Weighting approach and CNB | Airline Twitter dataset [11] | Based on the association between features and class labels, GA selects the best set of features | The imbalanced data, with the majority of elements being negative, is handled using a weighting method
Gopi et al. (2020) [5] | Improved RBF kernel SVM | Twitter movie review dataset | Fits well with high-dimensional data; takes low time to predict a score | –
Kumar and Subba (2020) [6] | TF-IDF vectorizer and SVM | IMDB movie review and Amazon electronic items review dataset [12] | Compared with polynomial and RBF kernels, the linear kernel is less prone to underfitting | The computational overhead increases as the n-gram range value is increased beyond trigram
Jiang et al. (2019) [7] | Fused-KNN | Tencent news and NetEase news dataset | Selected neighbors accurately represent the characteristics of the sample | When classifying negative samples, the outcome is not optimal
Xie et al. (2019) [8] | Maximum entropy-PLSA | Restaurant review [13] and movie review corpus [14] | More precisely measures semantic terms and overcomes the complexity of analyzing polysemy and synonyms | If the number of sentences in the test corpus grows, the probability of unlisted words grows

2.2.3 Fuzzy C-means Clustering (FCM)

The FCM algorithm is a popular fuzzy clustering method whose main aim is to
minimize the objective function. Trupthi et al. [20] proposed a new topic modeling
methodology based on latent Dirichlet allocation (LDA) to identify topics and possibilistic
fuzzy c-means (PFCM) for classification. The above results for unsupervised learning
approaches are summarized in Table 2.
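The core of FCM is the fuzzy membership update that alternates with centroid updates when minimizing the objective function. A sketch of the standard membership formula follows, with the usual fuzzifier m = 2 as a default.

```python
import math

def fcm_memberships(point, centroids, m=2.0):
    """Fuzzy C-means membership of one point in each cluster:
    u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)),  d_i = ||point - c_i||."""
    d = [max(math.dist(point, c), 1e-12) for c in centroids]  # avoid /0
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[i] / d[k]) ** exp for k in range(len(d)))
            for i in range(len(d))]
```

Unlike hard K-means, each point belongs to every cluster with a degree in [0, 1], and the memberships for a point always sum to 1.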

3 Applications of Sentiment Analysis

Sentiment analysis helps companies and service providers to understand their buyers’
and users’ mindsets so that they can modify their products and services to meet their
demands. Some of the applications are given below with their brief explanation.
(1) Product analysis: Before purchasing a product, a new consumer may use
sentiment analysis to decide whether reviewers are in favor of or against
it. A graph-based approach is utilized in a study [31] to categorize user reviews
of the Redmi Note1 smartphone as positive or negative.
(2) Movie review analysis: In the paper [32], SA is performed on the movie review
dataset using various machine learning classifiers to classify the dataset into
positive and negative.
(3) Customer feedback: In paper [33], SA is used to analyze the customer tweets
on US Airline Services using supervised machine learning algorithms. TF-IDF
and bag-of-words (BOW) feature vector representations were used with
bigrams.
(4) Stock market prediction: In paper [34], the sentiment analysis for stock price
prediction is taken into account using various classifiers such as Naïve Bayes,
SVM, and logistic regression.
(5) Government intelligence: Sentiment analysis is used to check the impact of
demonetization in India using machine learning methods such as KNN and
support vector classifier (SVC) [35].

4 Proposed Methodology

The detailed flowchart of the proposed approach is shown in Fig. 1.



Table 2 Summary of results related to unsupervised learning approaches

Author (year) [Ref.] | Model used | Dataset | Pros | Cons
Orkphol and Yang (2019) [15] | K-means clustering and artificial bee colony | iPhone X dataset | Solves the issue of high sparseness and dimensionality in a large dataset; the ABC algorithm is stable and has a small number of control parameters | Takes more computing time than normal K-means
Suresh et al. (2017) [18] | Hybrid hierarchical clustering | Twitter dataset | No need to specify the number of clusters; deals with uncertainty | Does not perform well in the presence of noise and outliers
Vashishtha and Susan (2019) [19] | A fuzzy rule-based unsupervised approach | The Sanders Twitter [21], The Nuclear Twitter [22], the Apple Twitter [23], The Stanford Twitter Sentiment Test Set (STS-Test) [24], The Sentiment140 Twitter dataset, SemEval 2017 [25], SemEval 2016 [26], SemEval 2015 [27] and Twitter data | Since it is unsupervised, it needs no training time and is unaffected by dataset size; the method can be applied to every lexicon and dataset | –
Rehioui and Idrissi (2020) [16] | K-means and DENCLUE | Twitter sentiment corpus-3 [21], Twitter dataset [28], Test data-manual [29], Twitter airline-sentiment [11] | Reduces a large number of clusters; leads to a good clustering performance | –
Trupthi et al. (2018) [20] | LDA-PFCM | Twitter-sanders-apple2 dataset [30] | Identifies the relevant topic related to the keyword and improves the accuracy | –

Fig. 1 Proposed architecture for sentiment analysis: Data Collection → Data Preprocessing →
Feature Vector Extraction → Feature Selection (filter method) → classifiers (SVM-IRBF, CNB,
MNB, RF, ME, KNN) → Ensemble Techniques (Majority Voting & AdaBoost) → Performance
Evaluation

4.1 Preprocessing

In the first stage, data that does not contain any emotion is removed, to keep the
later stages of sentiment analysis manageable, and the preprocessed data is stored
in the F1_Data file. The preprocessing steps are:
• Tokenize the data so that it can express in words.
• Convert all words to lower case.
• Remove all the web addresses and stop words that do not contain any emotion by
comparing them with the stop words dictionary [36].
• Replace all the acronyms with their true meanings by contrast with an acronym
dictionary, such as (2day → today) [37].
• Stemming
• POS (part of speech) tagging like “This mobile is good and perfect.” becomes
“This/DT mobile/NN is/VBZ very/RB good/JJ and/CC perfect/JJ.”
• Punctuation and emoticons are not removed.
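The steps above can be sketched as a small pipeline. The stop-word and acronym dictionaries here are tiny stand-ins for the ones cited in [36, 37], and stemming and POS tagging are omitted for brevity.

```python
import re

STOPWORDS = {"the", "is", "a", "and"}           # stand-in for the dictionary in [36]
ACRONYMS = {"2day": "today", "gr8": "great"}    # stand-in for the dictionary in [37]

def preprocess(text):
    # Remove web addresses, then lower-case the text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text.lower())
    # Tokenize, keeping emoticons such as ":(" and exclamations as tokens
    tokens = re.findall(r"[:;][()dp]|[\w']+|[!?]", text)
    # Drop stop words and expand acronyms to their true meanings
    return [ACRONYMS.get(t, t) for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("The movie is gr8 :( see http://example.com !")` keeps the emoticon and exclamation mark while dropping the URL and stop words.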

4.2 Feature Extraction and Feature Selection

After preprocessing, various feature extraction methods, such as TF-IDF and BOW,
are applied to the data. These feature extraction techniques create the feature
vector of the data. The following features are included:
• Unigram and bigram features, taking one or two consecutive terms into account,
respectively.
• Bi-tagged features extracted from a POS tagging pattern in which one word
contains either an adjective or an adverb [38, 39].
• The total count of positive expressions includes positive words, positive emoti-
cons, and positive exclamations [16].
• The total count of negative expressions includes negative words, negative
emoticon, and negative exclamations [16].
• The total count of neutral expressions includes neutral words, neutral emoticons,
and neutral exclamations [16].
• The emoticon lexicon [40] is used to extract the sentiment associated with emoticons.
• SentiWordNet [41]
• In this stage, feature selection techniques such as the filter method [42] are used to
select the best possible set of features for constructing a machine learning model.
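Two of the feature types above can be sketched from scratch: TF-IDF vectors (using the common tf × log(N/df) variant; real implementations differ in smoothing) and the positive/negative expression counts. The tiny lexicons are illustrative stand-ins for the cited resources.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> one {term: tf-idf weight} dict per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()} for doc in docs]

POSITIVE = {"good", "great", "love", ":)"}   # illustrative lexicon entries
NEGATIVE = {"bad", "boring", "hate", ":("}

def expression_counts(tokens):
    """Counts of positive and negative expressions (words/emoticons)."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg
```

A term that appears in every document gets an IDF of zero, which is exactly why TF-IDF concentrates weight on both frequent and distinctive words, as noted for [3] above.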

4.3 Feature Training Using Various Machine Learning Classifiers

In the third stage, machine learning classifiers are trained on these selected features.
MNB, CNB, SVM-IRBF, ME, random forest (RF), and KNN classifiers are used for
sentiment classification.

4.4 Applying Ensemble Techniques and Evaluating the Performance

In the last stage, we apply ensemble techniques, in which multiple algorithms are
combined for classification, and evaluate the performance using accuracy, precision,
recall, and F1-score.

4.4.1 Majority Voting

This technique is based on a maximum voting procedure and is used to solve
classification problems: each classifier votes for a class label, and the label with
the most votes is selected.

4.4.2 AdaBoost (Adaptive Boosting)

AdaBoost is a boosting algorithm that combines weak classifiers into a strong one.
In this technique, the instance weights are updated in each round, with higher
weights assigned to wrongly classified instances. Algorithm 1 shows the detailed
steps of the proposed approach.
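The two ensemble steps can be sketched as follows: a majority vote over the base classifiers' predictions, and one AdaBoost round of instance re-weighting using the standard 0.5·ln((1−ε)/ε) learner weight from discrete AdaBoost (the classifier names in the usage are placeholders).

```python
import math
from collections import Counter

def majority_vote(predictions):
    """predictions: {classifier_name: predicted_label} for one sample."""
    return Counter(predictions.values()).most_common(1)[0][0]

def adaboost_reweight(weights, correct, eps=1e-12):
    """One AdaBoost round: raise the weights of misclassified instances.
    weights: current instance weights; correct: per-instance booleans."""
    error = sum(w for w, c in zip(weights, correct) if not c)
    error = min(max(error, eps), 1 - eps)
    alpha = 0.5 * math.log((1 - error) / error)   # weight of this weak learner
    updated = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
    z = sum(updated)                              # normalization constant
    return [w / z for w in updated], alpha
```

After one round the misclassified instances carry half of the total weight, which forces the next weak learner to focus on them.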

5 Illustrative Example

The examples for sentiment analysis are given in Table 3.

Table 3 Illustrative examples for sentiment analysis

Sentence No. | Sentence | Sentiment polarity
Sentence 1 | Nice and cozy restaurant with super friendly staff. They made delicious pizza and great mojito. Loved it! | Positive
Sentence 2 | The movie was bad and tedious :( | Negative
Sentence 3 | @AmericanAir your message was delayed. I just responded | Neutral
Sentence 4 | Your dress looks remarkable today, Mrs. Christie | Positive
Sentence 5 | Zoya is very aggressive—Every time you lose in a game, you start kicking things | Negative

6 Conclusion

In this paper, we described techniques of sentiment analysis based on supervised
and unsupervised learning approaches in detail. The main motive of this paper is
to explore the methods used in existing studies along with their pros and cons.
Moreover, this paper also briefly discusses applications of sentiment analysis. We
developed a sentiment analysis algorithm based on the findings and experimented
with several ensemble classification techniques, such as AdaBoost and majority
voting.

References

1. A. Kumar, T.M. Sebastian, Sentiment analysis: a perspective on its past, present and future. I.J.
Intell. Syst. Appl. 10, 1–14 (2012). Published online September 2012 in MECS
(http://www.mecs-press.org/). https://doi.org/10.5815/ijisa.2012.10.01
2. B. Seref, E. Bostanci, Sentiment analysis using Naive Bayes and complement Naive Bayes
classifier algorithms on Hadoop framework, in 2018 2nd International Symposium on Multi-
disciplinary Studies and Innovative Technologies (ISMSIT) (Ankara, Turkey, 2018), pp. 1–7.
https://doi.org/10.1109/ISMSIT.2018.8567243
3. M. Abbas, K. Ali Memon, A. Aleem Jamali, Multinomial Naive Bayes classification model
for sentiment analysis. IJCSNS Int. J. Comput. Sci. Netw. Secur. 19(3), 62 (2019). https://doi.
org/10.13140/RG.2.2.30021.40169
4. R.A. Cahya, D. Adimanggala, A.A. Supianto, Deep feature weighting based on genetic algo-
rithm and Naïve Bayes for Twitter sentiment analysis, in 2019 International Conference on
Sustainable Information Engineering and Technology (SIET) (Lombok, Indonesia, 2019),
pp. 326–331. https://doi.org/10.1109/SIET48054.2019.8986107
5. A.P. Gopi, R.N.S. Jyothi, V.L. Narayana et al., Classification of tweets data based on polarity
using improved RBF kernel of SVM. Int. J. Inf. Technol. (2020). https://doi.org/10.1007/s41
870-019-00409-4
6. V. Kumar, B. Subba, A Tfidf Vectorizer and SVM based sentiment analysis framework for
text data corpus, in 2020 National Conference on Communications (NCC) (Kharagpur, India,
2020), pp. 1–6. https://doi.org/10.1109/NCC48643.2020.9056085
7. C. Jiang, Y. Li, L. Li, A. Liu, C. Liu, News readers’ sentiment analysis based on fused-
KNN algorithm, in 2019 4th International Conference on Computational Intelligence and
Applications (ICCIA) (Nanchang, China, 2019), pp. 21–29. https://doi.org/10.1109/ICCIA.
2019.00012
8. X. Xie, S. Ge, F. Hu et al., An improved algorithm for sentiment analysis based on maximum
entropy. Soft. Comput. 23, 599–611 (2019). https://doi.org/10.1007/s00500-017-2904-0
9. https://snap.stanford.edu/data/web-Movies.html. Accessed on 21 May 2021
10. http://qwone.com/~jason/20Newsgroups/. Accessed on 22 May 2021
11. Airline-twitter-sentiment, 2015. [Online]. Available: https://www.crowdflower.com/data/air
line-twitter-sentiment/. Accessed on 15 May 2021
12. C. Sindhu, D. Rajkakati, C. Shelukar, Context-based sentiment analysis on amazon product
customer feedback data, ed. by D. Hemanth, G. Vadivu, M. Sangeetha, V. Balas. Artificial
Intelligence Techniques for Advanced Computing Applications. Lecture Notes in Networks
and Systems, vol 130. (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-15-5329-
5_48
13. S. Brody, N. Elhadad, Restaurant review corpus (2009). http://people.dbmi.columbia.edu/noemie/ursa. Accessed on 18 May 2021
14. B. Pang, L. Lee, Movie review data (2002). http://www.cs.cornell.edu/people/pabo/movie-review-data. Accessed on 19 May 2021
15. K. Orkphol, W. Yang, Sentiment analysis on microblogging with K-means clustering and
artificial bee colony. Int. J. Comput. Intell. Appl. 18(03), 1950017 (2019). https://doi.org/10.
1142/S1469026819500172
16. H. Rehioui, A. Idrissi, New clustering algorithms for Twitter sentiment analysis. IEEE Syst. J.
1–8 (2020). https://doi.org/10.1109/JSYST.2019.2912759
17. M. Jafarzadegan, F. Safi-Esfahani, Z. Beheshti, Combining hierarchical clustering approaches
using the PCA method. Expert Syst. Appl. 137, 1–10 (2019). ISSN 0957-4174. https://doi.org/
10.1016/j.eswa.2019.06.064
18. H. Suresh, S. Gladston Raj, A fuzzy based hybrid hierarchical clustering model for twitter
sentiment analysis, ed. by J. Mandal, P. Dutta, S. Mukhopadhyay. Computational Intelligence,
Communications, and Business Analytics. CICBA 2017. Communications in Computer and
Information Science, vol 776 (Springer, Singapore, 2017). https://doi.org/10.1007/978-981-
10-6430-2_30
19. S. Vashishtha, S. Susan, Fuzzy rule based unsupervised sentiment analysis from social media
posts. Expert Syst. Appl. 112834 (2019). https://doi.org/10.1016/j.eswa.2019.112834
20. M. Trupthi, S. Pabboju, G. Narsimha, Possibilistic fuzzy C-means topic modelling for Twitter
sentiment analysis. Int. J. Intell. Eng. Syst. 11 (2018)
21. Sanders dataset, 2011. [Online]. Available: http://www.sananalytics.com/lab/twitter-sentim
ent/. Accessed on 23 May 2021
22. Nuclear Twitter Dataset, Retrieved Jan 31, (2019). From https://data.world/crowdflower/emo
tions-about-nuclear-energy
23. Apple Twitter Dataset, Retrieved Jan 31, (2019). From https://data.world/crowdflower/%20a
pple-twitter-sentiment. Accessed on 18 May 2021
24. Ankit, N. Saleena, An ensemble classification system for Twitter sentiment analysis. Proc.
Comput. Sci. 132, 937–946 (2018). ISSN 1877-0509. https://doi.org/10.1016/j.procs.2018.
05.109 (https://www.sciencedirect.com/science/article/pii/S187705091830841X)
25. SemEval (2017), Retrieved Apr 20, 2019 from http://alt.qcri.org/semeval2017/task4/
26. SemEval (2016), Retrieved Apr 20, 2019 from http://alt.qcri.org/semeval2016/task4/
27. SemEval (2015), Retrieved Apr 20, 2019 from http://alt.qcri.org/semeval2015/task10/
28. Twitter dataset, (2014). https://drive.google.com/file/d/0BwPSGZHAP_yoN2pZcVl1Qmp
1OEU/view?usp=sharing. Accessed on 17 May 2021
29. Testdata.manual.2009.06.14, (2015). http://help.sentiment140.com/for-students/. Accessed on
14 May 2021
30. Twitter-sanders-apple, (2015). http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/.
Accessed on 19 May 2021
31. M. Bordoloi, S.K. Biswas, E-commerce sentiment analysis using graph based approach, in
2017 International Conference on Inventive Computing and Informatics (ICICI) (Coimbatore,
India, 2017), pp. 570–575. https://doi.org/10.1109/ICICI.2017.8365197
32. A. Rahman, M.S. Hossen, Sentiment analysis on movie review data using machine learning
approach, in 2019 International Conference on Bangla Speech and Language Processing
(ICBSLP) (2019), pp. 1–4. https://doi.org/10.1109/ICBSLP47725.2019.201470
33. Z. Iqbal, M. Yadav, Implementation of supervised learning techniques for sentiment analysis
of customer Tweets on airline services. Int. J. Eng. Appl. Sci. Technol. 5(3), 351–357 (2020).
Published Online July 2020 in IJEAST. ISSN No. 2455-2143
34. R. Gupta, M. Chen, Sentiment analysis for stock price prediction, in 2020 IEEE Conference
on Multimedia Information Processing and Retrieval (MIPR), pp. 213–218 (2020). https://doi.
org/10.1109/MIPR49039.2020.00051
35. A. Srivastava, M. Jaiswal, A. Prasad, T.J. Siddiqui, Sentiment analysis: the effects of
demonetization in India using KNN & SVC (2020)
36. Stopwords dictionary, 2014. [Online]. Available: https://github.com/igorbrigadir/stopwords/
blob/master/en/t101_minimal.txt. Accessed on 24 Oct 2018
37. Acronyms dictionary, 2015. [Online]. Available: http://www.netlingo.com/acronyms.php. Accessed on 17 May 2021
38. P. Kalarani, S. Selva Brunda, Sentiment analysis by POS and joint sentiment topic features
using SVM and ANN. Soft. Comput. 23, 7067–7079 (2019). https://doi.org/10.1007/s00500-
018-3349-9
39. Yadav, D. Pandya, SentiReview: sentiment analysis based on text and emoticons, in 2017
International Conference on Innovative Mechanisms for Industry Applications (ICIMIA)
(Bangalore, 2017), pp. 467–472. https://doi.org/10.1109/ICIMIA.2017.7975659
40. M. Aman Ullah, S Marium, S. Begum, N. Dipa, An algorithm and method for sentiment analysis
using the text and emoticon. ICT Express 6 (2020). https://doi.org/10.1016/j.icte.2020.07.003
41. Kaur, G. Sikka, L.K. Awasthi, Sentiment analysis approach based on N-gram and KNN classi-
fier, in 2018 First International Conference on Secure Cyber Computing and Communication
(ICSCCC) (Jalandhar, India, 2018), pp. 1–4. https://doi.org/10.1109/ICSCCC.2018.8703350
42. A. Madasu, S. Elango, Efficient feature selection techniques for sentiment analysis. Multimed.
Tools Appl. 79, 6313–6335 (2020). https://doi.org/10.1007/s11042-019-08409-z
An Automated System for Facial Mask
Detection and Social Distancing
during COVID-19 Pandemic

Rutuja Rashinkar, Swapnil Rokade, Pragati Janjal, Gauri Pawar,


and Swati Shinde

Abstract Since December 2019, the world has been affected by a widely
spreading virus which we all call the coronavirus. This virus has spread all across
the globe, causing many severe health problems and deaths. COVID-19 is spread
when a healthy person comes in contact with the droplets generated when an infected
person coughs or sneezes. So, the WHO has suggested some precautionary measures
against the spread of this disease. These measures include wearing a mask in public,
maintaining social distancing, avoiding mass gatherings. To help reduce the virus’
spread, in this paper, we are proposing a system that detects unmasked people, iden-
tifies them, checks if social distancing is followed or not, and also provides a feature
of contact tracing. The proposed system consists of mainly two modules: face mask
detection and social distancing. There are two more modules which include face
recognition and contact tracing. We used two datasets for training our models: the
first, for detecting masks on faces, was an image dataset collected from GitHub
and Kaggle; the second, for face recognition, consisted of our own images taken for
training purposes. It is hoped that our model contributes toward reducing the spread
of this disease. Along with COVID-19, this model can
also help reduce the spread of similar communicable disease scenarios.

Keywords COVID-19 · Facial mask detection · Face recognition · Social


distancing · Contact tracing · MobileNet · FaceNet · YOLOv3

1 Introduction

Up to December 2019, we were living normally. But after December 2019, a deadly
virus emerged that changed our lives: the coronavirus, the most widespread virus.
It causes cough, fever, serious lung damage, and similar effects. It was
and continues to be so prevalent that the World Health Organization called it a
global pandemic. As per worldometers.info Web site, up to August 10, 2021, there
were approximately 203,456,760 cases of COVID-19 reported worldwide. Of those,

R. Rashinkar (B) · S. Rokade · P. Janjal · G. Pawar · S. Shinde


Pimpri Chinchwad College of Engineering, Pune, Maharashtra 411044, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 27
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_3

approximately 183,465,516 people have recovered, but unfortunately, approximately
4,320,088 people have died.
As the numbers show, huge damage was caused to human life as well as to the
economy, so the World Health Organization proposed some precautionary
measures to reduce these numbers. These measures include: wash hands often with
soap or use sanitizer, maintain a safe distance from those who are sneezing or
coughing, wear a mask in public environments, stay home if unwell. But, people
are not able to adapt to this new normal and, thus, fail in following these measures
against COVID-19.
To make sure that people are following at least the basic preventive measures
like wearing a mask in public and maintaining proper distance when around people,
we are proposing an automated system which would help in reducing the virus’
spread. This system mainly consists of two modules: face mask detection, to detect
masks on people's faces and recognize unmasked people; and social distancing, to
determine whether people are maintaining proper distance or not.
Along with this, we are also carrying out contact tracing, that is, tracing back
people who were in contact with the infected person.

2 Related Works

Since the outbreak of COVID-19, many researchers have studied its symptoms,
preventive measures, and related aspects. Many have developed models for various
purposes that help in controlling the spread of the virus.
Many systems have been proposed for mask detection. Rahman et al. [1] presented a
system for smart cities that was useful in reducing the spread of the virus; it
only determined whether people were masked or unmasked. The model proposed by
Rahman et al. in [1] had some issues, such as not being able to differentiate a
masked face from a face covered with hands.
Chavda et al. [2] used a two-stage face mask detector. The first stage used a
RetinaFace model, Jiang et al. [3], Ejaz et al. [4], for robust face detection.
The second stage involved training three different lightweight face mask
classifier models on the dataset; the NASNetMobile-based model was finally
selected for classifying masked and unmasked faces. Ge et al. in [5] used locally
linear embedding for detecting masked faces. They used the MAFA dataset for
training, which includes three types of locations (of the face, the eyes, and the
mask) along with five face orientations, namely left, right, front, left-front,
and right-front. In addition, various degrees of occlusion were taken into
consideration when training the model on masked faces.
There are numerous systems used for face recognition. In Yadav [6], the
MobileNetV2 architecture and computer vision are used to help maintain a safe
environment and monitor public places to ensure individuals' safety; along with
cameras for monitoring, a microcontroller is used. Masked face recognition is done
using the pretrained FaceNet model, Ejaz and Islam [7], and three datasets are
used to train the model in Ejaz and Islam [7].
A method for distance measurement using only a single target image was proposed
by Xu et al. [8]. The target image contains known measurements, from which an
image-pixel-to-real-distance ratio can be derived for distance calculation.
Distance information is obtained by integrating a moving-target detection method
built from a Gaussian mixture model (GMM) with the Hue-Saturation-Intensity (HSI)
color space. Punn et al. [9] used a formula based on the focal length of the
camera along with Euclidean distance to calculate the distance among people. They
proposed a deep learning approach for automating the monitoring of people for
social distancing using surveillance video, making use of the YOLOv3 architecture
to distinguish between human beings and the background.

3 Methodology

We propose a system in which people are monitored automatically. Initially, we
used a webcam to create this system; CCTV cameras can serve the same purpose. With
the help of these cameras, real-time video is captured and given as input to the
proposed system. The input video can be of any duration as per the requirements;
for testing purposes, we used a 2-minute real-time video input. The system then
detects, in each frame of the input video, whether a person is masked or unmasked
and whether social distancing is maintained. If any person is found without a
face mask, his/her face is recognized, the information is stored in a generated
Excel sheet for further action, and a notification is sent to that individual as
well as the respective authority.

3.1 Image Preprocessing

We require image processing, as proposed by Hire et al. [10], for detecting face
masks, recognizing a person's identity, and differentiating people from the
background. The first and most crucial step in our system is therefore image
preprocessing. For this, we take images from the live video that is provided as
input to our system; this video is captured using a webcam. The captured video is
processed as continuous frames, and these frames need to be preprocessed because
they contain a large amount of data that is not useful for our purpose. The frames
are in RGB color format and must be converted to the grayscale format, as stated
by Mane and Shinde [11]. After the transformation to grayscale, the frames no
longer carry the unnecessary information. They are then reshaped uniformly and
normalized to the range 0 to 1. Normalization makes the subsequent processing
faster, as the frames then contain only the helpful features.
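The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (the function name, the 64 x 64 target size, and the crude nearest-neighbour resize are not from the paper); a deployed system would typically delegate these steps to OpenCV:

```python
import numpy as np

LUMA = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 grayscale weights

def preprocess_frame(frame, size=(64, 64)):
    """RGB uint8 frame -> uniformly reshaped grayscale array scaled to [0, 1]."""
    gray = frame @ LUMA                                  # drop color information
    rows = np.linspace(0, gray.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size[1]).astype(int)
    resized = gray[np.ix_(rows, cols)]                   # nearest-neighbour resize
    return resized / 255.0                               # normalize to 0..1
```

Every frame of the captured video would pass through such a function before being handed to the detection models.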

3.2 Deep Learning Architecture

We have made use of a deep learning architecture. Deep learning algorithms learn
different types of features from the given input, Somvanshi et al. [12], Patil and
Shinde [13]. Important features such as edges, lines (vertical and horizontal),
and blobs are detected in the input image. The architecture is then used to
predict on previously unseen inputs or samples. Among the deep learning
architectures, we chose the convolutional neural network (CNN). CNNs are primarily
used for classification and recognition and give good accuracy. In addition, no
human supervision is needed when a CNN is used, because the architecture detects
the important features automatically.

3.2.1 Collection of the Datasets

Obtaining a dataset to train our models was crucial. We gathered the required
images from various sources such as GitHub, Kaggle, and Google Images [14]. The
image dataset downloaded for the face mask detection model is from Kaggle and
contains 3833 images, of which 1915 show masked faces and the remainder show
unmasked faces.
For face recognition, a dataset was made from images of our own faces. This was
done using a user interface that provides a button to capture images of the person
in front of the camera, along with an identity number and details such as name,
email ID, and contact number. After saving the face images, we trained our face
recognition model on them. For this purpose, we used the Haar cascade.

3.2.2 Architecture

As convolutional neural networks can automatically extract and detect important
features from images, we made use of this approach in our system. A CNN consists
of three layers: (a) the input layer, (b) the hidden layers, and (c) the output
layer, Patil and Shinde [13].
Here, in the first layer, that is, the input layer, images are given as input. The
second part consists of various hidden layers that extract the appropriate and
important features, Deshmukh and Shinde [15], from those images. The features
extracted by the hidden layers are used by multiple dense neural network layers.
The architecture contains three pairs of convolutional and maxpooling layers,
which helps reduce the spatial size of the representation as well as the number of
parameters, resulting in a simplified computation network.

Fig. 1 Block diagram of the proposed model

After this layer, a flatten layer is
applied, and this layer transforms the data into a one-dimensional array. Then, this
one-dimensional array is given to the dense network. This dense network layer learns
parameters that are helpful for the classification. This dense network layer consists
of a series of neurons which learns the nonlinear features. After this, a dropout layer
is applied which prevents overfitting by dropping some of the units.
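To make these layers concrete, the following minimal NumPy sketch shows what a maxpooling layer and a flatten layer compute (an illustration of the operations only, not the actual framework layers used in the system; the function names are ours):

```python
import numpy as np

def max_pool2d(x, k=2):
    """k x k max pooling: shrinks the spatial size, keeping the strongest activations."""
    h, w = (x.shape[0] // k) * k, (x.shape[1] // k) * k
    x = x[:h, :w]                      # drop edge rows/cols not divisible by k
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def flatten(x):
    """Turn a 2-D feature map into the 1-D array fed to the dense layers."""
    return x.reshape(-1)
```

For a 4 x 4 feature map, `max_pool2d` halves each spatial dimension to 2 x 2, and `flatten` then yields a one-dimensional vector for the dense network.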
For training our face mask detection model, we used MobileNetV2 with 53 layers.
We used a learning rate of 1e-4 so as to minimize the loss. Sigmoid and ReLU are
the activation functions used in this model, and in addition, we used the Adam
optimizer.
Figure 1 is the block diagram of the proposed model which will help in
understanding the flow of the system proposed in this paper.

3.3 Face Mask Detection Module

For detecting masked and unmasked people, we trained our model on the dataset
discussed above in Sect. 3.2.1. In this module, we use a CNN together with the
MobileNetV2 architecture. Compared with other architectures, MobileNetV2 models
are faster at the same accuracy. We gave the input image parameters to the
MobileNetV2 architecture and then set the following maxpooling layer and fully
connected layer the same as in a standard CNN. This is depicted in Fig. 2.
A pretrained face detector, FaceNet, is used for detecting the faces in the given
input image. We imported the necessary files that define the model architecture
and the pretrained weights for face detection.
The dataset of masked and unmasked images was taken from Kaggle, Google Images,
and open-source image libraries. Figures 3 and 4 show training images for the face
mask detection module.

Fig. 2 MobileNetV2 in our system

Fig. 3 Dataset for mask recognition

3.4 Face Recognition Module

To perform face recognition, we first need to detect whether the person is wearing
a face mask. If not, we have to establish that person's identity. In this module,
we use the Haar cascade algorithm for identifying faces. The dataset used for face
recognition consists of images of our own faces, captured in real time using a
webcam.
To recognize people, we assign an ID number to each person and capture 100 face
images per person; this count can be increased to improve the accuracy of the
model. Figure 5 shows the dataset of face images created to train the model.

Fig. 4 Dataset for unmask detection

Fig. 5 Training dataset image of face detection

For recognizing faces, we used the Haar cascade feature classifier, an object
detection algorithm that identifies faces in an image or a real-time video. It
detects faces as well as features such as eyes and lips; this is possible because
the algorithm uses line detection and edge detection features.

Fig. 6 Haar cascade line/edge detection using two-rectangle, three-rectangle, and four-rectangle
features

The line features shown in Fig. 6 help us recognize the people who are detected
unmasked. The "Haarcascade_frontalface_default.xml" file is imported into the
cascade classifier for detecting and locating faces in the input image. We used
the LBPH recognizer for training on the face images created earlier, training it
with the images and their respective IDs.
The model trained with the Haar cascade algorithm detects an unmasked person and
identifies his or her name with the help of a list of names corresponding to the
respective IDs. The identified person's information is stored, and a notification
is sent by email for further action. This is how we implemented face recognition
in our system.
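The rectangle features described above are what make Haar cascades fast: with an integral image, any rectangle sum, and hence any two-rectangle edge feature, costs a constant number of lookups. The sketch below is our own illustration of that idea with hypothetical function names, not code from the system:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border, so rectangle sums cost O(1)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] recovered from four integral-image lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_edge_feature(ii, r0, c0, h, w):
    """Right-half sum minus left-half sum: large values mark vertical edges."""
    left = rect_sum(ii, r0, c0, r0 + h, c0 + w // 2)
    right = rect_sum(ii, r0, c0 + w // 2, r0 + h, c0 + w)
    return right - left
```

A cascade evaluates many such features at many positions and scales, rejecting non-face windows early.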

3.5 Social Distancing

To verify the practice of social distancing among people at crowded places or
workplaces, we propose a mechanism that can detect whether people are maintaining
social distance. In this mechanism, we use an object detection approach, YOLOv3,
to detect each person as an object in the input image. The object detector
identifies all persons in the image and puts a bounding box around each of them,
Punn et al. [9]. We imported the three files required for the YOLOv3 setup; data
from the following three files is used to configure the object detector.
(i) “yolov3.weights” file contains pretrained weights of the neural network.
(ii) “yolov3.cfg” contains neural network model architecture.
(iii) “coco.names” has a list of 80 object classes that the model will be able to
detect.

By applying these pretrained weights and the YOLOv3 configuration to DarkNet, we
detect people in the image and frame a green bounding box around each person. We
compute the centroid of each bounding box for distance measurement and, using the
Euclidean distance formula, measure the distance between the centroids. Taking the
mean of the heights of the respective bounding boxes as the threshold, that is,
the minimum safe distance, we turn the bounding boxes red for people who are
closer than this threshold. The YOLOv3 model gives us the width, the height, and
the coordinates of the centroid of the bounding box formed around each detected
person. Since this information describes a two-dimensional shape, it is
straightforward to apply the Euclidean distance method to calculate the distance
between two persons.
Therefore, we are able to detect people in images and to calculate the distance
between their bounding boxes, which is effectively the distance between the
people. If any two persons come closer than the minimum safe distance, their
bounding boxes turn red, and we know that social distance is not maintained.
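The centroid-distance check described above can be sketched as follows. This is a minimal illustration with our own function name; the boxes are assumed to arrive from the detector as (centroid x, centroid y, width, height) tuples:

```python
import numpy as np
from itertools import combinations

def flag_violations(boxes):
    """boxes: list of (cx, cy, w, h) person detections.
    Returns the index pairs closer than the mean-height safe distance."""
    violations = set()
    for i, j in combinations(range(len(boxes)), 2):
        (xi, yi, _, hi), (xj, yj, _, hj) = boxes[i], boxes[j]
        dist = np.hypot(xi - xj, yi - yj)   # Euclidean distance between centroids
        threshold = (hi + hj) / 2           # mean box height = minimum safe distance
        if dist < threshold:
            violations.add((i, j))          # these boxes would be redrawn in red
    return violations
```

Pairs returned by `flag_violations` are exactly the ones whose bounding boxes would be switched from green to red.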

3.6 Contact Tracing

If a person is detected positive for COVID-19, it is necessary to trace all the
people who came in contact with the infected person in the 2–3 days before the
positive test. For this task, we need a list of people who were in contact with
the infected person. When creating the list, we keep in mind that wearing a mask
and maintaining appropriate distance from the infected individual decrease the
chances of a healthy person getting infected. The infected person cannot be
recognized automatically by our system, and thus we need to go through the list
manually; therefore, the people who are not following the norms and preventive
measures are the ones whose details must be in the file. To this end, we maintain
an Excel sheet that saves the information of people who were not wearing masks.
This list is helpful for contact tracing.
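The violation log can be sketched with the standard library's csv module as a stand-in for the actual Excel sheet; the column layout and function names here are our own assumptions:

```python
import csv
from datetime import datetime

def log_violation(path, name, email, violation):
    """Append one no-mask/distance violation row to the contact-tracing sheet."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now().isoformat(), name, email, violation])

def load_contacts(path):
    """Read the sheet back for manual contact tracing."""
    with open(path, newline="") as f:
        return list(csv.reader(f))
```

When someone later tests positive, the sheet is read back and the people listed around the relevant dates are contacted.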

4 Results

We assume that, before our system is used for monitoring people, it has been
trained on every person's face, and that we have each person's information, such
as an email ID for contacting them. This can be done using the user interface we
have provided (mentioned in the methodology, Sect. 3.2.1).

Fig. 7 Identified masked and unmasked faces

4.1 Face Mask Detection and Face Recognition Module

Figure 7 shows the results of our system when a person is masked and unmasked,
respectively.
In Fig. 7a, as the person has a mask on his face, our model detects him as masked
and draws a green bounding box with a "Mask" tag. Since we have to notify
authorities about a person's identity, we need to identify people who are not
following the preventive measures; this task is done in the face recognition
module. The same person appears without a mask in Fig. 7b and is recognized as
"Swapnil", because we trained our model on images of Swapnil's face with the name
tag "Swapnil". Thus, we can identify unmasked persons and notify them by email for
further action.

4.2 Social Distancing

In Fig. 8a, two persons are standing far away from each other. Our social
distancing module, which uses the YOLOv3 architecture, detects them, and the
distance between them is calculated using Euclidean distance. As the calculated
distance is greater than the threshold value, the bounding boxes are drawn in
green. In Fig. 8b, the same two persons are seen nearer to each other; the
distance between them is less than the threshold, and hence their bounding boxes
turn red, denoting that these two people are closer than necessary.

Fig. 8 a People at a safe distance, b people not at a safe distance

4.3 Contact Tracing

Both persons are identified, and their names are shown in Fig. 9. This is how
persons are identified and their information is maintained in the Excel sheet for
the purpose of contact tracing. The sheet stores information such as the name of
the person who is not following the government norms, along with his/her other
contact details.

Fig. 9 Face identification with social distancing check



5 Conclusion

In this paper, we proposed a model that uses MobileNetV2 and the Haar cascade
algorithm for face mask detection and face recognition, while social distancing
is monitored by camera using YOLOv3. In the COVID-19 situation, it is really
important to follow all the guidelines given by the government. For that purpose,
we developed this system to detect whether people are wearing masks and whether
they are maintaining social distance. People who are doing neither are detected,
identified, and reported to the respective authorities, and their information is
stored so as to keep track of them and to use the information in the future in
case a person tests positive for COVID-19. This system will be helpful in
reducing the spread of disease and can be used in private organizations, schools,
colleges, etc.

References

1. M.M. Rahman, M.M.H. Manik, M.M. Islam, S. Mahmud, J.-H. Kim, An automated system to limit COVID-19 using facial mask detection in smart city network, in 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5 (2020). https://doi.org/10.1109/IEMTRONICS51293.2020.9216386
2. A. Chavda, J. Dsouza, S. Badgujar, A. Damani, Multi-stage CNN architecture for face mask detection, in 2021 6th International Conference for Convergence in Technology (I2CT), pp. 1–8 (2021). https://doi.org/10.1109/I2CT51068.2021.9418207
3. M. Jiang, X. Fan, H. Yan, RetinaMask: a face mask detector (2020). [Online]. Available: http://arxiv.org/abs/2005.03950
4. M.S. Ejaz, M.R. Islam, M. Sifatullah, A. Sarker, Implementation of principal component analysis on masked and non-masked face recognition, in 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT) (Dhaka, Bangladesh, 2019), pp. 1–5. https://doi.org/10.1109/ICASERT.2019.8934543
5. S. Ge, J. Li, Q. Ye, Z. Luo, Detecting masked faces in the wild with LLE-CNNs, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI, 2017), pp. 426–434. https://doi.org/10.1109/CVPR.2017.53
6. S. Yadav, Deep learning based safe social distancing and face mask detection in public areas for COVID-19 safety guidelines adherence. Int. J. Res. Appl. Sci. Eng. Technol. 8, 1368–1375 (2020). https://doi.org/10.22214/ijraset.2020.30560
7. M.S. Ejaz, M.R. Islam, Masked face recognition using convolutional neural network, in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI), pp. 1–6 (2019). https://doi.org/10.1109/STI47673.2019.9068044
8. Z. Xu, L. Wang, J. Wang, A method for distance measurement of moving objects in a monocular image, in 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pp. 245–249 (2018). https://doi.org/10.1109/SIPROCESS.2018.8600495
9. N. Punn et al., Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and deep-sort techniques (2020). [Online]. Available: https://arxiv.org/abs/2005.01385
10. M. Hire, S. Shinde, Ant colony optimization based exudates segmentation in retinal fundus images and classification, in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (Pune, India, 2018), pp. 1–6. https://doi.org/10.1109/ICCUBEA.2018.8697727
11. S. Mane, S. Shinde, A method for melanoma skin cancer detection using dermoscopy images, in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (Pune, India, 2018), pp. 1–6. https://doi.org/10.1109/ICCUBEA.2018.8697804
12. M. Somvanshi, P. Chavan, S. Tambade, S.V. Shinde, A review of machine learning techniques using decision tree and support vector machine, in 2016 International Conference on Computing Communication Control and Automation (ICCUBEA) (Pune, 2016), pp. 1–7. https://doi.org/10.1109/ICCUBEA.2016.7860040
13. C. Patil, S. Shinde, Leaf detection by extracting leaf features with convolutional neural network (May 18, 2019), in Proceedings of International Conference on Communication and Information Processing (ICCIP) 2019. Available at SSRN: https://ssrn.com/abstract=3419766 or https://doi.org/10.2139/ssrn.3419766
14. Dataset: https://www.kaggle.com/andrewmvd/face-mask-detection, https://github.com/balajisrinivas/Face-Mask-Detection/tree/master/dataset. Accessed on 23 April 2021
15. S. Deshmukh, S. Shinde, Diagnosis of lung cancer using pruned fuzzy min-max neural network, in 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT) (Pune, 2016), pp. 398–402. https://doi.org/10.1109/ICACDOT.2016.7877616
Detection of Insider Threats Using Deep
Learning: A Review

P. Lavanya and V. S. Shankar Sriram

Abstract A massive number of cyberattacks exist on the Internet, among which the
insider threat is one of the most challenging malicious threats in cyberspace.
Identifying insiders (attackers) within an organization is a very difficult job,
and discriminating benign employees from insiders is crucial. Hence, automating
insider threat detection using machine learning and deep learning techniques
improves detection performance and helps in analyzing the characteristics of an
insider. Several learning models have been developed, of which deep learning
techniques are promising, as they offer high-quality results and do not require
feature engineering. Assorted deep learning techniques have been employed to
discriminate insiders from benign employees, and this review article surveys the
deep learning techniques presented so far in the literature for effective insider
threat detection. The performance of these deep learning techniques, their
discrimination ability, and the commonalities and differences among them, based on
the reported metrics, are summarized with the aim of providing a clear insight to
budding cybersecurity researchers.

Keywords Cybersecurity · Insider threat · Insider attack · Insiders · Insider threat detection approaches · Deep learning

1 Introduction

The exponential growth of security-based network applications is due to advanced
technology. An enormous number of techniques have been developed mainly to handle
threats and provide security to cyberspace [1]. Cyberspace is the interconnection
of frameworks such as systems, processors, and controllers, in which cybersecurity

P. Lavanya · V. S. Shankar Sriram (B)
School of Computing, Centre for Information Super Highway (CISH), SASTRA Deemed University, Thanjavur, India
e-mail: sriram@sastra.edu
P. Lavanya
e-mail: lavanya@sastra.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 41
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_4

protects the digitalized data and restricts unauthorized users from accessing the
network, using advanced technology to safeguard cyberspace. Increased
cybersecurity challenges lead to cyberattacks, which constitute a criminal offense
that harms the CIA (confidentiality, integrity, and availability) properties of a
system. Recently, there has been an abundant increase in cyberattacks due to the
frequent utilization of online activities. The three different forms of
cyberattack are the targeted attack, the untargeted attack, and the insider
attack. Among them, insider attacks are the most dangerous and very difficult to
detect, owing to the following reasons: insiders have authorized access yet
perform unauthorized activities; an illegal employee can insert malicious code
into the system, which disturbs the system's normal function; and insiders have a
practice of stealing intellectual property [2].
An insider threat is one of the malicious threats in which employees within an
organization become the attackers. The term insider is defined as a person with
authorized entitlements who has the power to access all confidential information
of the institution illegally [2, 3]. A technical report from the CERT division
states that a prolonged investigation covered more than 1500 insider threat cases
in both public and private industries [4]. Though the process of detecting insider
threats is complex, cybersecurity researchers have designed different approaches,
namely anomaly-based approaches, role-based access control, scenario-based
approaches, and psychological factors-based approaches [5], to safeguard the
digital environment. Most of the aforementioned approaches rely on deep learning
techniques, as they detect attack vectors automatically without human
intervention.
Among these techniques, deep learning has evolved in the detection process thanks
to its higher-level feature extraction for complex data [6, 7]. Its benefits are
that it provides the finest results for unstructured data and for well-labeled
data. Owing to these advantages, deep learning enables insider threat detection to
achieve a higher level of performance accuracy. This manuscript mainly focuses on
the study of insider threat detection approaches and the classification of the
different deep learning techniques used in them. The review manuscript is
organized as follows: Sect. 2 includes a detailed description of insider threats,
the types and categories of insiders, and the challenges of insider threats.
Section 3 describes the different types of insider threat detection approaches.
Section 4 deals with insider threat detection approaches based on deep learning
techniques. Section 5 discusses insider threat datasets, and Sect. 6 concludes the
paper.

2 Insider Threat

An insider threat is a security threat that originates inside the targeted
organization. This does not imply that the insider or attacker must be a current
worker or official of the organization; they could be a consultant, former worker,
business partner, or board member.

2.1 Insider Profiling

Malicious insider threats are classified into four types: sabotage, theft, fraud,
and espionage [8]. Sabotage is one of the most sophisticated attacks and always
results in considerable distress to people and organizations. These insiders are
normally dissatisfied employees with technical knowledge who have authorized
access; a well-known example of IT sabotage is website defacement. The second type
of insider profiling is theft: the process of stealing the intellectual assets of
a company, where insiders can access the data throughout the day and transfer it
to other private companies, for instance, an employee, technical or
non-technical, passing important information from one company to another. Fraud,
the third type, is the process of accessing a company's financial data illegally
through authorized privileges, for example, when a person steals something
valuable from the company. The last and final type is espionage, a kind of threat
in which systematically chosen corporate data is extracted; it also delivers
information about the benefits of insider planning [9].

2.2 Types of Insiders

Insiders are of three types, namely traitors, masqueraders, and unintentional
insiders. The types of insiders are pictured in Fig. 1.

2.3 Categories of Insiders

Insiders are classified into four categories depending on their freedom in
accessing the system within an organization: pure insider, insider affiliate,
insider associate, and outside affiliate (Table 1).

2.4 Categories of Insider Threat

Depending on the different types of insider threat, its categories are discussed
below: self-motivated insiders, recruited insiders, and planted insiders
(Table 2).

Fig. 1 Types of insiders

Table 1 Categories of insiders

Categories of insider Description
Pure insider A fully authorized user employed to access the entire data of an
organization. The data are highly sensitive and can be protected either logically
or physically, and the user has full rights to access them
Insider affiliate A person well known to the user who visits and accesses the
sensitive data of an organization with the user's credentials. Friends, relatives
of users, or the company's clients can be insider affiliates
Insider associate A person who physically accesses the company's credentials;
he/she does not work at that company but can be a part of it. No network access is
needed. They can be the company's shareholders or contractors
Outside affiliate A person who has no authorized access to organizational data
and is neither a part of nor known to the organization but uses indirect network
paths to access the company's credentials [10]

Table 2 Categories of insider threat

Categories of insider threats Description
Self-motivated insider threat The personal vengeance of an individual becomes the
threat: a person acts as an insider because of a grievance that drives them to
take revenge inside the organization; this kind of activity is called a
self-motivated insider threat. They are independent insiders, i.e., they do not
depend on a third party
Recruited insider threat A third party (other companies or shareholders) can
threaten and recruit users based on their financial needs, making an employee act
like an insider. Such a person becomes a malicious threat within the organization
due to pressure from the third party and is dependent; personal and financial
issues become the threat. The main contributor to a recruited insider threat is
the third party
Planted insider threat The attacking companies select someone to act as a threat
and train them before forwarding them to the targeted company. The insider is
trained and selected by the recruiter and takes time to make the target believe
them [11]. The foundation of the threat is very strong, so it is called a planted
insider threat

2.5 Challenges of Insider Threat

Some of the challenges faced by researchers in generating datasets and in
detecting insider threats are discussed below:
1. The process of detecting the insider threat is complex because the user and
the insider closely resemble each other
2. Insufficient real data and lack of ability to examine the encoded data
3. A steep rise in cost and dimensionality
4. Often the insider attack arises together with a knowledge attack

3 Insider Threat Detection Approaches

Insider threat detection approaches describe the techniques for identifying
insiders and define the uncertainty of a user being an insider. They are based
on attributes of the user or insider such as behavior, character, working
profile, and situation. Once insiders are determined from these attributes, it
becomes easier to identify the insider threat and prevent the insider attack.
The different types of insider threat detection approaches are described below.

3.1 Anomaly-Based Approach

This approach portrays a client's typical behavior against a reference model of
standard client behavior; any malicious conduct can then be recognized as a
deviation from that model. Two difficulties arise when modeling a client's
ordinary conduct: (1) there exist complex nonlinear relationships in the audit
data (e.g., the number of occasions a client signs onto a PC is not related to
the hours the client spends on the PC), and (2) there are hardly any labels
available in advance to mark the "good" and "bad" audit-data instances [12].

46 P. Lavanya and V. S. Shankar Sriram
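As a minimal sketch of the idea (not the authors' method), a reference model of a user's normal behavior can be as simple as the mean and standard deviation of an activity count, with anomalies flagged as large deviations:

```python
import statistics

def anomaly_flags(baseline, observed, threshold=3.0):
    """Flag observations that deviate from a user's normal-behavior baseline.

    A value is anomalous when it lies more than `threshold` standard
    deviations from the baseline mean (the reference model).
    """
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return [abs(x - mean) / std > threshold for x in observed]

# Baseline: a user's typical daily logon counts; the third new day deviates.
normal_days = [4, 5, 6, 5, 4, 5, 6, 5]
print(anomaly_flags(normal_days, [5, 6, 40]))  # → [False, False, True]
```

Real detectors must cope with the nonlinear relationships and missing labels noted above; this linear, univariate rule only illustrates the deviation-from-reference principle.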

3.2 Role-Based Access Control

An organization's information resources are typically protected with a set of
employee access rules. There exist three significant access control strategies:
discretionary access control (DAC), mandatory access control (MAC), and
role-based access control (RBAC). The National Institute of Standards and
Technology (NIST) has been prominently advancing the unification of RBAC
standards; its unified model has four incremental functional levels: flat,
hierarchical, constrained, and symmetric. RBAC upholds security administration
at an enterprise level instead of at a client-identity level [13]. It
guarantees the assignment of access rights depending on the job functions that
a client is allowed to perform within an organization [14].
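A hierarchical RBAC policy of the kind described above can be sketched as follows; the role and permission names here are hypothetical, for illustration only:

```python
# Hypothetical role and permission names, for illustration only.
ROLE_PERMISSIONS = {
    "employee": {"read_public_docs"},
    "analyst": {"read_reports"},
    "admin": {"manage_users"},
}
# Hierarchical RBAC: a senior role inherits the permissions of junior roles.
ROLE_INHERITS = {"analyst": ["employee"], "admin": ["analyst"]}

def permissions(role):
    """Collect a role's own permissions plus everything it inherits."""
    perms = set(ROLE_PERMISSIONS.get(role, set()))
    for parent in ROLE_INHERITS.get(role, []):
        perms |= permissions(parent)
    return perms

def is_authorized(role, permission):
    # Access is granted by job function (role), never by user identity.
    return permission in permissions(role)

print(is_authorized("admin", "read_public_docs"))  # → True
print(is_authorized("employee", "manage_users"))   # → False
```

Under such a policy, an access outside a user's role (the second check) is exactly the kind of event a role-based detector would flag.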

3.3 Scenario-Based Approach

The scenario-based approach is built on datasets and sets of user activities
constructed at different levels of scenarios; different types of scenario-based
threats are defined contingent on a sequence of malicious behaviors. Providing
precise definitions for this approach is still in progress. Basic scenarios in
the detection process of the CMU dataset are changes in a user's behavior;
avoiding removable drives during working time but using them after working
hours; an employee who is about to shift to another company using his/her
biometric credentials in the office to steal confidential information; and
issues in maintaining passwords. All these scenarios can use both deep learning
and machine learning techniques [15].
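One of the CMU scenarios above, removable-drive use outside working hours, can be expressed as a simple rule. The event fields and the working-hours window are assumptions for illustration:

```python
from datetime import time

WORK_START, WORK_END = time(9, 0), time(17, 0)  # assumed working hours

def after_hours_drive_use(event):
    """Scenario rule: removable-drive use outside working hours is suspicious."""
    return event["type"] == "usb_insert" and not (
        WORK_START <= event["time"] <= WORK_END
    )

events = [
    {"user": "u1", "type": "usb_insert", "time": time(10, 30)},
    {"user": "u1", "type": "usb_insert", "time": time(23, 15)},
]
print([after_hours_drive_use(e) for e in events])  # → [False, True]
```

In practice such hand-written rules are the labels or features that the learning techniques cited above are trained on, not the detector itself.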

3.4 Psychological Factors-Based Approach

In an organization, a transformation in the behavior or character of an
authorized user should be taken into consideration before the user becomes an
insider [16]. High-risk analysis of the user's psychological factors is based
on physical modifications such as emotions, brain-wave features, anxiety,
tension, and negative thoughts [17]. Neural network, deep learning, and
Bayesian algorithms have been deployed for treating these kinds of issues.
All the above-defined insider threat detection approaches are discussed with
deep learning techniques in the upcoming Sect. 4.

4 Insider Threat Detection Approaches Using Deep Learning

Deep learning comprises a cluster of renowned machine learning techniques such
as classification and prediction. It depends on an artificial neural network in
which various perceptron layers are used for data processing. It is used to
solve a variety of problems and offers a rise in the degree of accuracy. A DL
architecture is made up of several hidden layers, each constructed from an
enormous number of neurons, and is implemented in several applications such as
computer vision, speech recognition, and health informatics. Deep learning
algorithms are suitable for problems featuring a high level of complexity, with
several features and vast data [18, 19]. Many frameworks, such as TensorFlow,
MATLAB, Keras, and Caffe, are used to resolve complex data [20]. Detection
mechanisms using deep learning models achieve high performance but also contain
a huge number of parameters, which requires an enormous amount of labeled data
for training the models [21].

4.1 Role of Deep Learning Over Machine Learning for Insider Threat Detection

The prescribed machine learning techniques cannot handle anomalous information
because of its features such as high dimensionality, heterogeneity, complexity,
and sparsity. The structure of machine learning models used in the detection
process is narrow and shallow, relying on transforming raw features into a
limited set of higher-level abstractions. Such techniques are not adaptable for
detecting anomalous data and have minimal capacity to develop complex detection
systems or models. The benefit of deep learning techniques, meanwhile, lies in
their proficiency at spontaneously determining the features required for the
detection process. In cyberspace, the identification of user behavior is
uncertain and collecting information on a user's actions is very difficult;
deep learning makes this possible through its deep nonlinear segments and its
general-purpose learning technique. It plays a vital role in modeling anomalous
data, specifically for the detection process, where malicious activity can be
detected easily. The different types of deep learning algorithms that play an
important role in detecting insider threats, organized by detection approach,
are discussed below.

4.2 Convolutional Neural Network (CNN)

CNN-based insider threat detection methods are described here. For unauthorized
access under the role-based access control approach on a synthetic dataset, a
learning classifier system (LCS)-based CNN called CN-LCS optimizes the
feature-selection rules and performs the modeling; its classification produces
an accuracy of 92% [12, 22]. An anomalous-behavior approach based on the
mouse-clicking behavior of users in an organization used a CNN to convert mouse
activities into images, providing both feature extraction and abstraction in
the training phase; it classifies the images with 85% accuracy on an
open-source dataset [23]. Another type of CNN is the graph convolutional
network (GCN), used for detecting a malicious team under the anomalous-behavior
approach, where the GCN maps significant data from the spatial domain onto a
graph, with an accuracy of 92% on the CMU CERT dataset [24].
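The core operation a CNN applies to such behavioral data is convolution; a minimal 1-D version in plain Python (a toy sketch, not any of the cited models) shows how a small kernel responds to local patterns in a sequence:

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation), the core CNN operation."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A difference kernel responds at the jump in a user-activity sequence.
print(conv1d([1, 1, 1, 9, 9, 9], [-1, 1]))  # → [0, 0, 8, 0, 0]
```

A real CNN stacks many such learned kernels with nonlinearities and pooling; here the kernel is fixed by hand to make the response visible.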

4.3 Recurrent Neural Network (RNN)

An RNN takes sequential data as input and functions through feedback loops. It
takes the input in portions and addresses the problem with a highly flexible
deep learning concept [25]. The back-propagation process begins when an error
is encountered; the foremost issues of this process are vanishing and exploding
gradients. Changes in the character or behavioral activities of a user come
under the psychological factors-based approach; detecting them using deep
neural network (DNN) and recurrent neural network (RNN) techniques on the CERT
V6.2 dataset produces an accuracy of 93% [26].
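The vanishing and exploding gradient problem can be seen in a toy calculation: in back-propagation through time, a gradient contribution is scaled once per time step by the recurrent weight (ignoring the tanh derivative, which is at most 1), so its magnitude behaves roughly like |w| raised to the number of steps:

```python
def bptt_gradient_scale(recurrent_weight, steps):
    """Rough scale of a gradient propagated back through `steps` time steps:
    each step multiplies by the recurrent weight, so the magnitude behaves
    like abs(recurrent_weight) ** steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= recurrent_weight
    return abs(scale)

print(bptt_gradient_scale(0.5, 20))  # ~1e-6: the gradient vanishes
print(bptt_gradient_scale(1.5, 20))  # ~3e+3: the gradient explodes
```

This is exactly the limitation the gated architectures of the next two subsections were designed to mitigate.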

4.4 Long Short-Term Memory (LSTM)

The LSTM technique was proposed to overcome the limitations of long-term
dependencies. The conventional LSTM model has memory cells and gates that
decide when to forget the previous state and when to update the hidden state
[27, 28]. LSTM is the most widely employed technique for insider threat
detection. Under the psychological factors-based approach, an LSTM autoencoder
forecasts threat scenes from interleaved series of user actions, with an
accuracy of 92% on the CERT v6.2 dataset [27]. On the real-time, latest version
of the CERT dataset, an anomalous behavior-based approach uses LSTM to process
log series with two or more related records, extracting valuable data and
removing repeated records from the dataset; it produces 85% accuracy [28]. LSTM
and PCA are combined to obtain a high detection rate, with an event aggregator,
multiple attribute classifiers, and anomaly classifiers in a peer-to-peer
detection framework for an anomalous behavior-based approach, yielding 94%
accuracy on the CERT v6.2 dataset [29]. An LSTM is also used along with NLP to
handle user logs and complete the process; this role-based access control
approach produces 94% accuracy on the CMU CERT dataset [30].
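The gating mechanism described above can be sketched as a scalar LSTM cell; the weights are toy values chosen for illustration, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; the weights `w` are toy values."""
    f = sigmoid(w["f"] * x + w["uf"] * h_prev)          # forget gate
    i = sigmoid(w["i"] * x + w["ui"] * h_prev)          # input gate
    o = sigmoid(w["o"] * x + w["uo"] * h_prev)          # output gate
    c_tilde = math.tanh(w["c"] * x + w["uc"] * h_prev)  # candidate memory
    c = f * c_prev + i * c_tilde  # forget part of the old state, add the new
    h = o * math.tanh(c)          # expose part of the memory as output
    return h, c

w = {k: 0.5 for k in ("f", "i", "o", "c", "uf", "ui", "uo", "uc")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:  # e.g., a short sequence of user-action features
    h, c = lstm_step(x, h, c, w)
print(round(h, 4))  # the hidden state stays bounded in (-1, 1)
```

Because the memory cell `c` is updated additively through the forget gate rather than by repeated multiplication, gradients survive across long sequences, which is why LSTM handles long-term dependencies.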

4.5 Gated Recurrent Units (GRU)

The gated recurrent unit, a gated variant of the recurrent neural network,
resolves the limitations of the vanishing and exploding gradient problems. It
has two forms of gates, the update and reset gates, whose associated functions
manage the flow between the previous-state and current-state inputs [31]. GRU
is used to detect the psychological behavior of an employee, which comes under
the psychological factors-based approach of insider threat detection, using the
Enron email and tweets datasets, providing accuracies of 68% and 71%,
respectively [32].
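A scalar sketch of the two GRU gates managing the flow between the previous and current state (toy weights, illustration only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One step of a scalar GRU cell; the weights `w` are toy values."""
    z = sigmoid(w["z"] * x + w["uz"] * h_prev)  # update gate: how much to renew
    r = sigmoid(w["r"] * x + w["ur"] * h_prev)  # reset gate: how much past to keep
    h_tilde = math.tanh(w["h"] * x + w["uh"] * r * h_prev)  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # blend previous and candidate state

w = {k: 0.5 for k in ("z", "r", "h", "uz", "ur", "uh")}
h = 0.0
for x in [1.0, 0.5, -0.2]:
    h = gru_step(x, h, w)
print(round(h, 4))  # bounded state; note there is no separate memory cell
```

Compared with the LSTM cell, the GRU merges the memory cell into the hidden state and uses two gates instead of three, making it cheaper while retaining the additive state update.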

4.6 Deep Belief Network (DBN)

The output of a DBN is produced from both probabilities and unsupervised
learning; it contains two types of layers, directed and undirected, and is
composed of binary latent variables. An adaptive optimization deep belief
network (AODBN) technique was proposed for the role-based insider threat
detection approach; the DBN combines and regularizes the process logs by
absorbing both regular and irregular characteristics of malicious threats,
producing 98% accuracy on the CERT-IT dataset [33]. Insiders occur across many
different domains, and user behavior is difficult to detect; a DBN is used to
manage the audit logs, the hidden features are extracted, and a one-class
support vector machine (OC-SVM) trained on those features produces a training
accuracy of 88% on the CMU CERT r4.2 dataset [34].

4.7 Autoencoder

The generic autoencoder is an unsupervised feed-forward neural network model in
which the output vectors are identical to the given input vectors [34], which
are compressed and forwarded to the next layer. The three main components of an
autoencoder are the encoder, the activation, and the decoder [33]. The encoder
compresses the input vector into a latent-space representation using nonlinear
activation functions within the range [0, 1]. The decoder layer reconstructs
the input, resulting in higher- or lower-dimensional data [35]. Autoencoder
mechanisms have been deployed in insider threat detection under the role-based
access control approach, where the model is trained on audit classification
data to learn the normal activities of an employee; it generates a
reconstruction error used to distinguish whether a user is anomalous or not
between the defined and modified data [36].
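The reconstruction-error idea can be illustrated with a toy linear autoencoder whose weights are fixed by hand along the direction of "normal" activity (an assumption for the sketch); inputs off that direction reconstruct poorly and score as anomalous:

```python
def encode(v, w):
    # Encoder: compress a 2-D input to a 1-D latent code.
    return v[0] * w[0] + v[1] * w[1]

def decode(z, w):
    # Decoder: reconstruct the 2-D input from the latent code.
    return [z * w[0], z * w[1]]

def reconstruction_error(v, w):
    r = decode(encode(v, w), w)
    return (v[0] - r[0]) ** 2 + (v[1] - r[1]) ** 2

# Unit-norm weights aligned with "normal" activity (features move together).
w = [0.7071, 0.7071]
print(reconstruction_error([3.0, 3.0], w))   # ≈ 0: normal pattern
print(reconstruction_error([5.0, -5.0], w))  # large: anomalous pattern
```

A trained autoencoder learns such weights (through nonlinear layers) from normal audit data, then thresholds the reconstruction error to flag anomalous users.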

4.8 Hybrid Neural Network

The long short-term memory (LSTM) and convolutional neural network techniques
are combined in the anomalous-behavior approach of insider threat detection:
the LSTM processes known user actions, taking a portion of the sorted temporal
features, and the CNN classifies the feature matrices of user actions on the
public CMU CERT dataset, giving an accuracy of 94% [2]. Another method proposed
for detecting malicious activities under the anomalous-behavior approach
combines a neural temporal point process with a recurrent neural network, known
as hierarchical neural temporal point processes, using the CERT and UMD
Wikipedia datasets; it gives accuracies of 90% on CERT and 91% on UMD Wikipedia
[37]. A neural network-based graph embedding adaptively learns discriminative
embeddings from an account-device graph based on two basic shortcomings of
attackers; it was developed to prevent illegal access to users' accounts using
an Alipay mobile payment gateway dataset and yields an accuracy of 94% [38].
Thus, this study of the different deep learning techniques in insider threat
detection clearly shows that LSTM and CNN are more effective than all the other
deep learning techniques, because LSTM overcomes the limitations of long-term
dependencies and CNN handles high-dimensional features and large volumes of
data.
Fig. 2 Insider threat detection using deep learning architecture. CNN convolutional neural network,
LSTM long short-term memory, GRU gated recurrent unit, AE autoencoder, RNN recurrent neural
network, DBN deep belief network

The overview architecture of insider threat detection using deep learning,
shown in Fig. 2, describes the entire working process. When a dataset is
deployed, preprocessing techniques such as data cleaning and data normalization
are applied. Among the insider threat detection approaches and the different
deep learning techniques, researchers can select the approach and the method
based on the dataset. The dataset is then used for training and testing, with
performance evaluated by precision, recall, F-score, ROC, and accuracy with
respect to the detection of normal users and insiders.
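The listed evaluation metrics are computed from the confusion-matrix counts; for example, with hypothetical counts of caught insiders (true positives), false alarms (false positives), and missed insiders (false negatives):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)  # flagged users who really are insiders
    recall = tp / (tp + fn)     # insiders who were actually caught
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 40 insiders caught, 10 false alarms, 10 missed.
p, r, f = detection_metrics(tp=40, fp=10, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.8 0.8 0.8
```

Because insiders are rare, accuracy alone is misleading here; precision and recall (and the ROC curve over thresholds) are the metrics the surveyed papers emphasize.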


5 Datasets

Insider threat datasets fall into five different types based on two strategies,
malicious and benign. The malicious strategy exists in two forms: one violates
the rules of the organization through authorized user access, called a
traitor-based dataset, and the other involves illegal access to sensitive data,
known as a masquerader-based dataset. When these two forms of malicious
strategy are combined, the dataset is defined as miscellaneous malicious, as
shown in Table 3.
In the benign strategy, datasets are discriminated by whether malicious
activity was expressed by correspondents of the dataset or not. Substituted
masquerader and authentication-based datasets are the two types of benign
strategy. When a dataset sample comprises tags of external malicious activity,
it is said to be a substituted-masquerader dataset; the authentication-based
type covers user identification in a labeled dataset [39]. Apart from these
five types, insider threat datasets originate from both real-world and
laboratory settings.
The insider threat datasets discussed above are taken from many different
universities and laboratories. Among them, the most frequently used dataset is
CERT, and the second is ENRON. Comparing all the datasets, CERT plays an
important role in most proposed insider threat detection solutions because of
the characteristics of the insiders present in it. The involvement of the
different insider threat detection approaches in CERT makes it easy to detect
whether an insider's activity is anomaly-, role-, or scenario-based, resulting
in minimum time complexity. This article also discusses the performance metrics
used for the evaluation of deep learning models.
Figure 3 presents a survey-based statistical report of deep learning in insider
threat detection, describing the different deep learning techniques and their
performance metrics by year. The yearly review shows the fall and rise of each
technique, which helps researchers studying this domain.

Table 3 Survey on datasets used in the detection of insider threats

RUU (masquerader-based dataset): The "Are You You" (RUU) dataset has been taken
from 34 users, of whom 14 were masqueraders. It consists of host-based
user-profiling events such as the process of permitting and processing the file
system, archiving the windows, handling library packages dynamically while
loading them into the system, and the graphical user interface (GUI) of a
system [40, 41]

WUIL (masquerader-based dataset): The Windows-Users and-Intruder simulation
Logs (WUIL) dataset includes additional authentic masquerader challenges. It
contains data on system user actions and file-system handling practices [21].
It consists of reports from 20 users actively examined at varying time
intervals within their daily routines; a particular user may deliver their
logging activity within an hour, while others may deliver it over several
weeks. All these data are collected using the interval tool for file auditing
in the Windows system. Based on batch scripts, the masquerader sessions have
three different stages of users: basic, intermediate, and advanced [42]

DARPA (masquerader-based dataset): The DARPA intrusion detection system (IDS)
evaluation dataset was developed by the MIT Lincoln Laboratory. The major
purpose of this dataset was to analyze and improve IDS [43]. For the insider
threat detection issue, the dataset comprises network traceability and the
captured system logging activity present in affected machines. The attacks are
classified into four categories: denial of service (DOS), remote to user, user
to root, and surveillance. Of these categories, the most interesting from the
insider threat view is user to root, where the masquerader exists [44–46]

ENRON (traitor-based dataset): ENRON is an email dataset. It was delivered by
the Federal Energy Regulatory Commission through its examination and faced an
enormous number of reliability issues, which were fixed by the CALO project (a
cognitive assistant that learns and organizes). It holds a wide range of
emails, individual and official; overall, 500,000 emails were taken from 150
users [43]. It has essential information used to examine email text and social
media analysis concentrated on insider threat detection when traitors exist
[47]

APEX (traitor-based dataset): This dataset was given by NIST (the National
Institute of Standards and Technology). The main aim of APEX is to simulate the
specialist mission in an intelligence team. In research, it has five experts as
malicious users and eight as benign users, which makes detection of an insider
complicated [48]

CERT (miscellaneous malicious dataset): CERT is a computer emergency response
team for major computer security incidents in its constituency. It provides a
group of synthetic datasets for insider threats, available at the CERT insider
threat center, which show the characteristics of insider threats [49]. The
common attributes of the CERT datasets are HTTP, email, login activity, files,
and devices, including identity, user information, date, and PC [1]

TWOS (miscellaneous malicious dataset): TWOS, "The Wolf of SUTD", was brought
out by the Singapore University of Technology and Design in 2017. It includes
six different types of data: keystroke, mouse, host monitor, network traffic,
SMTP logs (email), and logon. It is a collection of real-time data containing
simple conversations between users and a host-based system, with authorized
user information and some threat cases [40]

Schonlau (substituted masquerader dataset): It was given by Schonlau in the
year 2001 and has 15,000 Unix commands for each user, covering 50 users of Unix
system logs with different types of performance within organizations. Here, the
masquerader data are combined from varying unknown users who, at the same time,
do not have any malicious intent. Its features are time, user, process,
registry, and file access [50]

Greenberg (authentication-based dataset): It is composed of Unix csh (C shell)
command-line entries from 168 users and holds the related information of the
executed Unix commands. Based on the users' programming knowledge and skills,
the dataset is divided into four sets: (1) novice programmers, (2) experienced
programmers, (3) computer scientists, and (4) non-programmers. The features of
Greenberg's dataset are the start and end times of a particular session, the
history, and its error updates. The dataset is given in two layouts: plain and
enriched [51, 52]

Purdue University (authentication-based dataset): This dataset was given by
Lane and Brodley in 1997. It includes the Unix shell entries of eight users
over two years. It is an enriched dataset consisting of Unix command names and
arguments

MITRE OWL (authentication-based dataset): MITRE provides an organization-wide
learning (OWL) dataset. It is purely based on data that gives a statistical
representation of specific user feedback and training of the users. It includes
the logs of 24 Mac system users working in the area of artificial intelligence
and technology. This dataset is very flexible for graphical user interaction
(GUI)-based applications used for user authentication [51]

LANL (authentication-based dataset): The LANL dataset was taken from Los Alamos
National Laboratory in the year 2015. It has 12,425 users' system, process,
network, and domain name server (DNS) data, along with red-team logs [52]

Fig. 3 A review report of deep learning in insider threat detection (a
statistical chart: the x-axis lists the deep learning techniques CNN, RNN, DBN,
LSTM, GRU, and AE; the y-axis shows performance metrics in percentage, from 0
to 100; the bars are labeled with publication years from 2016 to 2020)

6 Conclusion

Insider threat is one of the most challenging threats in cybersecurity, since
neither insider threats nor the presence of insiders can be identified easily
within an infrastructure, resulting in an enormous amount of loss (financial
and of confidential data) in many organizations. Different types of deep
learning-based insider threat detection approaches have been developed to
detect insider threats and insiders. This review article presents the
categories of insider threat and the types of insiders, gives a glimpse of the
insider threat detection approaches implemented so far, and provides a deep
insight into the insider threat detection approaches deployed based on deep
learning techniques. This study helps cybersecurity researchers to understand
the importance and working of deep learning techniques in insider threat
detection.

Acknowledgements This work was supported by The Department of Science and Technology-
Interdisciplinary Cyber-Physical System (T-615).

References

1. M.R.G. Raman, N. Somu, K. Kirthivasan, V.S. Shankar Sriram, A hypergraph and arithmetic
residue-based probabilistic neural network for classification in intrusion detection systems.
Neural Netw. 92, 89–97 (2017)
2. F. Yuan, Y. Cao, Y. Shang, Y. Liu, J. Tan, B. Fang, Insider threat detection with deep neural
network, in International Conference on Computational Science (Springer, Cham, 2018),
pp. 43–54
3. R. Chinchani, D. Ha, A. Iyer, H.Q. Ngo, S. Upadhyaya, Insider threat assessment: model,
analysis and tool, in Network security (Springer, Boston, MA, 2010), pp. 143–174
4. Y. Wu, D. Wei, J. Feng, Network attacks detection methods based on deep learning techniques:
a survey. Secur. Commun. Netw. 2020 (2020). https://doi.org/10.1155/2020/8872923

5. A. Sanzgiri, D. Dasgupta, Classification of insider threat detection techniques, in Proceedings
of the 11th Annual Cyber and Information Security Research Conference, pp. 1–4 (2016)
6. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015)
7. S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical
precision, in International Conference on Machine Learning (PMLR, 2015), pp. 1737–1746
8. S. Seo, D. Kim, Study on inside threats based on analytic hierarchy process. Symmetry 12(8),
1255 (2020)
9. M.N. Al-Mhiqani, R. Ahmad, W. Yassin, A. Hassan, Z.Z. Abidin, N.S. Ali, K. Hameed
Abdulkareem, Cyber-security incidents: a review cases in cyber-physical systems. Int. J. Adv.
Comput. Sci. App. 9(1), 499–508 (2018)
10. T. Gunasekhar, K.T. Rao, M.T. Basu, Understanding insider attack problem and scope in cloud,
in 2015 International Conference on Circuits, Power and Computing Technologies, pp. 1–6
(2015)
11. E. Cole, S. Ring, Insider Threat: Protecting the Enterprise from Sabotage, Spying, and Theft
(Elsevier, 2005). ISBN: 9780080489056
12. S.-J. Bu, S.-B. Cho, A convolutional neural-based learning classifier system for detecting
database intrusion via insider attack. Inf. Sci. 512, 123–136 (2020)
13. C.D. McDermott, F. Majdani, A.V. Petrovski, Botnet detection in the internet of things using
deep learning approaches, in 2018 International Joint Conference on Neural Networks (IJCNN)
(IEEE, 2018), pp. 1–8
14. I. Saenko, I. Kotenko, Genetic algorithms for solving problems of access control design and
reconfiguration in computer networks. ACM Trans. Internet Technol. (TOIT) 18(3), 1–21
(2018)
15. P. Chattopadhyay, L. Wang, Y.-P. Tan, Scenario-based insider threat detection from cyber
activities. IEEE Trans. Comput. Soc. Syst. 5(3), 660–675 (2018)
16. A. Almehmadi, Micro-movement behavior as an intention detection measurement for
preventing insider threats. IEEE Access 6, 40626–40637 (2018)
17. Y.-A. Suh, M.-S. Yim, High risk non-initiating insider identification based on EEG analysis
for enhancing nuclear security. Ann. Nucl. Energy 113, 308–318 (2018)
18. C. Vigneswaran, V.S. Shankar Sriram, Unsupervised bin-wise pre-training: a fusion of
information theory and hypergraph. Knowl. Based Syst. 195, 105650 (2020)
19. H.A. Glory, C. Vigneswaran, S.S. Jagtap, R. Shruthi, G. Hariharan, V.S. Shankar Sriram, AHW-
BGOA-DNN: a novel deep learning model for epileptic seizure detection. Neural Comput.
Appl. 1–29 (2020)
20. S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M.P. Reyes, M.-L. Shyu, S.-C. Chen, S.S.
Iyengar, A survey on deep learning: algorithms, techniques, and applications. ACM Comput.
Surv. 51(5), 1–36 (2018)
21. S. Mahdavifar, A.A. Ghorbani, Application of deep learning to cybersecurity: a survey.
Neurocomputing 347, 149–176 (2019)
22. D.S. Berman, A.L. Buczak, J.S. Chavis, C.L. Corbett, A survey of deep learning methods for
cyber security. Information 10(4), 122 (2019)
23. T. Hu, W. Niu, X. Zhang, X. Liu, J. Lu, Y. Liu, An insider threat detection approach based on
mouse dynamics and deep learning. Secur. Commun. Netw. 2019 (2019)
24. J. Jiang, J. Chen, T. Gu, K.-K. Raymond Choo, C. Liu, M. Yu, W. Huang, P. Mohapatra,
Anomaly detection with graph convolutional networks for insider threat and fraud detection, in
MILCOM 2019–2019 IEEE Military Communications Conference (MILCOM) (IEEE, 2019),
pp. 109–114
25. A. Tuor, S. Kaplan, B. Hutchinson, N. Nichols, S. Robinson, Deep learning for unsupervised
insider threat detection in structured cybersecurity data streams. arXiv:1710.00811 (2017)
26. P. Torres, C. Catania, S. Garcia, C.G. Garino, An analysis of recurrent neural networks for botnet
detection behaviour, in 2016 IEEE Biennial Congress of Argentina (ARGENCON) (IEEE,
2016), pp. 1–6
27. B. Sharma, P. Pokharel, B. Joshi, User behavior analytics for anomaly detection using LSTM
autoencoder-insider threat detection, in Proceedings of the 11th International Conference on
Advances in Information Technology, pp. 1–9 (2020)

28. J. Lu, R.K. Wong, Insider threat detection with long short-term memory, in Proceedings of the
Australasian Computer Science Week Multiconference, pp. 1–10 (2019)
29. F. Meng, F. Lou, Y. Fu, Z. Tian, Deep learning based attribute classification insider threat
detection for data security, in 2018 IEEE Third International Conference on Data Science in
Cyberspace (DSC) (IEEE, 2018), pp. 576–581
30. D. Zhang, Y. Zheng, Y. Wen, Y. Xu, J. Wang, Y. Yu, D. Meng, Role-based log analysis applying
deep learning for insider threat detection, in Proceedings of the 1st Workshop on Security-
Oriented Designs of Computer Architectures and Processors, pp. 18–20 (2018)
31. R. Dey, F.M. Salemt, Gate-variants of gated recurrent unit (GRU) neural networks, in 2017
IEEE 60th international midwest symposium on circuits and systems (MWSCAS) (IEEE, 2017),
pp. 1597–1600
32. C. Soh, Y. Sicheng, A. Narayanan, S. Duraisamy, L. Chen, Employee profiling via aspect-based
sentiment and network for insider threats detection. Expert Syst. Appl. 135, 351–361 (2019)
33. M. Yousefi-Azar, V. Varadharajan, L. Hamey, U. Tupakula, Autoencoder-based feature learning
for cybersecurity applications, in 2017 International Joint Conference on Neural Networks
(IJCNN) (IEEE, 2017), pp. 3854–3861
34. J. Zhang, Y. Chen, J. Ankang, Insider threat detection of adaptive optimization DBN for
behavior logs. Turk. J. Electr. Eng. Comput. Sci. 26(2), 792–802 (2018)
An Incisive Analysis of Advanced
Persistent Threat Detection Using
Machine Learning Techniques

M. K. Vishnu Priya and V. S. Shankar Sriram

Abstract The upsurge in security threats and cyber-attacks is due to the enormous growth of Internet-based services. One such multi-stage security threat, which is especially serious, hard to discover, and complicated, is the Advanced Persistent Threat (APT). Discovering an APT attack is a foremost challenge for the research community, as the attack vectors of APT persist for a long period. Persistent efforts by researchers in APT detection using machine learning models improve detection efficiency and provide a better understanding of the APT stages. This review article summarizes the various machine learning-based detection techniques presented so far in the literature to alleviate the impact of APT and guides interested researchers in designing a computationally attractive, reliable, and robust machine learning-based system for efficient APT detection.

Keywords Advanced Persistent Threat (APT) · Cyber threat · Machine learning (ML) · APT attack stages · Traditional detection methods

1 Introduction

Security concerns in Internet-based applications have grown rapidly with the development of technology. To protect industries and organizations from vulnerabilities, the defender needs technology-based visibility over its assets, which provides security within an infrastructure [1]. In recent years, most outbreaks in cyberspace have led to confidential data loss and substantial capital loss in both public and private organizations [2, 3]. These large-scale outbreaks, induced by cybercriminals (individuals or groups) armed with malware, are recognized as national threats. Nowadays, many different types of cyber-attacks

M. K. Vishnu Priya · V. S. Shankar Sriram (B)


School of Computing, Centre for Information Super Highway (CISH), SASTRA Deemed
University, Thanjavur, India
e-mail: sriram@it.sastra.edu
M. K. Vishnu Priya
e-mail: vishnupriya@sastra.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 59
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_5

have existed in network applications. Among these attacks, one of the most complex and customized is the Advanced Persistent Threat (APT).
APT is a targeted cyber-attack, characterized by its slow and steady attack process. It utilizes standard frameworks to conceal itself inside genuine network traffic [4, 5] and persists there for a long period. The APT attack process is more perplexing and has a significant impact when compared to other attacks [6]. It uses zero-day vulnerabilities, spear phishing, watering hole attacks, remote access control tools, and other advanced tools and techniques to harm the targeted systems and move laterally inside the network. After reaching the targeted network, APT attack groups steal sensitive information and monitor the activities of the system without being detected by intrusion detection systems and other traditional detection mechanisms [7]. Attackers generate network traffic inside the organization, which makes the detection process more challenging [8]. Many sectors are affected by APT, including industry, the military, healthcare, and education.
In practice, APT attacks exhibit unique characteristics such as abnormal behavior in client account maintenance, the occurrence of secondary-channel Trojans, irrelevant information groups, confused data streams, and deviations in outbound information. Several malware-analysis reports [9–11] are discussed here; among them, the FireEye reports [12, 13] provide abundant sources related to cybersecurity, presenting an overview of various APT attacks, suspected attributions, target sectors, attack vectors, and suspected countries. Cybersecurity professionals [14, 15] focus on understanding the features of APT, attempting to excavate hidden-state practices by utilizing the organization's network traffic, and accessing security log data using standard/traditional detection techniques. However, these methods failed to achieve high accuracy. To further improve APT detection accuracy, Machine Learning (ML) algorithms have been proposed by various researchers [16, 17]; they analyze the data and perform the APT detection process systematically. This manuscript focuses on a review of recent research into various machine learning techniques to prevent large-scale APT attacks.
The rest of the paper is structured as follows. Section 2 presents a detailed description of APT and its stages. Section 3 discusses the various datasets used for the evaluation of APT attacks. Section 4 focuses on the traditional detection and defense mechanisms proposed by various authors and presents the challenges in detecting APT attacks. Section 5 illustrates the importance of using ML to detect APT attacks, describes the architecture of the machine learning model, and summarizes some of the research articles related to APT detection using ML techniques. Section 6 concludes the review article.

2 Advanced Persistent Threat

Advanced Persistent Threat (APT) can be described as "a holistic, human-driven infiltration in which a meticulous group of attackers uses advanced tools and techniques to attack well-established sectors for the exfiltration and exploitation of

Table 1 How APT differs from normal malware attacks [18]

Attacker. Malware: hackers or a specific person. APT: sophisticated attackers, government-funded hackers, criminal groups.
Target. Malware: individual systems. APT: government organizations, financial sectors, IT industries, defense sectors, healthcare.
Purpose. Malware: financial gain, sensitive information. APT: exfiltration of confidential data.
Attack life cycle. Malware: the attack is terminated once it is detected. APT: persists in the network through many avenues even after being detected.
sensitive information from those sectors. The main feature of APT is that the attacker can persist in an organization over a very long period without being detected."
The characteristics of APT can be derived as follows:
Advanced—APT attackers are well-funded to access advanced tools and techniques.
The methods and vectors of attacks are customized into several stages based on their
targets.
Persistent—Attackers retain access to the targeted network, impersonate ordinary traffic to avoid detection by traditional detectors, and stay unobserved for a long period using deceptive techniques. APT attackers try to stay inside the targeted network without being explicitly noticed.
Threat—The goal of the attackers is to devastate and disrupt the organization through irreplaceable data loss, damaging the organization's impending economic growth.
The characteristics of APT largely overlap with those of other malware; however, some are quite different in their intent and impetus. To differentiate between ordinary malware and APT, Chen et al. [18] stated some distinctions, presented in Table 1.
The exploration of documented APT cases shows that there is no strong similarity between individual APT attacks; each is explicitly customized for its target. However, the various steps involved in the APT process are closely related to one another. The upcoming section elaborates on the stages of an APT attack in detail.

2.1 Stages of Advanced Persistent Threat

An Advanced Persistent Threat involves a multi-stage attack process that depends on the targeted environment. Most research considers the following six stages (Fig. 1): Reconnaissance, Delivery, Exploitation, Operational, Data collection, and Exfiltration, as these cover the loopholes most commonly exploited by attackers and follow the intrusion kill chain. Table 2 exemplifies the stages of APT used so far by attackers, which are seized upon for the detection process.

Fig. 1 APT Stages
The study of APT stages summarized in Table 2 should pave the way for upcoming researchers to gain broad knowledge of new techniques for APT detection based on its various stages and their respective functions.
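The staged structure summarized in Table 2 underlies many detection pipelines: per-host alerts are mapped onto an ordered kill chain and correlated into a campaign score. The sketch below is a hypothetical illustration only; the stage names follow the six-stage model of [15, 22], while the alert format and scoring rule are invented, not taken from any surveyed detector.

```python
from enum import IntEnum

class AptStage(IntEnum):
    """Six-stage APT kill chain (after the model of [15, 22])."""
    INTELLIGENCE_GATHERING = 1
    POINT_OF_ENTRY = 2
    C2_SERVER = 3
    LATERAL_MOVEMENT = 4
    DATA_DISCOVERY = 5
    DATA_EXFILTRATION = 6

def campaign_progress(alerts):
    """Fraction of kill-chain stages with at least one alert:
    a crude per-host indicator of how far a campaign has advanced."""
    seen = {alert["stage"] for alert in alerts}
    return len(seen) / len(AptStage)

# Hypothetical alert stream for one host.
alerts = [
    {"host": "pc-17", "stage": AptStage.POINT_OF_ENTRY},
    {"host": "pc-17", "stage": AptStage.C2_SERVER},
    {"host": "pc-17", "stage": AptStage.C2_SERVER},  # repeated C&C beaconing
]
print(campaign_progress(alerts))  # 2 of 6 distinct stages observed
```

A real correlator would additionally enforce stage ordering and time windows, since benign activity can trigger isolated stage alerts.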

3 APT Dataset—An Analysis

Creating and collecting a dependable benchmark dataset and training and testing on APT-labeled features are still very big challenges due to the vague nature of the available data. Most APT features are similar to those used for intrusion detection, so IDS datasets such as NSL-KDD, UNSW-NB15, NGIDS-DS, and CICIDS2017 have also been evaluated to detect traces of APT in a network environment. These existing approaches found APT attack signatures only moderately well, since detecting the advanced techniques used by the attackers is very hard. To counter these challenges, synthetic, semi-synthetic, and realistic datasets are used for the detection process, because such datasets are composed of attacks injected manually into the network traffic flow and therefore include the features of APT. The following section discusses various analyses of datasets related to APT detection.
Ghafir et al. [15] proposed a novel approach for threat detection using various ML methods; the dataset used for this process was a synthetic dataset generated from random and related alerts in the network environment, with eight important APT features. The model resulted in better threat prediction with a low false positive rate. Paul et al. [28] evaluated their proposed approach using a synthetic dataset created in a general organization network and correlated the evaluated results with sample data collected from several security fields.
An Incisive Analysis of Advanced Persistent Threat … 63

Table 2 Analysis on stages of APT


APT stages Description
3 Stages of APT [19]
• Initial compromise The initial phase of gaining access to the targeted
network environment using techniques like spear
phishing, watering hole attacks, and Internet-facing
services
• Lateral movement In this phase, the attackers collect legitimate credentials
to access other systems from the targeted network
• Command and control C2 protocol was used to control the compromised
systems of the targeted environments to exfiltrate the
sensitive information
4 Stages of APT [20]
• Initial intrusion Techniques like spear phishing emails were used to
attack the targeted network
• Command and control (C&C) C&C channel communication handled in the targeted
network between adversaries and compromised systems
• Lateral movement Collects more information using adversaries and
maintain closer contact to the internal host
• Attack achievement Steals critical information from the targeted network
4 Stages of APT [7]
• Information collection phase (Initial phase) Scanning, probing, and social
engineering techniques were used to invade the targeted
network
• Intrusion phase Phishing email techniques used by the attacker to get
access permission
• Latent expansion phase Expanding the connection to some other hosts in the
targeted network
• Information theft phase Transmitting confidential information from the host
network
5 Stages of APT [13, 21]
• Delivery Attackers deliver the attack vectors via spear phishing
emails to the victim system
• Exploit System applications are exploited using exploitable
techniques
• Installation Implementing malware into the targeted network
• C&C Communication to compromise
• Actions Extracting confidential data
6 Stages of APT [15, 22]
• Intelligence gathering Attack targets within the organization by the influence
of social media sites
• Point of entry The attackers start with spear phishing messages sent to targeted employees in the organization
• C&C server Starts compromising the targeted network by using
C&C protocol, establishing long-term access,
download additional malware executable
• Lateral movement Compromises additional systems to gain access and to
pave the way for vulnerable host
• Data discovery Obtain information about the internal Web servers and
hosts
• Data exfiltration Exfiltration sensitive information
6 Stages of APT [23]
• Reconnaissance and weaponization Information gathering for the attack process is done by OSINT and social engineering
• Delivery Two ways in delivering the exploits in the victim’s
system:
Direct way—using spear phishing, watering hole
techniques to invade into the system
Indirect way—compromises the trustworthy 3rd party
of the targeted network to exploit the attack
• Initial intrusion Exploiting vulnerabilities in Adobe, Excel, Internet
Explorer of the targeted network
• Command and control Attackers use social networking sites, Tor anonymity
network, remote access control to control the targeted
network and exploits the user’s behaviors
• Lateral movement This phase of the attack lasts for a long period for
establishing the attack vectors to completely
compromise the targeted network
• Data exfiltration Exfiltrating data to gain strategic advantage over the organization
7 Stages of APT [24, 25]
• Research Collects basic information, loopholes about the targeted
network
• Preparation Extracts definite information about the victim and test
the tools and techniques to exploit vulnerabilities
• Intrusion Possessing vulnerable emails into the targeted systems
• Conquering the network Once the victim systems have been compromised,
attackers include some additional malware to access
some specific host in the targeted network
• Hiding presence Conceals their presence to avoid the detection from
traditional methods
• Gathering data Steals some confidential information
• Maintaining access The main goal of the APT attackers was to maintain
access to the targeted network for a long period
8 Stages of APT [26]
• Initial reconnaissance Initial process
• Initial compromise Spear phishing email technique was used to penetrate
through the target network
• Establishing foothold Control over the target
• Privilege escalation Predominant usage of tools to gain credentials
information
• Internal reconnaissance Collects all the information about the target to initiate
the attack
• Lateral movement Access the system using legitimate credentials
• Maintain presence Stays in the network for a long time
• Complete mission Steals the information using C&C servers
11 Stages of APT [27]
• Initial access The initial phase of the attack
• Persistence More effort is put into persistence, aiming at long-lasting access to the organization
• Privilege escalation To install malware or establish persistence footholds
• Discovery Finding systems, users, and information important to the mission
• Lateral movement This stage amounts to moving across the organization toward the information important to the mission
• Collection Target network data are collected
• Exfiltration Exfiltration of personal data
• Execution One of the main requirements for attackers to execute
the malicious code in the victim system
• Defense evasion Bypassing the traditional detectors and stays
undetected for a long time
• Credential access This phase is the key role for the attackers to gain
access to victims systems using valid credentials
• C&C Gain control over the target network by establishing
communication with the various internal host

Micro et al. [29] analyzed high-dimensional network traffic flows constituting data records for five months, ranked the features using a ranking approach, and evaluated their proposed model on a dataset containing log data gathered from an enterprise network over three days. This dataset, released by the visual analytics community for a mini challenge, was injected with various kinds of attacks monitored by an intrusion detection approach to detect anomalous behavior.
Siddiqui et al. [30] proposed a fractal-dimension-based machine learning classification model, evaluated using pcap files obtained from the Contagio malware dataset and from DARPA scalable network monitoring program traffic, to classify normal and malicious behavioral features; the features related to the APT attack were categorized and labeled for further processing. Friedberg et al. [31] proposed an anomaly detection model to distinguish the actions and behaviors of APT, using network traffic data generated by Skopik et al. [32]; it divides the observed data into training and attack phases based on the anomalous behavior of the system. The datasets discussed above detect the signatures of APT, but advanced techniques to mitigate the attack are still lacking. Even though there is a surplus of benchmark datasets for the evaluation of advanced models, the generated synthetic datasets [4, 33] and the Contagio malware dataset [34] resulted in better performance when compared with other IDS-based datasets.
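The injection procedure described above can be made concrete in a few lines: benign background flows are combined with manually injected APT-like flows, shuffled, and split for training and testing. The feature fields and value ranges below are invented for illustration and do not come from any of the cited datasets.

```python
import random

random.seed(7)  # reproducible sketch

# Hypothetical flow records: (duration_s, bytes_out, n_connections, label)
# label 1 = injected APT-like flow, label 0 = benign background flow.
benign = [(random.uniform(0.1, 5.0), random.randint(200, 5_000),
           random.randint(1, 10), 0) for _ in range(90)]
injected = [(random.uniform(30.0, 600.0), random.randint(50_000, 500_000),
             random.randint(20, 80), 1) for _ in range(10)]

dataset = benign + injected      # semi-synthetic: attacks injected into benign traffic
random.shuffle(dataset)

split = int(0.8 * len(dataset))  # 80/20 train-test split
train, test = dataset[:split], dataset[split:]
print(len(train), len(test))     # 80 20
```

The 9:1 class imbalance is deliberate: APT flows are rare, which is one reason plain accuracy is a weak metric for this problem.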

4 Detection Mechanism of Advanced Persistent Threat

APT uses advanced tools and techniques to penetrate large enterprises such as industrial control systems, banking sectors, institutions, and nuclear power plants, which affects the security infrastructure completely [35]. Owing to its diverse features, APT can persist for a long period without being detected while it exfiltrates confidential information. To diminish these attacks, most researchers have focused on detecting APT using standard detection methods such as anomaly detection, rule-based detection, signature-based detection, traffic data analysis, and pattern recognition [36–38]. These existing detection methods were used to isolate affected systems and to extract threat features based on network traffic, data flow, the average size of the data packets, the time interval taken by a data packet during transmission, and so on; they are regarded as the most promising techniques and are widely used in intrusion detection systems and malware identification [9, 39, 40].

4.1 Detection Mechanism

APT attacks are discovered through their anomalous behavior, and most researchers have focused on anomaly detection [41]. For example, Niu et al. [11] proposed a model called Global Abnormal Forest (GAF) to detect APT features in C&C domains based on the DNS log data of mobile devices. To find unusual actors accessing a system and to achieve higher efficiency, Berrada et al. [42] demonstrated an anomaly-based detection algorithm on provenance traces. Extracting provenance data normally allows prediction of the anomalous behavior of a system and makes it easier to find traces of attackers seeking a pathway for the exfiltration of information. The proposed model recorded traces on four operating systems and detected APT traces using categorical anomaly detection, with effective results.
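The frequency-based intuition behind categorical anomaly detection on provenance traces, as in [42], can be illustrated with a minimal surprisal scorer: (process, action) pairs that are rare in the trace receive high anomaly scores. This is a toy sketch with invented events, not the algorithm of [42].

```python
import math
from collections import Counter

# Hypothetical provenance events: (process, action) pairs from one host trace.
events = [("browser", "open"), ("browser", "read")] * 40 + \
         [("editor", "write")] * 19 + \
         [("editor", "connect")]   # rare action: a possible exfiltration path

counts = Counter(events)
total = len(events)

def anomaly_score(event):
    """Surprisal (-log2 p) under the empirical distribution:
    rare (process, action) pairs score high."""
    return -math.log2(counts[event] / total)

most_anomalous = max(counts, key=anomaly_score)
print(most_anomalous)  # ('editor', 'connect')
```

A production system would smooth the estimates and score combinations of attributes, but the principle, flagging low-probability categorical values, is the same.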

Detection of APT using techniques other than traditional methods has also been reported. A hot-booting PHC-based APT detection scheme has been proposed for dynamic games; it (i) improves APT detection performance in dynamic games, with improved outcomes at a protection level of 18.1%, and (ii) increases the utility of the cloud by 8.8% compared with the Q-learning procedure [14].
A sandboxing execution method was proposed for advanced malware detection based on malware activities in a VM infrastructure; this method also accounts for sandbox-evasion techniques [43–45] when searching for APT presence. Two big-data techniques are used: (1) a large-scale distributed computing framework based on MapReduce absorbs the changes in the normal behavior of the victim at each stage of APT, and (2) Hadoop is used to expose possible targets based on an identified APT target [46, 47]. When an APT attack is active while information is transferred from source to destination, data loss occurs. To avoid this kind of data loss, Data Loss Prevention (DLP) has been proposed, which involves two operations, namely systematically collecting and analyzing data, and then obstructing data in real time [48–50].
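The two DLP operations, collecting and analyzing data and then obstructing anomalous transfers, can be sketched as a toy baseline-deviation rule. The baseline values and the 3-sigma threshold below are invented for illustration; production DLP systems inspect content and context, not just volume.

```python
import statistics

# Hypothetical per-host outbound transfer sizes (MB) from a baseline window.
baseline = [1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 1.1, 1.4, 2.0, 1.3]
mean = statistics.mean(baseline)     # 1.4
stdev = statistics.pstdev(baseline)  # ~0.41

def dlp_block(transfer_mb, k=3.0):
    """Obstruct (True) any outbound transfer more than k standard
    deviations above the baseline mean: a toy volume-based DLP rule."""
    return transfer_mb > mean + k * stdev

print(dlp_block(1.6))    # False: within the normal range
print(dlp_block(250.0))  # True: likely bulk exfiltration
```

Slow, low-volume exfiltration, which is typical of APT, deliberately stays under such thresholds, which is why volume rules are combined with the anomaly and ML approaches discussed in this survey.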

4.2 Defense Mechanism

A few APT researchers have pointed out detection solutions for malware in industrial control systems (ICS) suited to mitigating APT attacks. The solution can be accomplished using security tools such as vulnerability scanners, SIEM systems, IDS/IPS systems, or security orchestration devices [2, 22, 51, 52]. This knowledge base is easily upgradeable and can be incorporated into all Cisco intrusion detection systems [39].
However, there is no specific method for defending against an APT attack; the traditional methods have failed to find ways to characterize the threat, and their misclassifications of advanced malware attacks [10, 53] lead to detection methods with a high false positive rate that fail to handle large volumes of data [54].

4.3 Advanced Persistent Threat Detection—Challenges

The existing detection processes rely on detecting changes in network traffic and log data, examining network ports, checking for pattern matches, and so on, yet most malware goes undetected because of the advanced tools and techniques used to invade the system. The following are some of the important challenges discussed previously:
(i) lack of a benchmark dataset
(ii) high false positive rate
(iii) lack of high precision and real-time analysis
(iv) lack of intelligence about advanced malware
(v) unpredictable APT life stages

5 Why Machine Learning is Used for the Detection of APT

Machine learning methods can be utilized in different situations and applied to various cyberspace applications such as malware detection, intrusion detection, and insider threat detection [2, 55], for example, monitoring internal user systems by analyzing spam and phishing emails from unknown individuals or groups. Managing machine identities has become an acute security proficiency. ML models are trained to handle enormous amounts of real-time data within a certain period and require some degree of automation and compactness when used for the detection of threat models. Further, well-trained models achieve higher accuracy when adapted to other high-dimensional data [56–58]. Although several methodologies focus on the detection and analysis of APT attacks, a few shortcomings remain: maintaining the trade-off between false positives and false negatives, and correlating the alerts from various APT cycles to identify the exact APT scenario [59, 60]. Figure 2 explains the APT detection process carried out by an ML-trained model.
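The pipeline of Fig. 2 (extract features from traffic, train on labeled data, then classify unseen flows) can be sketched with a minimal nearest-centroid classifier standing in for the SVM and KNN models surveyed in this section. All feature values are invented; the point is only the train-then-classify flow.

```python
import math

# Invented labeled training features: (mean_packet_size, flow_duration_s, label).
# Long, low-volume flows mimic APT C&C beaconing; short bursty flows are benign.
train = [
    (500, 2.0, "benign"), (520, 1.5, "benign"), (480, 2.5, "benign"),
    (90, 300.0, "apt"),   (110, 420.0, "apt"),  (100, 360.0, "apt"),
]

def centroid(rows):
    return (sum(r[0] for r in rows) / len(rows),
            sum(r[1] for r in rows) / len(rows))

centroids = {label: centroid([r for r in train if r[2] == label])
             for label in {"benign", "apt"}}

def classify(mean_pkt, duration):
    """Assign the label of the nearest class centroid (Euclidean distance)."""
    return min(centroids,
               key=lambda c: math.dist(centroids[c], (mean_pkt, duration)))

print(classify(95, 380.0))  # long, low-volume flow: classified "apt"
print(classify(510, 1.8))   # short bursty flow: classified "benign"
```

Real detectors replace the centroid rule with an SVM, random forest, or neural network, but the architecture of Fig. 2 is unchanged: labeled features in, trained model out, alerts on classified traffic.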
Hence, the accurate detection of existing APT attacks with minimal time complexity remains an open challenge for researchers [16]. The necessity of using ML methods in the detection of APT is therefore that they (i) provide deep insights into the features of malware datasets, (ii) provide better classification accuracy and more flexibility than traditional detection methods [54], and (iii) yield predictions with a low false positive rate [25, 46, 61] while attaining early alerts of unfamiliar threats like APT. To state these concerns, the following table (Table 3)

Fig. 2 Architecture for machine learning-based detection module for APT



Table 3 APT detection using machine learning techniques

Chu et al. [39]. Techniques: support vector machine (SVM), Naïve Bayes, decision tree, multilayer perceptron (MLP), principal component analysis (PCA). Dataset: NSL-KDD. Metrics: R-squared score, mean squared error, root mean squared error. Merits: the SVM model exposed the highest recognition rate (97.22%).

Hasan et al. [63]. Techniques: a novel machine learning-based detection of APT using decision tree, SVM, KNN, and ensemble learning. Dataset: synthetic. Metrics: accuracy, precision, recall. Merits: 84.8% prediction accuracy with the SVM classifier; 81.8% TPR; 4.5% FPR.

Sharma et al. [30]. Techniques: DFA-AD. Dataset: semi-synthetic. Metrics: precision, recall, F-measure. Merits: higher efficiency; 98.5% detection rate; 0.024 FPR.

Siddiqui et al. [7]. Techniques: KNN, correlation and fractal anomaly detection. Dataset: synthetic (pcap files). Metrics: accuracy, precision, sensitivity, specificity, F-measure. Merits: higher classification rate; low false positive and false negative rates.

Bahtiyar et al. [64]. Techniques: logistic regression, Gaussian Naïve Bayes (GNB), decision tree (DT), ensemble learning models (random forest, LogitBoost). Dataset: synthetic. Metrics: accuracy, precision, recall, F1-measure, R-squared score. Merits: higher R-squared value.

Schindler and Timo [59]. Techniques: SVM, one-class SVM. Dataset: synthetic log data (SecLab). Metrics: accuracy. Merits: achieved 98.67% accuracy in detecting anomalies in log data.

Matsuda et al. [47]. Techniques: one-class SVM, isolation forest, local outlier factor. Dataset: log data. Metrics: precision, recall, accuracy. Merits: high precision rate.

Lamprakis et al. [65]. Techniques: random forest classifiers. Dataset: web request data. Metrics: precision, recall. Merits: high precision in predicting C&C traffic.

Moon et al. [66]. Techniques: DTB-IDS. Dataset: synthetic. Metrics: accuracy. Merits: 84.7% accuracy.

Adams et al. [62]. Techniques: neural networks, decision tree, one-class SVM, K-means clustering. Dataset: synthetic. Metrics: precision, recall, accuracy, MCC. Merits: more viable in approaching APT attacks.

Shang et al. [67]. Techniques: convolutional neural network, long short-term memory, PCA, SVM, decision tree, random forest, K-NN, Naïve Bayes, logistic regression. Dataset: Contagio malware database. Metrics: precision, recall, false alarm rate, F1-score. Merits: higher efficiency in detecting C&C channels of unknown APT attacks.

Tan et al. [34]. Techniques: entropy-based detection using SVM. Dataset: Contagio malware database. Metrics: precision, accuracy, recall, F1-score. Merits: reduces computation complexity; generates alerts on traffic data; efficient method.

Ghafir et al. [33]. Techniques: MLAPT. Dataset: synthetic. Metrics: accuracy, TPR, FPR. Merits: 84.8% prediction accuracy in the early stage of APT; low false positive rate.

elucidates APT detection using several machine learning algorithms. SVM performs well in classifying normal and APT signatures in high traffic flows. The features are labeled based on the APT attack, and the learning model classifies the malicious behavior [34, 47, 59, 62].
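All the performance metrics reported in Table 3 derive from the confusion matrix. For reference, the sketch below computes accuracy, precision, recall (TPR), FPR, and F1 for an invented set of predictions over ten flows.

```python
# Ground truth vs. predictions for 10 hypothetical flows (1 = APT, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3 true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1 false positive
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1 missed APT flow
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 5 true negatives

accuracy  = (tp + tn) / len(y_true)   # 0.8
precision = tp / (tp + fp)            # 0.75
recall    = tp / (tp + fn)            # 0.75 (TPR)
fpr       = fp / (fp + tn)            # ~0.167
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(fpr, 3), f1)
```

Note that on heavily imbalanced traffic, accuracy alone can look high even when recall on the rare APT class is poor, which is why most rows in Table 3 also report precision, recall, or FPR.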
The statistical report in Fig. 3 presents a detailed comparison of machine learning and other computational detection models based on their performance metrics.

Fig. 3 Statistical report on performance analysis of computational and machine learning techniques

6 Conclusion

An Advanced Persistent Threat is a sophisticated threat that uses cutting-edge techniques and persists for a long period. It causes excessive damage and high-level data breaches in cyberspace. At present, this targeted cyber threat has caused data loss in more than 150 countries around the world. Unlike normal attacks, the APT detection process is much harder: the lack of a standard dataset, the rapid development of attack strategies, and their chaotic nature lead to augmented challenges. The existing standard detection methods have tried to strengthen security mechanisms against APT but face the factors stated above as challenges. To overcome the limitations encountered so far, APT-focused research works implement the detection process using machine learning models to mitigate various security issues with better performance rates. Future work may rely on attack detection using advanced machine learning modules. Hence, this comprehensive review provides insights into APT, its attack strategies, lifecycle/stages, the challenges faced by traditional methods in detecting APT attacks, and the mitigation of APT attacks using ML models.

Acknowledgements This work was supported by the Department of Science and Technology-Interdisciplinary Cyber-Physical System (T-615).

References

1. D. Craigen, N. Diakun-Thibault, R. Purse, Defining cybersecurity. Technol. Innov. Manag. Rev. 4(10) (2014)

2. B. Stojanović, K. Hofer-Schmitz, U. Kleb, APT datasets and attack modeling for automated
detection methods: a review. Comput. Secur. 92, 101734 (2020)
3. Swisscom, Targeted Attacks Cyber Security Report 2019; Technical report (Swisscom
(Switzerland) Ltd. Group Security, Bern, 2019)
4. A. Alshamrani, S. Myneni, A. Chowdhary, D. Huang, A survey on advanced persistent threats:
techniques, solutions, challenges, and research opportunities. IEEE Commun. Surv. Tutorials
21(2), 1851–1877 (2019)
5. W. Niu, X. Zhang, G.W. Yang, J. Zhu, Z. Ren, Identifying APT malware domain based on
mobile DNS logging. Math. Probl. Eng. (2017)
6. CISCO Systems. CISCO: Protecting ICS with Industrial Signatures. https://www.cisco.com/
c/en/us/products/security/index.html. Accessed on 5 June 2021
7. Solid State System LLC, http://solidsystemsllc.com/advanced-persistent-threat-protection
Accessed on 24 Mar 2021
8. R. Zhang, Y. Huo, J. Liu, F. Weng, Constructing APT attack scenarios based on intrusion kill
chain and fuzzy clustering. Secur. Commun. Netw. 7536381 (2017)
9. Malware Capture Facility Project. http://mcfp.weebly.com. Accessed on 28 Aug 2021
10. Malware-Traffic-Analysis Blog. http://www.malware-traffic-analysis.net Accessed on 27 Aug
2021
11. T M technical report, Targeted attacks and how to defend against them, http://www.trendm
icro.co.uk/media/misc/targeted-attacks-and-how-to-defendagainst-them-en.pdf. Accessed on
9 July 2021
12. Fire eye Report, https://content.fireeye.com/apt-41/rpt-apt41/. Accessed 10 Jan 2021
13. Fire eye Report, https://www.fireeye.com/current-threats/apt-groups.html. Accessed 10 Jan
2021
14. Attivo Networks. BOTsink. https://attivonetworks.com/product/attivo-botsink. Accessed 12
Jan 2021.
15. I. Ghafir, V. Prenosil, Proposed approach for targeted attacks detection, in Advanced Computer
and Communication Engineering Technology (Springer, Cham, 2016), pp. 73–80
16. H.A. Glory, C. Vigneswaran, S.S. Jagtap, R. Shruthi, G. Hariharan, V.S. Shankar Sriram, AHW-
BGOA-DNN: a novel deep learning model for epileptic seizure detection. Neural Comput.
Appl. 1–29 (2020)
17. J. Vukalović, D. Delija, Advanced persistent threats-detection and defense, in 2015 38th
International Convention on Information and Communication Technology, Electronics and
Microelectronics (MIPRO) (IEEE, 2015), pp. 1324–1330
18. P. Chen, L. Desmet, C. Huygens, A study on advanced persistent threats, in IFIP International
Conference on Communications and Multimedia Security (Springer, Berlin, 2014), pp. 63–72
19. C. Vigneswaran, V.S. Shankar Sriram, Unsupervised bin-wise pre-training: a fusion of
information theory and hypergraph. Knowl. Based Syst. 195, 105650 (2020)
20. Guan, Z., L. Bian, T. Shang, J. Liu, When machine learning meets security issues: a survey,
in 2018 IEEE International Conference on Intelligence and Safety for Robotics (ISR). IEEE
(2018), pp. 158–165
21. P.K. Sharma, S.Y. Moon, D. Moon, J.H. Park, DFA-AD: a distributed framework architecture
for the detection of advanced persistent threats. Cluster Comput. 20(1), 597–609 (2017)
22. D. Moon, H. Im, I. Kim, J.H. Park, DTB-IDS: an intrusion detection system based on decision
tree using behavior analysis for preventing APT attacks. J. Supercomput. 73(7), 2881–2895
(2017)
23. M. Ussath, D. Jaeger, F. Cheng, C. Meinel, Advanced persistent threats: behind the scenes, in
2016 Annual Conference on Information Science and Systems (CISS) (IEEE, 2016), pp. 181–
186
24. E.M. Hutchins, J.C. Michael, R.M. Amin, Intelligence-driven computer network defense
informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues Inf.
Warfare Secur. Res. 1(1), 80 (2011)
25. Mandiant. The Advanced Persistent Threat. https://www.fireeye.com/content/dam/fireeye-
www/services/pdfs/mandiant-apt1-report.pdf. Accessed on 30 Mar 2021
An Incisive Analysis of Advanced Persistent Threat … 73

26. W. Tounsi, H. Rais, A survey on technical threat intelligence in the age of sophisticated cyber-
attacks. Comput. Secur. 72, 212–233 (2018)
27. Trend Micro, The Custom Defense Against Targeted Attacks. Technical report (Trend Micro,
Tokyo, 2013)
28. F. Skopik, G. Settanni, R. Fiedler, I. Friedberg, Semi-synthetic data set generation for security
software , in 2014 Twelfth Annual International Conference on Privacy, Security and Trust
(IEEE, 2014), pp. 156–163
29. W. Matsuda, M. Fujimoto, T. Mitsunaga, Detecting APT attacks against active directory using
machine leaning, in 2018 IEEE Conference on Application, Information and Network Security
(AINS). IEEE (2018), pp. 60–65
30. S. Singh, P.K. Sharma, S.Y. Moon, D. Moon, J.H. Park, A comprehensive study on APT attacks
and countermeasures for future networks and communications: challenges and solutions. J.
Supercomput. 75(8), 4543–4574 (2019)
31. A. Bohara, U. Thakore, W.H. Sanders, Intrusion detection in enterprise systems by combining
and clustering diverse monitor data, in Proceedings of the Symposium and Bootcamp on the
Science of Security (2016), pp. 7–16
32. I. Friedberg, F. Skopik, G. Settanni, R. Fiedler, Combating advanced persistent threats: from
network event correlation to incident detection. Comput. Secur. 48, 35–57 (2015)
33. I. Ghafir, M. Hammoudeh, V. Prenosil, L. Han, R. Hegarty, K. Rabie, F.J. Aparicio-Navarro,
Detection of advanced persistent threat using machine-learning correlation analysis. Futur.
Gener. Comput. Syst. 89, 349–359 (2018)
34. K. Krithivasan, S. Pravinraj, V.S. Shankar Sriram, Detection of cyberattacks in industrial control
systems using enhanced principal component analysis and hypergraph-based convolution
neural network (EPCA-HG-CNN). IEEE Trans. Ind. Appl. 56(4), 4394–4404 (2020)
35. M. Salem, M. Mohammed, Feasibility approach based on SecMonet framework to protect
networks from advanced persistent threat attacks, in International Conference on Emerging
Internetworking, Data & Web Technologies (Springer, Cham, 2019), pp. 333–343
36. R.P. Baksi, S.J. Upadhyaya, A comprehensive model for elucidating advanced persistent threats
(APT), in Proceedings of the International Conference on Security and Management (SAM)
(2018), pp. 245–251
37. G. Berrada, J. Cheney, S. Benabderrahmane, W. Maxwell, H. Mookherjee, A. Theriault, R.
Wright, A baseline for unsupervised advanced persistent threat detection in system-level
provenance. Futur. Gener. Comput. Syst. 108, 401–413 (2020)
38. T. Schindler, Anomaly detection in log data using graph databases and machine learning to
defend advanced persistent threats. arXiv preprint arXiv:1802.00259 (2018)
39. C. Wen-Lin, C.-J. Lin, K.-N. Chang, Detection and classification of advanced persistent threats
and attacks using the support vector machine. Appl. Sci. 9(21), 4579 (2019)
40. J. Tan, J. Wang, Detecting advanced persistent threats based on entropy and support vector
machine, in International Conference on Algorithms and Architectures for Parallel Processing
(Springer, Cham, 2018), pp. 153–165
41. D.X. Cho, H.H. Nam, A method of monitoring and detecting apt attacks based on unknown
domains. Procedia Comput. Sci. 150, 316–323 (2019)
42. P. Giura, W. Wang, Using large scale distributed computing to unveil advanced persistent
threats. Science 1(3), 93 (2013)
43. A. Singh, Z. Bu, Hot knives through butter: Evading file-based sandboxes. Threat Research
Blog. Accessed on 20 Apr 2021 (2013)
44. F.M. Al-Matarneh, Advanced persistent threats and its role in network security vulnerabilities.
Int. J. Adv. Res. Comput. Sci. 11(1) (2020)
45. J. Sexton, C. Storlie, B. Anderson, Subroutine based detection of APT malware. J. Comput.
Virol. Hacking Technol. 12(4), 225–233 (2016)
46. M. Marchetti, F. Pierazzi, M. Colajanni, A. Guido, Analysis of high volumes of network traffic
for advanced persistent threat detection. Comput. Netw. 109, 127–141 (2016)
47. T. Micro, Countering the advanced persistent threat challenge with deep discovery. Retrieved
10(10) (2013)
74 M. K. Vishnu Priya and V. S. S. Sriram

48. M.R.G. Raman, N. Somu, K. Kirthivasan, R. Liscano, V.S. Shankar Sriram, An efficient intru-
sion detection system based on hypergraph-genetic algorithm for parameter optimization and
feature selection in support vector machine. Knowl.-Based Syst. 134, 1–12 (2017)
49. J. Sexton, C. Storlie, J. Neil, Attack chain detection Statistical analysis and data mining. ASA
Data Sci. J. 8(5–6), 353–363 (2015)
50. F. Skopik, G. Settanni, R. Fiedler, A problem shared is a problem halved: a survey on the
dimensions of collective cyber defense through security information sharing. Comput. Secur.
60, 154–176 (2016)
51. AlertEnterprise. Sentry CyberSCADA. http://www.alertenterprise.com/products-EnterpriseSe
ntryCybersecuritySCADA.php. Accessed 12 Jan 2021
52. X. Wang, K. Zheng, X. Niu, B. Wu, C. Wu, Detection of command and control in advanced
persistent threat based on independent access, in 2016 IEEE International Conference on
Communications (ICC) (IEEE, 2016), pp. 1–6
53. O.I. Adelaiye, S. Aminat, S.A. Faki, Evaluating advanced persistent threats mitigation effects:
a review. Int. J. Inf. Secur. Sci. 7(4), 159–171 (2018)
54. M.Z. Rafique, P. Chen, C. Huygens, W. Joosen, Evolutionary algorithms for classification of
malware families through different network behaviors, in Proceedings of the 2014 Annual
Conference on Genetic and Evolutionary Computation (2014), pp. 1167–1174
55. L. Xiao, D. Xu, N.B. Mandayam, H. Vincent Poor, Attacker-centric view of a detection game
against advanced persistent threats. IEEE Trans. Mobile Comput. 17(11), 2512–2523 (2018)
56. M.A.M. Hasan, M. Nasser, S. Ahmad, K.I. Molla, Feature selection for intrusion detection
using random forest. J. Inf. Secur. 7(3), 129–140 (2016)
57. A.M. Lajevardi, M. Amini, A semantic-based correlation approach for detecting hybrid and
low-level APTs. Fut. Gener. Comput. Syst. 96, 64–88 (2019)
58. P. Giura, W. Wang, A context-based detection framework for advanced persistent threats, in
2012 International Conference on Cyber Security (IEEE, 2012), pp. 69–74
59. L. Shang, D. Guo, Y. Ji, Q. Li, Discovering unknown advanced persistent threat using shared
features mined by neural networks. Comput. Netw. 189,107937 (2021)
60. Y. Shi, G. Chen, J. Li, Malicious domain name detection based on extreme machine learning.
Neural Process. Lett. 48(3), 1347–1357 (2018)
61. M. Schmid, F. Hill, A.K. Ghosh, Protecting data from malicious software, in 18th Annual
Computer Security Applications Conference, 2002. Proceedings (IEEE, 2002), pp. 199–208
62. C. Adams, A.A. Tambay, D. Bissessar, R. Brien, J. Fan, M. Hezaveh, J. Zahed, Using machine
learning to detect APTs on a user workstation. Int. J. Sens. Netw. Data Commun. 8(2), (2019)
63. I. Jeun, Y. Lee, D.A. Won, A practical study on advanced persistent threats. Computer
applications for security. Control Syst. Eng. 144–152 (2012)
64. Ş. Bahtiyar, B.Y. Mehmet, C.Y. Altıniğne, A multi-dimensional machine learning approach to
predict advanced malware. Comput. Netw. 160, 118–129 (2019)
65. P. Lamprakis, R. Dargenio, D. Gugelmann, V. Lenders, M. Happe, L. Vanbever, Unsupervised
detection of APT C&C channels using web request graphs, in International Conference on
Detection of Intrusions and Malware, and Vulnerability Assessment (Springer, Cham, 2017),
pp. 366–387
66. C. Neasbitt, R. Perdisci, K. Li, T. Nelms, Clickminer: towards forensic reconstruction of user-
browser interactions from network traces, in Proceedings of the ACM CCS 2014 (ACM, 2014),
pp. 1244–1255
67. S. Siddiqui, M.S. Khan, K. Ferens, W. Kinsner, Detecting advanced persistent threats using
fractal dimension based machine learning classification, in Proceedings of the 2016 ACM on
International Workshop on Security and Privacy Analytics (2016), pp. 64–69
Intelligent Computing Systems
for Diagnosing Plant Diseases

Maitreya Sawai, Sameer More, Prasanna Nagardhane, Subodh Pandhare,


and Manjiri Ranjanikar

Abstract In this paper, various image processing approaches for identifying, evaluating, and classifying plant diseases are discussed. Various parts of a plant, such as seeds, stems, leaves, fruits, and flowers, can be used to assess the plant's health or to identify diseases on it. This paper focuses specifically on methods that use plant leaves to detect disease. The objective of the paper is to survey various approaches for detecting diseases and to classify them with convolutional neural networks using the concept of transfer learning. This knowledge will be useful for further exploration of methods to identify, detect, and quantify diseases irrespective of the plant, and for investigating image processing techniques applied to parts of the plant other than leaves.

Keywords Plant disease · Image processing

1 Introduction

Agriculture is the means to feed the world's rising population. Besides feeding the
world, plants contribute to reducing global warming. Different countries practice
agriculture with different methods, and the agriculture sector faces various challenges.
Traditional farming relies mostly on human judgment, and observing crops with
the naked eye is not as effective as modern techniques. Some plants do not show
major symptoms, and some show them only when it is too late. Powerful microscopes
are used to detect plant diseases in such cases; other cases involve parts of the
electromagnetic spectrum that are not visible to the naked eye. Modern techniques
such as digital image processing are efficient, achieve higher accuracy, and are
feasible. Most diseases can be revealed in the visible spectrum. Trained people can
assess the health of a crop, but their efficiency is not always high and their services
are costly. If samples collected in the field are damaged while being transported to
the laboratory for testing, visual rating becomes judgmental. Many such problems
are reduced through digital image processing. In this paper, we discuss various
architectures of convolutional neural networks. We apply the concept of transfer
learning and use different architectures to find a solution to our problem. We have
implemented this project with a graphical user interface built using Python Flask,
HTML, and CSS. We report the accuracy measures obtained with each of these
architectures, compare the outcomes, and conclude the paper by recommending the
architecture most appropriate for a particular given input.

M. Sawai (B) · S. More · P. Nagardhane · S. Pandhare · M. Ranjanikar
Pimpri Chinchwad College of Engineering, Pradhikaran, Nigdi, Pune 411044, India
M. Ranjanikar
e-mail: manjiri.ranjanikar@pccoepune.prg

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_6

2 Literature Review

Barbedo [1] reviewed various digital image processing approaches that detect,
quantify, and classify plant diseases from digital images in the visible range. Since
methods dealing with roots, seeds, and fruits need specialized techniques, the paper
explores only the diseases and symptoms occurring on stems and leaves, not on the
other parts of the plant. The survey tried to cover as many problems as possible.
Sladojevic et al. [2] developed "Deep neural networks-based recognition of
plant diseases by leaf image classification," which explored a new deep learning
approach for automatically detecting and categorizing the diseases of various plants
from images of their leaves shot with a phone camera. The developed model could
distinguish leaves from the background images and could differentiate between
thirteen different types of diseases and healthy leaves. The researchers created a
new plant disease image database of 30,000 images, produced by applying the
required transformations to the 3,000 original, publicly available images.
Toda et al. [3] discussed "How convolutional neural networks diagnose plant
disease." They assessed an array of neuron-wise and layer-wise visualization
methods applied to plant diseases diagnosed using convolutional neural networks
(CNN). Here, the CNN was trained on a publicly available dataset. It was found
that neural networks can capture the colors and textures specific to particular
diseases, which resembles human decision-making. Experimental outcomes signify
that straightforward methods, such as naive visualization of a hidden layer's output,
are not sufficient for visualizing plant diseases. Visualization features and semantic
dictionaries can be used to find the visual features required for disease identification.
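In Keras, the naive hidden-layer visualization evaluated above can be sketched with a probe model built over a trained CNN; this is a standard Keras pattern, but the layer name used here is an assumption for illustration, not from Toda et al.'s code.

```python
# Hedged sketch: extract one hidden layer's activations for visualization.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def hidden_activations(model, layer_name, batch):
    """Return the activations of the named hidden layer for an input batch.

    This is the 'naive' visualization signal discussed above; richer
    neuron-wise methods build on top of such activation maps.
    """
    probe = models.Model(inputs=model.input,
                         outputs=model.get_layer(layer_name).output)
    return probe.predict(batch, verbose=0)
```

Given any functional CNN, `hidden_activations(model, "conv1", images)` yields a feature-map tensor that can be rendered channel by channel.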
Mohanty et al. [4] highlighted the work of Krizhevsky et al. [5], which showed
that, instead of using outdated methods, it is practical to use supervised training of
a deep convolutional neural network for image classification problems with an
enormous number of classes. They trained a model on a huge number of
high-resolution plant leaf images using a deep convolutional neural network
architecture, to classify diseases the model had not come across.

There exist many methods to detect plant diseases. Some diseases do not
show symptoms that can be visually identified; in some cases, symptoms appear
only when it is too late. In such cases, they can be identified using microscopes.
Sometimes the signs appear in that part of the electromagnetic spectrum which
cannot be seen by humans; there, we can use remote sensing methods that produce
hyperspectral and multispectral images. Detailed information on this subject can be
found in "Plant disease severity estimated visually, by digital photography and
image analysis, and by hyperspectral imaging" by Bock et al. [6], "Recent advances
in sensing plant diseases for precision crop protection" by Mahlein et al. [7], and
"A review of advanced techniques for detecting plant diseases" by Sankaran
et al. [8].
Barbedo [1] explains that plant disease detection research has three stages: the
first is detection of the disease, the second is quantification of disease severity, and
the last is classification of the disease. Many diseases generate changes in the
visible spectrum, and most of the time the first assessment is made by trained
people. These people may judge correctly at times, but there are many drawbacks
to this approach. Bock et al. [6] listed a few observations:
• The trained people may get tired, which affects their concentration and reduces
accuracy.
• The assessments can vary from person to person.
• Standard area diagrams are required to aid the assessment.
• The people must undergo training and re-training to maintain quality.
• Assessments can be incorrect if pictures taken in the field are evaluated later in
the laboratory.
• The trained people can misinterpret the area of infection or the number and size
of lesions.
• Some plants or crops stretch over large areas, so monitoring becomes a tedious
and difficult task.

Hence, by using image processing techniques, we can avoid these disadvantages,
increase accuracy, and introduce uniformity into the disease identification process.

2.1 Disease Detection

Following are the disease detection methods:

2.1.1 Partial Classification

If a particular disease is to be recognized among numerous other diseases that may
also be present, then it is better to use partial classification, where candidate regions
are identified with respect to that disease, instead of classifying all diseases at once.
This is as per a method mentioned by Abdullah et al. [9].

2.1.2 Real-Time Monitoring

Here the application monitors the plant at all times and raises an alarm as soon as
a disease is detected. This is as mentioned in "Fall armyworm damaged maize plant
identification using digital images" by Sena et al. [10] and "Lettuce calcium
deficiency detection with machine vision computed plant features in controlled
environments" by Story et al. [11].

2.1.3 Neural Networks

Abdullah et al. [9] put forth a method to differentiate a disease called Corynespora
from various other diseases affecting the leaves of the rubber tree. The algorithm
does not use segmentation; instead, it applies principal component analysis to the
red, green, and blue values of low-resolution leaf images. The first two principal
components are then fed to a multi-layer perceptron (MLP) neural network with a
single hidden layer, whose output indicates whether the sample contains the disease
or not.
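The PCA-plus-MLP scheme above can be sketched with scikit-learn; the synthetic per-image mean RGB features and the network settings below are illustrative assumptions, not those of Abdullah et al.

```python
# Hedged sketch: PCA to two components, then a single-hidden-layer MLP.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
# Toy per-image mean RGB features: diseased leaves shifted toward brown.
healthy = rng.normal([0.25, 0.60, 0.25], 0.05, size=(40, 3))
diseased = rng.normal([0.55, 0.45, 0.20], 0.05, size=(40, 3))
X = np.vstack([healthy, diseased])
y = np.array([0] * 40 + [1] * 40)  # 0 = healthy, 1 = diseased

# Keep the first two principal components, then classify with an MLP
# having one hidden layer, mirroring the pipeline described above.
clf = make_pipeline(
    PCA(n_components=2),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0))
clf.fit(X, y)
```

On real data the RGB features would be extracted from low-resolution leaf images rather than sampled.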

2.1.4 Thresholding

Sena et al. [10] proposed this method to distinguish images of maize plants attacked
by fall armyworm from those of healthy plants. The algorithm has two parts: image
processing and image analysis. In the processing stage, the image is transformed to
greyscale, filtered, and thresholded to discard spurious elements. The image is then
divided into 12 portions, and portions whose area is less than 5% of the total area
are set aside. For each remaining portion, the number of diseased regions is counted;
if this count is above a threshold, the plant is considered diseased. On empirical
evaluation, this threshold value was set to ten.
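A simplified NumPy sketch of this thresholding scheme follows; the greyscale weights, the 3 x 4 block layout, and the darkness threshold are assumptions made for illustration, and Sena et al.'s exact processing differs.

```python
# Hedged sketch of threshold-based disease detection on an RGB image.
import numpy as np

def is_diseased(rgb, dark_thresh=0.35, count_thresh=10):
    """Flag a plant image as diseased if enough blocks contain dark spots.

    rgb: H x W x 3 array with values in [0, 1].
    """
    grey = rgb @ np.array([0.299, 0.587, 0.114])  # greyscale transform
    binary = grey < dark_thresh                    # threshold dark lesions
    h, w = binary.shape
    # Split into a 3 x 4 grid (12 portions) and count blocks with lesions.
    blocks = [binary[i * h // 3:(i + 1) * h // 3, j * w // 4:(j + 1) * w // 4]
              for i in range(3) for j in range(4)]
    lesion_blocks = sum(b.any() for b in blocks)
    return lesion_blocks >= count_thresh
```

The filtering step is omitted here; in practice a median or morphological filter would precede the threshold to discard spurious pixels.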

2.2 Quantification

Here the objective is to quantify, or measure, the severity of a particular disease.
This can be done in two ways:
• by measuring the area of the affected leaves, or
• by measuring how deeply the disease has affected the plant, with the help of
texture and color.
As described earlier, the manual methods have their previously mentioned
disadvantages.

Color analysis is a quantification method proposed by Boese et al. [12] in their
paper "Digital image analysis of Zostera marina leaf injury." It involves correctly
estimating the magnitude of leaf injury in eelgrass; such injury can be triggered by
micro-herbivory feeding, degenerative disease, or dryness. The first step is
unsupervised segmentation of the leaves into 8-10 classes, which an expert then
labels with the relevant injury possibilities. Next, the area of each injury is
measured, and this itself becomes the quantification. The authors conclude that
this method has some limitations but is still superior to other, more complicated
leaf-injury quantification methods.
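Area-based severity quantification of the kind described above reduces to a pixel fraction; the binary-mask encoding below is an assumption for illustration, not Boese et al.'s exact procedure.

```python
# Hedged sketch: severity as the injured fraction of the leaf area.
import numpy as np

def injury_severity(mask):
    """mask: 2-D array with 1 for injured leaf pixels, 0 for healthy ones.

    Returns the injured fraction of the total leaf area in [0, 1].
    """
    mask = np.asarray(mask)
    return float(mask.sum()) / mask.size
```

In practice the mask would come from the expert-labeled segmentation classes rather than being constructed directly.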

2.3 Classification

This is an extension of the detection methods, taken a step further: instead of
trying to discover a single disease among many, whichever diseases are present are
sorted into their respective classes and assigned their respective labels. The
classification techniques can be grouped according to the algorithmic strategy
deployed.
Some of the classification techniques are as follows:

2.3.1 Neural Networks

Albawi et al. [13] explained the fundamental concepts of neural networks and
demonstrated how a neural network works. The paper also states the parameters
that affect the efficiency of neural networks.

2.3.2 Support Vector Machines

Meunkaewjinda et al. [14] proposed a technique for identifying and classifying the
diseases affecting grapevines. The method makes use of various color notations
(HSI, L*a*b*, UVL, and YCbCr). A multi-layer perceptron neural network
separates the leaves from the background, and a color library is constructed using
an unsupervised, untrained self-organizing map. In each case, the number of
clusters to adopt is decided by a genetic algorithm. A support vector machine
(SVM) then separates the diseased and healthy parts. After some further
manipulations, the image is put through a multiclass SVM, which classifies the
sample into its respective class.
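The final multiclass SVM stage can be sketched with scikit-learn over simple mean-color features; the synthetic data and class labels below are assumptions for illustration, not from Meunkaewjinda et al.

```python
# Hedged sketch: a multiclass SVM over toy mean-RGB features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
centers = {0: [0.25, 0.60, 0.25],   # healthy green (hypothetical class)
           1: [0.60, 0.50, 0.20],   # rust-like lesion (hypothetical)
           2: [0.45, 0.25, 0.20]}   # scab-like lesion (hypothetical)
X = np.vstack([rng.normal(c, 0.04, size=(30, 3)) for c in centers.values()])
y = np.repeat(list(centers), 30)

# SVC handles the multiclass case internally via one-vs-one voting.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

Real pipelines would feed segmented, color-transformed leaf regions rather than raw mean colors.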

2.3.3 Fuzzy Classifier

It was put forth by Hairuddin et al. [15], who classify various nutritional
deficiencies of oil palm plants. In the first phase, the image is segmented based on
color. Once segmentation is finished, feature extraction takes place, and the
resulting color and texture features are given to a fuzzy classifier. The output does
not list the deficiencies present; rather, it gives recommendations on what should
be done based on the deficiency.
Murk et al. [16] proposed a model based on deep learning, which they named
the plant disease detector. The model detects various plant diseases from pictures
of their leaves. To increase the sample size, augmentation was applied, and a CNN
with multiple convolution and pooling layers was used. The PlantVillage dataset
was used to train the model, and 15% of its data, with images of healthy and
diseased plants, was used for testing. The testing accuracy achieved by the model
is 98.3%.

3 Requirements

Following are the requirements needed for implementation.


• Operating system: Windows/Linux
• Processor: Intel Core i3/i5/i7
• Anaconda Navigator
• TensorFlow
• scikit-learn
• Keras
• NumPy
• Flask
• HTML

4 Implementation

This section explains the implementation of the system.

4.1 Techniques

The following are the techniques used.



4.1.1 Convolutional Neural Networks (CNNs) and Their Application to Our Model

A convolutional neural network is a type of neural network designed to process
data with a grid-like topology, such as an image, and CNNs are widely used for
image classification problems. A CNN comprises many layers: the first is the input
layer, the last is the output layer, and between them lie multiple hidden layers.
Each layer comprises multiple neurons with learnable weights and biases. Each
neuron takes many inputs, computes a weighted sum over them, optionally passes
it through an activation function, and produces an output. We applied CNNs using
the transfer learning approach discussed in the next section, building on the CNN
architectures provided by Keras, which define the number of layers, the features,
the activation functions, and so on. The architectures we used are InceptionV3,
DenseNet201, and ResNet50 (the number on the right denotes the number of layers
in that architecture), with their "ImageNet" pretrained weights. The activation
function used in the final layer is softmax. We added input and output layers to
each architecture: the input layer has an output shape of (None, 224, 224, 3), where
the numbers denote the size and color configuration of the input image, and no
parameters are passed in this layer. The final output layer is a dense layer of shape
(None, 6), where "6" denotes the six prediction classes used for making the
prediction.

4.1.2 Transfer Learning

Deep convolutional neural networks take a lot of time to train; they might take
days or weeks on very large datasets. To save time, we can use transfer learning:
a technique that re-uses the weights of pretrained models developed for standard
computer vision datasets. In our model, we used the pretrained ImageNet weights.
Transfer learning applies a model trained for one problem to another similar
problem; the developer can add or remove layers as the new problem statement
requires.
For this purpose, Keras applications provide neural network architectures. These
architectures vary with each other based on size, number of parameters, depth, etc.
Some of these popular models are:
• Xception
• VGG16
• VGG19
• ResNet50
• ResNet101V2
• InceptionV3
• MobileNet
• DenseNet201

Fig. 1 Photograph of healthy Okra leaf
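Adapting one of these pretrained backbones to our six-class task can be sketched as follows; the pooling head and freezing choice are assumptions made for illustration, since the paper specifies only the (None, 224, 224, 3) input and the (None, 6) softmax output.

```python
# Hedged sketch: transfer learning with a Keras pretrained backbone.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_classes=6, input_shape=(224, 224, 3),
                     weights="imagenet"):
    # DenseNet201 backbone without its original 1000-class top layer.
    base = tf.keras.applications.DenseNet201(
        weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze the pretrained ImageNet weights
    x = layers.GlobalAveragePooling2D()(base.output)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Swapping `DenseNet201` for `InceptionV3` or `ResNet50` yields the other two models compared later.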

4.2 Dataset

The dataset consists of cotton and okra leaves, both healthy and diseased. We
used six classes for prediction as the output, namely:
• Fresh cotton leaf: 427 images
• Diseased cotton leaf: 288 images
• Fresh cotton plant: 427 images
• Diseased cotton plant: 815 images
• Fresh okra leaf: 196 images
• Diseased okra leaf: 125 images
Figures 1, 2, and 3 show sample dataset images.
The credit for the cotton leaves dataset goes to Mr. Akash Zade; the okra leaves
dataset was created by us. We have divided the dataset into training and testing
folders.

4.3 Process Flow

In this section, we discuss the complete process flow of our implementation. We
implemented our application with a user interface deployed on localhost, executing
our source code in the Spyder IDE via Anaconda Navigator.
Fig. 2 Photograph of diseased Okra leaf

Fig. 3 Photograph of fresh cotton leaf

First, we download and install all the software requirements mentioned in the
requirements section. After opening Anaconda Navigator, we create a TensorFlow
environment and download all the packages mentioned previously. Once the
packages are downloaded, we import all packages and modules into our code file.
We trained our neural network model on Google Colab and saved it as a ".h5"
file, which we load using the "load_model()" function. We then define our predict
function, which takes an input image and returns the prediction with the help of
the "predict()" function. Finally, we create a Flask instance and deploy the
application on the local system (localhost).
For the implementation, we just need to choose any image on the localhost. On
clicking the submit button, we are redirected to the corresponding output
classification page, which gives us a prediction about the plant's health.
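The Flask deployment described above can be sketched as follows; the route name and the classify() placeholder are assumptions for illustration, not our actual source code, while the six class labels come from the dataset section.

```python
# Hedged sketch of a Flask endpoint serving leaf-disease predictions.
import io
from flask import Flask, request

app = Flask(__name__)

CLASS_NAMES = ["fresh cotton leaf", "diseased cotton leaf",
               "fresh cotton plant", "diseased cotton plant",
               "fresh okra leaf", "diseased okra leaf"]

def classify(img_bytes):
    # Placeholder: a real implementation would decode the image, resize it
    # to 224 x 224, and call model.predict() on the loaded .h5 model.
    return CLASS_NAMES[0]

@app.route("/predict", methods=["POST"])
def predict_route():
    img_bytes = request.files["image"].read()
    return {"prediction": classify(img_bytes)}
```

Running `flask run` on localhost and posting an image file to `/predict` returns the predicted class as JSON.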
Algorithm:
Step 1: Import all the necessary packages.
Step 2: Mount the drive on Colab.
Step 3: Create the model object.
Step 4: Compile the model.
Step 5: Perform image augmentation on the training images.
Step 6: Train the convolutional neural network model.
Step 7: Plot the training and validation curves.
Step 8: Load an image for prediction and convert it to an array.
Step 9: Test using the predict() function.
The flowchart of the whole process is shown in Fig. 4.
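The image augmentation in Step 5 above can be sketched with Keras; the specific shear, zoom, and flip settings are illustrative assumptions, not our exact training values.

```python
# Hedged sketch of the Step 5 augmentation pipeline.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_augmenter():
    # Rescale to [0, 1] and apply mild geometric augmentation.
    return ImageDataGenerator(rescale=1.0 / 255,
                              shear_range=0.2,
                              zoom_range=0.2,
                              horizontal_flip=True)
```

During training, `make_augmenter().flow_from_directory(...)` would stream augmented batches from the training folder.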

5 Results

Fig. 4 Flowchart of the process (Start → Input image → Image pre-processing and
labelling → Augmentation process → Neural network training → Testing →
Classified disease / healthy image → Output result)

Our application is correctly able to predict a disease among the six classes. After
clicking the predict button, the application correctly redirects to the appropriate
page for the predicted disease.
Table 1 summarizes the different CNN architectures used: InceptionV3,
DenseNet201, and ResNet50, all of which are convolutional neural network
architectures. The number of parameters indicates the total number of parameters
used by each architecture; as the table shows, any one architecture uses parameters
in the millions, and a single layer can use up to a million parameters.

Table 1 Comparative analysis of different models

Model name     Accuracy (%)   Number of parameters
InceptionV3    95.23          23,851,784
DenseNet201    98.41          20,242,984
ResNet50       72.38          25,636,712

Fig. 5 Application interface

Figure 5 depicts the user interface of our application. Based on our research, we
have come up with the following aspects that can be considered while diagnosing
plant diseases:
• The solution should be easily available to farmers, which can be supported by
designing a mobile application or website.

• The algorithm can be tested with images under mixed lighting conditions to
determine whether the required accuracy is achieved.
• To increase the effectiveness of the detection technique, a combination of various
features and learning methods can be used.
• A feature that determines the severity of the disease should be considered, to
help farmers take corrective action in time.
• The model should be trained and tested with more varied data, perhaps
considering more types of plants.
• The model should be extended to detect additional diseases which may not be
so common.

6 Conclusion

Thus, we are successfully able to predict plant disease by implementing a web application running on a local host. Two factors must be taken into consideration when selecting an appropriate neural network architecture. The first factor is accuracy: the DenseNet201 architecture gave us the highest accuracy. The second factor is the number of parameters; this also needs attention because it increases execution time. Hence our recommendation is to strike the right balance between these two factors depending upon the hardware of the user.
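One simple way to make the recommended accuracy-versus-parameters trade-off concrete, using the figures from Table 1 (the parameter budget passed in is a hypothetical user constraint, not something the paper specifies):

```python
# Accuracy (%) and parameter counts taken from Table 1.
MODELS = {
    "InceptionV3": (95.23, 23_851_784),
    "DenseNet201": (98.41, 20_242_984),
    "ResNet50":    (72.38, 25_636_712),
}

def pick_model(param_budget, models=MODELS):
    """Return the most accurate model whose parameter count fits the budget."""
    feasible = [m for m, (_, params) in models.items() if params <= param_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda m: models[m][0])
```

With these numbers DenseNet201 wins whenever it fits, since it has both the highest accuracy and the fewest parameters.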
We conclude this review with the hope that deeper research in this area will create a positive impact on the agricultural industry through disease-free crops.

Acknowledgements All four researchers would like to thank their guide Prof. Dr. Manjiri
Ranjanikar for her constant support and guidance in writing this paper.

References

1. J.G.A. Barbedo, Digital image processing techniques for detecting, quantifying and classifying
plant diseases. SpringerPlus 2(1), 1–12 (2013)
2. S. Sladojevic, et al., Deep neural networks based recognition of plant diseases by leaf image
classification. Comput. Intell. Neurosci. 2016 (2016). https://doi.org/10.1155/2016/3289801
3. Y. Toda, F. Okura, How convolutional neural networks diagnose plant disease. Plant Phenomics
2019 (2019). https://doi.org/10.34133/2019/9237136
4. S.P. Mohanty, D.P. Hughes, M. Salathé, Using deep learning for image-based plant disease
detection. Front. Plant Sci. 7, 1419 (2016)
5. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional
neural networks. Commun. ACM 60(6), 84–90 (2017)
6. C.H. Bock, G.H. Poole, P.E. Parker, T.R. Gottwald, Plant disease severity estimated visually,
by digital photography and image analysis, and by hyperspectral imaging. Crit. Rev. Plant Sci.
29(2), 59–107 (2010)
7. A.K. Mahlein, E.C. Oerke, U. Steiner, H.W. Dehne, Recent advances in sensing plant diseases
for precision crop protection. Eur. J. Plant Pathol. 133(1), 197–209 (2012)
8. S. Sankaran, A. Mishra, R. Ehsani, C. Davis, A review of advanced techniques for detecting plant diseases. Comput. Electron. Agric. 72(1), 1–13 (2010)
9. N.E. Abdullah, A.A. Rahim, H. Hashim, M.M. Kamal, Classification of rubber tree leaf
diseases using multilayer perceptron neural network, in 5th student conference on research
and development. IEEE (2007), pp. 1–6
10. D.G. Sena Jr., F.A.C. Pinto, D.M. Queiroz, P.A. Viana, Fall armyworm damaged maize plant
identification using digital images. Biosys. Eng. 85(4), 449–454 (2003)
11. D. Story, et al., Lettuce calcium deficiency detection with machine vision computed plant
features in controlled environments. Comput. Electron. Agricult. 74(2), 238–243 (2010)
12. B.L. Boese, P.J. Clinton, D. Dennis, R.C. Golden, B. Kim, Digital image analysis of Zostera
marina leaf injury. Aquat. Bot. 88(1), 87–90 (2008)
13. S. Albawi, A.M. Tareq, S. Al-Zawi, Understanding of a convolutional neural network, in 2017
International Conference on Engineering and Technology (ICET). IEEE (2017). https://doi.
org/10.1109/ICEngTechnol.2017.8308186
14. A. Meunkaewjinda, P. Kumsawat, K. Attakitmongcol, A. Srikaew, Grape leaf disease detection
from color imagery using hybrid intelligent system, in 5th International Conference on Elec-
trical Engineering/Electronics, Computer, Telecommunications And Information Technology,
vol. 1. IEEE (2008), pp. 513–516
15. M.A. Hairuddin, N.M. Tahir, S.R.S. Baki, Overview of image processing approach for nutrient
deficiencies detection in Elaeis Guineensis, in 2011 IEEE International Conference on System
Engineering and Technology. IEEE (2011), pp. 116–120
16. M. Chohan, A. Khan, R. Chohan, S.H. Katpar, M.S. Mahar, Plant disease detection using deep
learning. Int. J. Recent Technol. Eng. (IJRTE) 9(1). ISSN: 2277-3878
Multimodal MRI Analysis
for Segmentation of Intra-tumoral
Regions of High-Grade Glioma Using
VNet and WNet Based Deep Models

Sonal Gore, Prajakta Bhosale, Ashley George, Ashwin Mohan, Prajakta Joshi, and Anuradha Thakare

Abstract Automatic segmentation of brain tumors plays an important role in the diagnosis of cancer. This work explores CNN-based auto-segmentation of high-grade glioma using two models. First, the basic three-dimensional VNet model is applied to 2D images using the same architecture. Second, the original WNet model is enhanced by making it deeper, with additional convolutional layers on the encoder-decoder paths of both UNet-like segments. In total, 31 and 44 convolutional layers are used with 2D-VNet and modified WNet, respectively, in experiments on BraTS 2018 MRI data. The models generate multi-region segmentation with three classes corresponding to the internal structures of the tumor, namely enhancing, non-enhancing/necrosis, and edema. Test accuracies of 99.52% and 99.49%, dice scores of 0.9957 and 0.9958, and dice losses of 0.425 and 0.414 are obtained by 2D-VNet and WNet, respectively. Training takes 44 and 77 s per epoch for 2D-VNet and WNet, respectively. The modified WNet exhibits more complexity than the 2D-VNet model, whereas the performance of both is almost similar.

Keywords MRI · High-grade Glioma · Segmentation · WNet · VNet · Residual blocks · Enhancing tumor · Non-enhancing tumor · Edema

1 Introduction

A brain tumor, like any other neoplasm in the body, occurs when cells grow at an
abnormal rate to form a mass of abnormal cells. These neoplasms grow over a varying
period of time depending upon whether they are benign or malignant (cancerous) in
nature. The brain is enclosed within the skull, so when these neoplasms start growing
inside the brain, the pressure within the closed space of the skull increases and gives
rise to symptoms indicating that the person is suffering from a brain tumor. Tumors may be non-cancerous (benign) or cancerous (malignant). There are two main brain tumor types: primary tumors and metastatic/secondary

S. Gore (B) · P. Bhosale · A. George · A. Mohan · P. Joshi · A. Thakare
Department of Computer Engineering, Pimpri Chinchwad College of Engineering, Pune, Maharashtra, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 89
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_7

neoplasms. A primary neoplasm originates within the brain and may be either benign or malignant. Secondary tumors are those that originate in other parts of the body, such as the breasts or lungs, and later spread to the brain through the blood. Secondary or metastatic brain tumors are always cancerous and malignant. According to a study [1], the incidence of brain tumor cases in the Indian population ranges from 5 to 10 per 100,000 people, with an increasing trend; 40% of all cancers spread to the brain. Brain cancer is also the second most common cancer in children, accounting for nearly 26% of infant cancers.
Recently, automatic tumor segmentation methods have achieved great advances using deep neural networks. However, manual or semi-automatic segmentation methods still dominate clinical practice, which is cumbersome and requires the doctor's expertise. Annotating medical scans requires expert opinion, which is costly, time consuming, and susceptible to inter-rater variability. Auto-segmentation helps to reduce these consequences for tumor diagnosis. Glioma segmentation has clear clinical relevance as well as importance for treatment planning after immediate assessment of its severity by type.
In order to tackle this problem, two deep neural networks are proposed here, namely a deep WNet with residual blocks and a 2D-VNet, for accurate auto-segmentation of brain tumor tissues into three different regions. There is a strong need to segment the tumor area at the micro level, which helps to identify the different regions of the tumor area as enhancing, non-enhancing or necrotic, and peri-tumoral edema. These regions carry different characteristics, which bear differently on severity, treatment planning, and further consequences for the quality of life of glioma patients. The original VNet model was proposed to operate on 3D MRI/CT volumes; in our proposed system, a VNet model with a similar architecture is applied to 2D images. The original WNet model consists of two cascaded UNet-based models, which are further enhanced by making them deeper with additional convolutional layers on the encoder and decoder paths of both UNet-like segments of the WNet network.

2 Related Work

Brain tumor segmentation must be done as quickly and as accurately as possible. Various image processing methods and artificial intelligence (machine learning and deep learning) techniques have been developed for tumor segmentation. These automatic models help to prevent the inter-rater heterogeneity of manual tumor segmentation. There is still room to develop more generic or intelligible segmentation approaches to aid brain tumor classification based on region-specific characteristics. Segmentation using fully convolutional networks [2] is an ongoing research area and has been introduced in different competitions on segmentation and classification tasks. One of the most notable works is by Ronneberger et al. [4], who presented a network and training approach that depends heavily on data augmentation techniques to make better use of the available labeled samples. The architecture is made up of a contracting path for gathering
context and an exactly symmetric expansive path for precise localization. They demonstrated that such a network can be trained end to end using just a few images. Shreyas et al. [3] proposed a deep learning-based architecture for segmentation of brain tumors from MRI scans; the work builds upon the UNet architecture proposed in [4] and uses a fully convolutional neural network. They experimented with a UNet model with changes such as using empirically derived class weights instead of pixel-wise weight maps, which reduced the need for high memory usage. Batch-normalization layers are also included in the network after every convolution step in order to speed up training and reduce the effect of internal covariate shift [5]. A few studies have used preprocessing techniques such as intensity normalization and bias correction to prepare MRI images for final processing
[6, 7]. As the present study works with WNet and VNet [12], we studied the method that made use of a cascade of WNet and UNet. That model designed UNet in conjunction with WNet with the intent that WNet would segment the complete tumor portion from normal brain tissue, and the output bounding box would be fed as input to UNet to further segment the tumor region into its internal sub-regions. This approach was taken by Reji et al. [24] in order to increase the accuracy of UNet predictions. Another notable work is the cascaded framework proposed by Wang et al. [8], where three fully convolutional networks (FCNs) were employed in hierarchical and sequential fashion in order to segment the internal structures of a brain tumor, each of these FCNs dealing with a binary classification problem in the segmentation task. First, a multimodal 3D image is fed as input to a WNet model that segments the complete tumor, and its cropped region (bounding box of the whole tumor) is obtained. Then, the cropped region from WNet [9] is provided as input to a second network, TNet [10], to segment the core region specifically. Similarly, the region inside the bounding box generated by TNet is used as input to ENet [11] to segment
particularly the enhancing core region of the tumor. Another work, by Casamitjana et al. [13], implemented a cascaded VNet architecture that redesigned the residual connections and utilized region-of-interest masks in order to further segment the relevant regions of the brain; this helped to solve the class imbalance problem generally inherent in tumor segmentation tasks. A two-step process was used, where the first step localizes the tumorous tissues and sub-regions within the tumor are distinguished in the second step by ignoring a set of unwanted background voxels. The steps are conducted using two CNN (Convolutional Neural Network) models, in which the prediction of the first model is provided to the second as its input. Chen et al. [14] proposed a bridging design between two UNet architectures. It connects each layer of the expansive path of the first UNet with its corresponding layer in the contracting path of the second UNet, and it directly feeds the auto-learnt feature maps of earlier layers to later layers. This technique reduced the training cost and exhibited better performance in contrast to a single UNet architecture. A fully automated tumor identification and volume estimation method was introduced in the work by Ogretmenoglu et al. [15] on FLAIR images, in which mean/area-variance-dependent analysis is used to assess whether or not there is a tumor in any part of the hemisphere. Gadolinium-injected T1-weighted scans are used to exclude non-brain areas such as fat tissues and the skull. Clustering using the fuzzy
C-means method is used to detect the edema region in FLAIR scans, while threshold segmentation is used to detect the tumor in T1 post-gadolinium (T1CE) images. Tumor volume is measured using the tumor area and the MRI slice thickness. Kermi et al. [16] tested a UNet-based model with modifications such as residual blocks with large padding and the absence of a pooling layer. It used data augmentation on BraTS 2018 challenge data and obtained mean dice coefficient values of 0.868, 0.783, and 0.805 for the whole tumor, enhancing, and core regions, respectively. Segmentation in the presence of a dynamic context and a fuzzy boundary is one of the most challenging tasks in biomedical image segmentation. To address this problem, Huimin Huang et al. [17] proposed WNet, a double U-shape-based architecture capable of exact localization along with sharpening of inter-regional margins. The study constructed an atlas-based segmentation network [18] to generate position-aware segmentation maps using prior knowledge of human anatomy. Furthermore, other work has developed a refinement network for boundary enhancement [19] to produce a consistent boundary. The experimental findings demonstrated that the presented WNet model reliably captures the desired body part with sharpened detail and thus increased efficiency on two datasets, with dice score values of 0.9661 and 0.9625 for liver and spleen, respectively. Research in deep learning for segmentation of medical images has therefore helped greatly to improve the ability to treat ailments. MRI-based automatic segmentation of brain tumors plays a crucial role in tumor diagnosis and in surgical or other suitable treatment planning for brain cancer patients, and convolutional neural networks have been extensively employed for segmentation tasks.
Hence, our work explores the segmentation of high-grade glioma tumors using two deep learning models. The first method uses the VNet model and the second a WNet model with residual blocks. The original VNet model, proposed for 3D volumes, is applied to 2D images in our proposed system using the same architecture as the basic VNet model. The original WNet model consists of two bridged UNet-based models, which are further enhanced by making them deeper with additional convolutional layers on the encoder and decoder paths of both UNet-like segments of the original WNet network.

3 Data Preparation and Preprocessing

The data released under the BraTS 2018 challenge (Multimodal Brain Tumor Segmentation Challenge), provided by Medical Image Computing and Computer Assisted Intervention (MICCAI), is used in our work [20, 21]. The data consists of clinically acquired presurgical multimodal MRI scans of 210 high-grade glioma (HGG) patients; each contains volumes of four MRI modalities, namely T1-weighted (T1W), gadolinium-enhanced T1-weighted (T1-GD), T2-weighted (T2W), and fluid attenuated inversion recovery (FLAIR), as well as the ground truth volume segmented by an expert team of neuro-oncologists and radiologists. This data is already pre-processed by co-registration to a standard anatomical template, interpolation to the same resolution (1 mm³), and skull stripping. Annotations labeled by the expert team are the gadolinium-enhancing tumor area, ET (label '4'); the peri-tumoral edema region, ED (label '2'); and the necrotic or non-enhancing tumor core, NCR or NET (label '1'). Among the 210 patients, 50 random patients are selected due to the limited availability of computational power. Each 3D volume of all four MRI modalities is of size 240 × 240 × 155; these are sliced, cropped, and normalized into a total of 31,000 images of size 192 × 192.
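The image counts quoted here and in Sect. 4 follow directly from the cohort size, as this small consistency check shows:

```python
# 50 patients x 155 axial slices per volume = 7750 slices per modality;
# with 4 modalities this gives the 31,000 images used in the experiments.
patients, slices_per_volume, modalities = 50, 155, 4

images_per_modality = patients * slices_per_volume
total_images = images_per_modality * modalities
```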

4 Method

The method first preprocesses the images by converting every 3D MRI volume into a NumPy array using the SimpleITK library available in Python. Each 3D NumPy array consists of 155 two-dimensional slices, each of size 240 × 240. These slices are further cropped to a size of 192 × 192 to remove unwanted background pixels. The 7750 2D images from each of the 4 MRI modalities result in a total of 31,000 images, which are included in our experimentation. These images are then standardized using the z-score normalization method, and the normalized images are fed as input to the deep models (2D-VNet and modified WNet) for automatic feature extraction. The model learns from the auto-extracted feature set and finally generates the segmented output with three classes: enhancing, non-enhancing, and edema sub-regions. The typical flow of the proposed system, which follows the steps described above to preprocess and prepare the data, is presented in Fig. 1.
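The cropping and z-score steps can be sketched as below. The paper does not say where the 192 × 192 window is taken from, so a center crop is assumed here, and plain Python lists stand in for the NumPy arrays used in practice.

```python
from statistics import mean, pstdev

def center_crop(slice2d, size=192):
    """Center-crop a 2D slice (list of rows) to size x size pixels.
    Assumption: the crop is centered; the paper does not specify offsets."""
    top = (len(slice2d) - size) // 2
    left = (len(slice2d[0]) - size) // 2
    return [row[left:left + size] for row in slice2d[top:top + size]]

def z_score(slice2d):
    """Standardize a slice to zero mean and unit variance."""
    flat = [v for row in slice2d for v in row]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(v - mu) / sigma for v in row] for row in slice2d]
```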

Fig. 1 Typical flow of system



4.1 Architecture of 2D-VNet

Figure 2 provides a schematic representation of the VNet model for 2D input.
The VNet model consists of contracting (down-sampling) and expanding (up-sampling) paths like the UNet model given in [4]. The down-sampling and up-sampling paths consist of 5 convolutional blocks, each followed by a residual block layer; each convolutional block consists of one to three convolutional layers. Each convolutional block on the down-sampling path consists of convolutional layers with 5 × 5 filters, and the last layer of each block performs convolution using filters of size 2 × 2 with stride 2, followed by batch normalization and an activation function, the Parametric Rectified Linear Unit (PReLU) [22]. An addition operation is performed at the residual block to utilize the features of previous layers for better learning of the model. Such feature fusion has been commonly used in many computer vision tasks, such as ResNet-based [23] classification. As fusing features with an addition operation changes the distribution of weights, this method performs better for our model than a concatenation operation. Down-sampling reduces the input size by 50% and doubles the number of features at each block. A convolution layer with stride 2 is used for down-sampling instead of pooling layers. Though a convolution layer takes more computation time than a pooling layer, we used a stride-2 convolution layer to extract features, as it may learn certain properties that a pooling layer cannot learn during training. On the up-sampling side, each block contains a deconvolution layer which maps to the features extracted by the respective down-sampling block and produces output of the same size as the input volume. Finally, a softmax activation function is applied to the output layer, and the adaptive moment estimation (ADAM) optimizer with a learning rate of 2e−4 is used to train the model.
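Under the rule stated above (each down-sampling block halves the spatial size and doubles the feature count), the 192 × 192 input shrinks as sketched below. The starting filter count of 16 is an assumption for illustration; the paper does not state it.

```python
def downsample_path(size=192, filters=16, blocks=5):
    """Spatial sizes and feature counts along the contracting path."""
    sizes, feats = [size], [filters]
    for _ in range(blocks):
        size //= 2      # a stride-2 convolution halves each spatial dimension
        filters *= 2    # the number of feature channels doubles per block
        sizes.append(size)
        feats.append(filters)
    return sizes, feats
```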

Fig. 2 Architecture of 2D-VNet



Fig. 3 Architecture of modified WNet based deep model

4.2 Architecture of Deep WNet with Residual Blocks

Figure 3 shows a schematic representation of the modified WNet with residual blocks and a deeper architecture. The WNet model consists of two cascaded UNet-based networks [4]. Each contracting and expanding path of the modified WNet network contains 5 blocks; each encoder/decoder block consists of two consecutive convolution layers with 3 × 3 kernels and rectified linear unit (ReLU) activation, and each block is followed by a maximum pooling layer and then a residual block. The residual block helps to preserve the location information of pixels through each convolutional layer of the down-sampling path; it learns from the residue of the true output and the input. The architecture of the residual block is shown in Fig. 4. The numbers of filters used are 32, 64, 128, 256, and 512 along the encoder/decoder paths, with 1024 filters at the connection point of encoder and decoder. At the end, a softmax activation function is applied to the output layer. A dropout layer with a dropout rate of 20% is applied at the end of each block, and the ADAM optimizer (with a learning rate of 1e−5) is used to train the model.
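The residual block's behavior (learning the residue while the skip path carries the input through unchanged) reduces to y = f(x) + x. A minimal sketch, with an arbitrary element-wise transform standing in for the block's convolution layers:

```python
def relu(v):
    """Element-wise rectified linear unit on a flat vector."""
    return [max(0.0, u) for u in v]

def residual_block(x, transform):
    """y = transform(x) + x: the identity skip path preserves the input,
    so the block only needs to learn the residue."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]
```

Note how an all-zero transform leaves the input untouched: the skip connection guarantees the block can do no worse than identity.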

5 Experimentation and Results

Two different deep learning models, namely 2D-VNet and WNet with residual blocks, are trained using 60% of the images, validated on a separate cohort of 20% of the images, and tested on a further separate group of 20%. The performance of these two models is measured by a modified dice coefficient metric, which evaluates both the accuracy of the automated probabilistic segmentation (the spatial overlap between actual and predicted) and the reproducibility of the manual segmentation of the MRI images.

Fig. 4 Residual block architecture

The original dice coefficient formula with a slight modification is used to measure the performance of the models; it computes the sum of squares of the actual and predicted outputs, with an epsilon value of 1e−6 added to the denominator, instead of the sum of absolute values of the actual and predicted outputs, as shown in Eq. 1.

Dice Coefficient = (2 × |SEG(gt) ∩ SEG(pr)|) / (SEG²(gt) + SEG²(pr) + ε)    (1)

where SEG(gt) is the ground truth segmentation of the tumor with three-class annotation, SEG(pr) is the predicted segmentation of the tumorous region with predicted three-class labels, and ε is 1e−6. The dice loss is then calculated as per Eq. 2.

L_Dice = 1 − Dice Coefficient    (2)
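Equations 1 and 2 can be implemented directly over flattened binary label maps; this sketch mirrors the modified denominator (sum of squares plus ε = 1e−6):

```python
EPS = 1e-6

def dice_coefficient(gt, pr):
    """Modified Dice of Eq. 1 over flattened 0/1 segmentation maps."""
    intersection = sum(g * p for g, p in zip(gt, pr))
    denom = sum(g * g for g in gt) + sum(p * p for p in pr) + EPS
    return 2.0 * intersection / denom

def dice_loss(gt, pr):
    """Eq. 2: L_Dice = 1 - Dice coefficient."""
    return 1.0 - dice_coefficient(gt, pr)
```

A perfect prediction gives a coefficient just below 1 (because of ε) and a loss near 0, while fully disjoint masks give a coefficient of 0.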

The experimentation is carried out in Python, with Keras and TensorFlow 2.4 as the backend. All experiments are conducted on Google Colaboratory with 25 GB RAM, on 31,000 brain tumor slices obtained from the data of 50 HGG patients with four different modalities. The data is split in the ratio 3:1:1, that is, 60% for training, 20% for validation, and 20% for testing. Both models are trained with a batch size of 8 for 30 epochs. The performance of both models is measured by dice coefficient, dice loss, and accuracy. Table 1 shows the performance metrics obtained during the training, validation, and testing phases.
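The 3:1:1 split can be sketched as an index partition. The shuffling seed and strategy here are assumptions for illustration; the paper does not describe them.

```python
import random

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 60/20/20 into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Integer arithmetic avoids floating-point rounding on the split sizes.
    n_train, n_val = n * 6 // 10, n * 2 // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For the 31,000 slices used here this yields 18,600 training, 6200 validation, and 6200 test samples with no overlap.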
Table 1 Performance metrics obtained during training, validation, and testing

Method         Training                        Validation                      Testing
               Accuracy Dice coeff. Dice loss  Accuracy Dice coeff. Dice loss  Accuracy Dice coeff. Dice loss
2D-VNet        99.5549  99.6391     0.3608     99.5115  99.5663     0.4336     99.5210  99.5744     0.4255
Modified WNet  99.5712  99.6459     0.3541     99.5044  99.5901     0.4099     99.4973  99.5856     0.4143

Bold indicates the highest values for accuracy and dice coefficient, and the lowest value for dice loss, obtained during the training, validation, and testing phases

Fig. 5 a Dice loss and accuracy graph for 2D-VNet, b graph for modified WNet

It is observed from Table 1 that the modified WNet model with residual blocks gives slightly higher accuracy and dice coefficient and lower dice loss compared to 2D-VNet during training. The proposed WNet model also gives slightly better results in terms of dice coefficient than 2D-VNet during the testing phase. The loss and accuracy graphs of training and validation for 2D-VNet and modified WNet are given in Fig. 5a, b, respectively.
The qualitative comparison of the results on the FLAIR modality between 2D-VNet and the modified WNet with additional residual blocks is presented in Fig. 6. It shows the original 2D slice of the FLAIR modality, its ground truth segmentation, and its predicted segmented tumorous regions, divided into 3 classes: enhancing tumor (yellow), edema (green), and non-enhancing or necrotic tumor (dark green).
The segmentation maps are obtained from 2D-VNet and the modified deep WNet with residual blocks. Each tumor region is highlighted with a different color: edema in green, the enhancing tumor region in yellow, and the non-enhancing tumor area in dark green. It should be highlighted that the segmentation map produced by the modified deep WNet with residual blocks gives more accurate results than 2D-VNet, maintaining sharp boundaries even for small objects. The WNet-based model performed better than the 2D-VNet-based model, with a slightly higher dice coefficient. The dice coefficient achieved with both models signifies a very high similarity between the ground truth and the predicted segmentation map.

Fig. 6 The segmented output of sample slice of FLAIR MRI

6 Conclusion

The work has demonstrated effective performance in segmenting tumorous tissue from non-tumorous tissue, and further in classifying different regions of tumorous tissue, using deep neural networks: 2D-VNet and a deep modified WNet with residual blocks. The segmentation of whole brain tumors and intra-tumoral regions was carried out for high-grade glioma. The proposed deep neural networks were tested and evaluated quantitatively on the BraTS 2018 dataset. The 2D-VNet model learns at the rate of 1.4 ms per 2D slice, whereas WNet takes comparatively more time, at 2.4 ms per 2D slice. The tests carried out showed that the segmentation results obtained by our models are very similar to those obtained manually by the experts. In comparison, the modified WNet architecture yielded a dice score of 99.58% during testing and showed slightly better results than 2D-VNet, with minute differences in the predicted segmentation output. Therefore, both architectures are equally effective for glioma segmentation with classification of intra-tumoral regions, but 2D-VNet is more time-efficient than the WNet model. These models can be further trained to segment low-grade glioma tumors along with high-grade glioma to make them more generalized.

References

1. A. Dasgupta, T. Gupta, R. Jalali, Indian data on central nervous tumors: a summary of published work. South Asian J. Cancer 5(3), 147–153 (2016)
2. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in
IEEE Conference on Computer Vision and Pattern Recognition (2015)
3. V. Shreyas, V. Pankajakshan, A deep learning architecture for brain tumor segmentation in
MRI Images, in IEEE 19th International Workshop on Multimedia Signal Processing (MMSP)
(2017), pp. 1–6. https://doi.org/10.1109/MMSP.2017.8122291
4. O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention. Lecture Notes in Computer Science, vol. 9351 (Springer, Cham, 2015), pp. 234–241
5. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing
internal covariate shift, in CoRR, vol. abs/1502.03167 (2015)
6. N.J. Tustison, J.C. Gee, N4ITK: Nick’s N3 ITK implementation for MRI bias field correction.
IEEE Trans. Med. Imaging 29(6), 1310–20 (2010)
7. L. Nyul, J. Udupa, On standardizing the MR image intensity scale. Magnet. Resonance Med.
42(6), 1072–1081 (1999)
8. G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using
cascaded anisotropic convolutional neural networks, in Brainlesion: Glioma, Multiple Scle-
rosis, Stroke and Traumatic Brain Injuries. Lecture Notes in Computer Science, vol. 10670
(Springer, Cham, 2017)
9. X. Xia, B. Kulis, W-Net: a deep model for fully unsupervised image segmentation, in Computer
Vision and Pattern Recognition. arXiv:1711.08506 (2017)
10. T.J. Jun, J. Kweon, Y.H. Kim, D. Kim, T-Net: nested encoder decoder architecture for the main
vessel segmentation in coronary angiography, in Neural Networks, vol. 128 (2020)
11. A. Paszke, A. Chaurasia, S. Kim, E. Culurciello, ENet: a deep neural network architecture for real-time semantic segmentation, in Computer Vision and Pattern Recognition. arXiv:1606.02147 (2016)
12. F. Milletari, N. Navab, S.A. Ahmadi, V-net: fully convolutional neural networks for volumetric
medical image segmentation, in Computer Vision and Pattern Recognition. arXiv:1606.04797
(2016)
13. A. Casamitjana, M. Cata, I. Sánchez, M. Combalia, V. Vilaplana, Cascaded V-Net using ROI
masks for brain tumor segmentation, in Brain Lesion: Glioma, Multiple Sclerosis, Stroke and
Traumatic Brain Injuries. Lecture Notes in Computer Science, vol. 10670 (Springer, Cham,
2018)
14. W. Chen, Y. Zhang, J. He, Y. Qiao, Y. Chen, H. Shi, X. Tang, Prostate segmentation using 2D
bridged U-net, in International Joint Conference on Neural Networks (2019), pp. 1–7. https://
doi.org/10.1109/IJCNN.2019.8851908
15. C. Ogretmenoglu Fiçici, O. Erogul, Z. Telatar, Fully automated brain tumor segmentation and volume estimation based on symmetry analysis in MR images, in CMBEBIH 2017. IFMBE Proceedings, vol. 62 (Springer, Singapore, 2017)
16. A. Kermi, I. Mahmoudi, M. Khadir, Deep Convolutional neural networks using U-Net for
automatic brain tumor segmentation in multimodal MRI volumes. In: BrainLes 2018, LNCS,
vol. 11384 (Springer, Berlin, 2019), pp. 37–48
17. H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, et al., WNET: an end-to-end atlas-guided and boundary-enhanced network for medical image segmentation, in IEEE 17th International Symposium on Biomedical Imaging (ISBI), 3–7 Apr 2020, Iowa City, Iowa, USA
18. G. Gindi, A. Rangarajan, I. Zubal, Atlas-guided segmentation of brain images via opti-
mizing neural networks, in Proceedings of SPIE Biomedical Image Processing and Biomedical
Visualization, vol. 1905 (1993). https://doi.org/10.1117/12.148668
19. C. Zhuang, X. Yuan, W. Wang, Boundary enhanced network for improved semantic segmen-
tation, in Urban Intelligence and Applications (2020), pp. 172–184

20. B. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby et al., The multimodal
brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34(10),
1993–2024 (2015)
21. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al., Advancing the Cancer
Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features.
Nat. Sci. Data 4, 170117 (2017)
22. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in Proceedings of the International Conference on Computer Vision (ICCV) (IEEE Computer Society, 2015), pp. 1026–1034
23. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE
Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778. https://doi.org/
10.1109/CVPR.2016.90
24. S. Reji, E. Earley, M. Basak, Brain tumor segmentation, in CS230: Deep Learning (Stanford
University, CA, 2018)
Early Onset Alzheimer Disease
Classification Using Convolution Neural
Network

Happy Ramani and Rupal A. Kapdi

Abstract Alzheimer's disease is one of the major causes of death. Treatment is most
effective in the early stage of the disease, as it is difficult to treat in later stages.
Diagnosis of this slowly progressing disease is difficult because it shows no
symptoms in the early stage. As deep neural networks have shown success in
processing medical images, this paper uses a convolutional neural network for
early detection of Alzheimer's disease through binary classification. The network
model uses T2-weighted magnetic resonance images from the Alzheimer's Disease
Neuroimaging Initiative dataset. The preprocessing extracts the slices containing the
hippocampal region from the three-dimensional images and removes the non-brain
region of each slice. The proposed method achieves 71.13% accuracy and performs
better than AlexNet in terms of loss and prediction time.

Keywords Alzheimer's disease · Convolution neural network · Deep learning ·
Magnetic resonance imaging

1 Introduction

Alzheimer's disease (AD) is a neurodegenerative disease and the most common
cause of dementia. The disease is usually seen in people older than 60 years.
"2019 Alzheimer's disease facts and figures" [1] described the impact of AD on patients
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging
Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI con-
tributed to the design and implementation of ADNI and/or provided data but did not participate in
analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://
adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

H. Ramani (B) · R. A. Kapdi


Institute of Technology, Nirma University, Ahmedabad, Gujarat, India
e-mail: 19MCEC04@nirmauni.ac.in
R. A. Kapdi
e-mail: rupal.kapdi@nirmauni.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 103
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_8
104 H. Ramani and R.A. Kapdi

Fig. 1 Leading causes of death [2]

and society, the cost of caring for AD patients, and the increase in the death rate
between 2000 and 2017. The report stated that, despite intensive care of AD and
other dementia patients, the death rate increases every year. The World Health
Organization [2] published information about the leading causes of death in the
world. Figure 1 gives a pictorial view of the death rate of AD compared to other
diseases. AD held the seventh position among the top ten causes of death globally
in 2019 and showed the second highest increase in death rate from 2000 to 2019.
AD causes brain shrinkage and the death of neurons. In the early stage of AD,
the patient does not show any physical changes, as these changes are minor and
unnoticeable. As the disease progresses, neurons in the brain are destroyed for two
reasons: (1) the abnormal growth of the protein fragment beta-amyloid and its
accumulation outside neurons, which results in neuron death and the prevention of
neuron-to-neuron communication, and (2) the abnormal growth of tau protein tangles
inside the neuron, which prevents the flow of nutrients and other molecules within
the neuron. After around 20 years or more, these slowly progressing changes manifest
as physical symptoms such as memory loss, language problems, and an inability to
remember things, to a degree that patients need full-time assistance. Symptoms of
AD differ from person to person. Once neurons are dead, there is no way to bring
them back; because of this, no treatment is available to cure AD.
To detect AD, doctors use questionnaires and tests of attention, memory, language,
etc. If these tests suggest AD at a higher stage, experts such as neurologists and
radiologists use MRI to diagnose the structural changes in the brain due to Alzheimer's.
Researchers have tried to address early-stage AD with various automated and semi-
automated approaches; such approaches are discussed in Sect. 2. This paper uses
MRI to diagnose AD with a deep neural network approach. The dataset description
and preprocessing mechanism are covered in Sect. 3. The proposed method and
results are discussed in Sects. 4 and 5, and Sect. 6 concludes the work with directions
for future work.

2 Literature Survey

In this section, the literature survey carried out for this research work is explained
in various subsections.

2.1 State-of-the-Art Survey

Over the past several years, many techniques have been developed to diagnose AD
and classify its stages. Because adequate datasets are scarce, most of the research work
has been done on the publicly available ADNI and OASIS datasets. This research work
can be categorized into techniques based on variations of CNN models and other
conventional methodologies. Figure 2 shows the survey taxonomy.

Fig. 2 Survey taxonomy



2.2 Classification Using Conventional Approaches

The conventional approaches used by researchers to address AD classification
are shown in Table 1.

2.3 Classification Using Variations of CNN

Various CNN models have been developed to address AD diagnosis and classification.
These benchmark CNN models have limited depth but still provide good accuracy.
Another variation of CNN is the transfer learning approach for the detection and
classification of AD, as it gives higher accuracy than other solutions. In transfer
learning, pretrained weights, i.e., parameters learned on a particular dataset, are
reused for AD classification. Descriptions of these research works are given in Table 2.

Table 1 Description of conventional approaches used for AD classification

Manzak et al. [3]
Method: Random forest feature elimination method and a neural network to detect AD
Dataset: ADNI clinical data
Preprocessing: Data of healthy patients and of patients not diagnosed further, as well as attributes with 50% or more missing values, are removed
Accuracy: 67% for AD versus cognitive normal (CN)

F. Ahmad et al. [4]
Method: Principal component analysis (PCA)-based algorithm to classify AD into four stages
Dataset: ADNI functional MRI (fMRI) dataset
Preprocessing: Low-pass filter applied to remove respiratory and cardiac noise effects; generalized linear model used for analyzing the subject's signal
Accuracy: 95% for classifying AD stages

Suk et al. [5]
Method: DL-based feature representation along with a stacked auto-encoder to classify three stages of AD
Dataset: ADNI dataset of MRI, PET, and cerebrospinal fluid (CSF) with two clinical features, mini-mental state examination and ADAS-Cog; subjects are divided into 51 AD, 99 MCI (43 MCI to AD, 56 remaining MCI), and 52 normal
Preprocessing: Anterior commissure-posterior commissure correction, skull stripping, and cerebellum removal applied to MRI and PET; segmentation of MRI and PET; parcellation into regions of interest (ROIs)
Accuracy: 95.9% for AD versus CN; 85% for MCI versus CN; 75.8% for MCI-C versus CN

Table 2 Description of CNN model's variations used for AD classification

Lin et al. [6]
Method: Feature selection using a CNN model and FreeSurfer software, dimension reduction using PCA, and sparse feature selection using Lasso to predict conversion from MCI to AD
Dataset: ADNI dataset of 1.5T MRIs, in which 169 subjects converted from MCI to AD and 139 did not
Preprocessing: Skull stripping; histogram normalization
Accuracy: 79.9% for MCI to AD conversion

Maqsood et al. [7]
Method: AlexNet to classify AD into four stages
Dataset: OASIS dataset of 382 total subjects: 167 no dementia, 87 very mild dementia, 105 mild dementia, 23 moderate dementia
Preprocessing: Images normalized by enhancement techniques and linear contrast stretching, segmented into gray matter (GM), white matter (WM), and CSF, and transformed to 227 × 227 size
Accuracy: 92.8% for classifying AD stages

Puranik et al. [8]
Method: Inception-ResNet V2 deep architecture to detect AD
Dataset: ADNI T1-weighted MRI and fMRI
Preprocessing: Brain images transformed from DICOM to NII format to JPEG files to TF record files; last five files of each subject removed
Accuracy: 98.41% for AD versus CN

Chitradevi et al. [9]
Method: Segmentation of brain regions and an AlexNet model to diagnose AD
Dataset: MRI brain images of various slices (axial, coronal, and sagittal) from Chettinad Health City, Chennai; 1.5T 2D T2 FLAIR-weighted MRI images of 200 subjects, 100 each of normal and AD
Preprocessing: Histogram equalization for contrast enhancement; Otsu's thresholding for skull stripping; segmentation of brain regions such as GM and WM from the axial slice, corpus callosum (CC) from the sagittal slice, and hippocampus (HC) from the coronal slice
Accuracy: 95% for AD versus CN

Afzal et al. [10]
Method: AlexNet to detect AD
Dataset: OASIS dataset of 218 subjects
Preprocessing: Rescale from 256 × 256 to 227 × 227; data augmentation by rotating the image and cropping from the right, bottom, left, corner, and top sides as well as whole crops
Accuracy: 98.41% for AD versus CN (2D view); 95.11% for AD versus CN (3D view)

Khan et al. [11]
Method: Pretrained weights of the VGG model to detect and classify AD into three stages: AD, MCI, and normal cognitive (NC)
Dataset: ADNI dataset with 50 subjects of each stage: AD, MCI, and NC
Preprocessing: Extracted 2D slices of 166 × 256 size and resized to 128 × 128
Accuracy: 99.36% for AD versus NC; 95.91% for three-class classification

2.4 Convolutional Neural Network (CNN)

CNN is a deep neural network (DNN) architecture which tries to mimic the natural
visual perception mechanism [12]. From the invention of the first CNN framework
to the present, significant development has been carried out on CNNs. Researchers
have also developed benchmark CNN models such as LeNet, AlexNet, InceptionNet,
VGGNet, ResNet, and GoogleNet. CNNs address a wide variety of problems,
including image classification, text recognition, object detection, natural language
processing, and many more. CNNs have gained popularity for their learning
capability of automatic feature extraction without any human intervention. The
fundamental components of a CNN are the convolutional layer, pooling layer,
fully connected (FC) layer, activation function, loss function, regularization,
and optimization.

3 Dataset Description and Preprocessing Mechanisms

Details of the dataset used for this research work and the preprocessing method are
explained in the following subsections.

3.1 Dataset Description

To diagnose AD, 715 MRIs were selected with the axial view of PD/T2-weighted
FSE/TSE MRIs in NIfTI format. These images are split into 160 AD, 343 MCI,
and 212 normal aging/CN. To classify subjects into two classes, MCI and AD
subjects are counted as the AD class, and the remaining subjects form the CN class.
Hence, 503 subjects are AD and 212 are CN. The data are taken from the ADNI
database [13, 14]. Table 3 contains information about the dataset used, such as the
age range and gender of all subjects.

Table 3 Dataset representation


Class | Subjects | Age [range] | Gender [M/F]
AD + MCI | 503 (160 + 343) | [55, 91] | 296/207
CN | 212 | [60, 90] | 107/105

Fig. 3 Preprocessing on dataset images.

3.2 Preprocessing Mechanisms

The FSL-BET [15, 16] tool removes the non-brain region from the input 256×256×104
3D MRIs. In the next step, the nii2png Python package [17] converts the 3D MRIs
into 2D slices. From this collection of 2D slices, the slices containing the hippocampal
region are selected, as the hippocampus is the first and most affected brain region
showing AD effects. Finally, these 2D slices are scaled to 227×227 dimensions. The
entire preprocessing pipeline is illustrated with an example image in Fig. 3.
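The slice-selection and rescaling steps can be sketched in numpy on a synthetic volume (the slice index range standing in for the hippocampal region and the nearest-neighbour resampling are illustrative assumptions; in practice FSL-BET and nii2png produce the actual inputs):

```python
import numpy as np

def extract_and_resize(volume, slice_range, out_size=227):
    """Select axial slices from a 3D volume and rescale each to
    out_size x out_size with nearest-neighbour sampling."""
    h, w, _ = volume.shape
    rows = np.arange(out_size) * h // out_size   # nearest source row per output row
    cols = np.arange(out_size) * w // out_size
    return np.stack([volume[:, :, k][np.ix_(rows, cols)]
                     for k in range(*slice_range)])

# Synthetic volume with the paper's input dimensions (256 x 256 x 104)
vol = np.random.rand(256, 256, 104)
# Hypothetical slice range containing the hippocampal region (illustration only)
out = extract_and_resize(vol, (40, 60))
print(out.shape)  # (20, 227, 227)
```

The stacked 2D slices in `out` then play the role of the network's training inputs.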

4 Proposed CNN Architecture

The proposed CNN model addresses early onset AD classification. The model, shown
in Fig. 4, consists of three convolutional layers. The convolution layer is an essential
component of the CNN framework, performing feature extraction through an
aggregation of linear and nonlinear operations. The first layer has 16 filters with
kernel size 4 and a softmax activation function (AF). The second layer uses 32 filters
with kernel size 5 and a LeakyReLU AF. The third layer has 64 filters with kernel
size 3, a ReLU AF, and "valid" padding. Convolutional layers have learnable
parameters in the form of filters and kernels. In the forward pass, each filter is
convolved over the entire input volume, computing the dot product between the input
and the filter values. The prominent features of these layers are extracted by the max
pooling layers that follow them. Pooling layers have no learnable parameters. A
pooling layer divides the input image into a group of non-overlapping portions, and
each sub-portion produces one output value, such as the maximum or minimum.
Accordingly, pooling layers have three commonly used variations: min pooling,
average pooling, and max pooling. Max pooling is used in the proposed architecture.
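The convolution-as-dot-product forward pass and non-overlapping max pooling described above can be illustrated with a minimal numpy sketch (single channel, stride 1, "valid" padding; the filter here is illustrative, not one of the model's learned filters):

```python
import numpy as np

def conv2d_valid(x, w):
    """Convolve filter w over input x: at each position, the dot product
    of the filter with the underlying patch (forward pass, 'valid' padding)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, size=2):
    """Split x into non-overlapping size x size blocks, keep each block's maximum."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)
w = np.ones((3, 3)) / 9.0          # simple averaging filter
feat = conv2d_valid(x, w)          # shape (4, 4): 6 - 3 + 1 = 4 per side
pooled = max_pool(feat)            # shape (2, 2)
print(feat.shape, pooled.shape)    # (4, 4) (2, 2)
```

The same shape arithmetic governs each of the three convolution/pooling stages in the proposed model.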
The first and last batches of the convolution layer and pooling layer are followed
by a dropout of 0.1. The second batch of these layers is followed by batch
normalization with epsilon = 0.2, momentum = 0.99, renorm-momentum = 0.99,
axis = −1, and the scale parameter set to False. The output of the last layer is
converted into a one-dimensional vector. This flattened layer is connected to dense
layers, called FC layers, as each input is connected to every output using learnable

Fig. 4 Proposed CNN model architecture.

weights. Activation functions and the number of nodes can be defined as parameters
in the FC layers. At the end of the model, there are four FC layers with sigmoid,
ReLU, ReLU, and sigmoid AFs, respectively. The last FC layer has the same number
of nodes as the number of classes. Due to the small network depth, no specialized
hardware is required for training the model. The model is implemented using
TensorFlow and Keras. The training and test MRI datasets are split in a ratio of 77%
and 23%, respectively. Accordingly, the training dataset consists of 550 subjects
(387 AD and 163 CN), and the test dataset consists of 165 subjects (116 AD and 49
CN).

5 Results

The parameters for the training phase are as follows: loss function = binary_crossentropy,
optimizer = stochastic gradient descent, epochs = 20, batch size = 128, and steps
per epoch = 1. The model achieves 71% average training accuracy and 71.13%
average testing accuracy.
These results of the proposed model are compared with the well-known AlexNet
model on the same input set. For the AlexNet model, the training parameters are the
Adam optimizer with a 0.001 learning rate, binary_crossentropy loss, steps per
epoch = 1, and epochs = 20. AlexNet achieves 63.69% average training accuracy
and 69.53% average testing accuracy.
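The training configuration above, binary cross-entropy loss minimized by stochastic gradient descent, can be sketched in plain numpy for a single logistic output (a didactic stand-in for the Keras compile/fit calls, on synthetic data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-7):
    """Binary cross-entropy, the loss function used to train the model."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))           # one batch of 128 feature vectors
y = (X[:, 0] > 0).astype(float)          # toy binary labels (AD vs. CN stand-in)
w = np.zeros(10)

lr = 0.01
for _ in range(20):                      # mirrors "epochs = 20, steps per epoch = 1"
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)        # gradient of BCE w.r.t. the weights
    w -= lr * grad                       # one SGD update

loss = bce(y, sigmoid(X @ w))
print(loss < bce(y, np.full_like(y, 0.5)))  # True: loss improved over chance
```

The real model replaces the single linear map with the convolutional stack, but the loss and update rule are the same.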
A comparison of the proposed CNN model and AlexNet is shown in Table 4 for the
parameters average testing accuracy, average testing loss, and the time taken by the
models. The AlexNet model has more depth and hence requires more time for the
training process. In comparison, the proposed CNN model takes less time for the
training phase as it has fewer layers. Thus, the CNN model is more time efficient
and less complex in structure than the AlexNet model. In the proposed CNN model, batch

Table 4 Comparison of proposed model and AlexNet


Metric | Proposed CNN model | AlexNet model
Accuracy | 71.13% | 69.53%
Loss | 0.6–0.7% | 4.5–5.2%
Time taken (per epoch) | t | 2t

normalization is used to accelerate training, while it is not included in AlexNet.
Additionally, as per the state of the art, data augmentation is used to enlarge datasets
and to avoid overfitting in deep models. However, for a model with less depth, data
augmentation can lead to overfitting instead of reducing it. Therefore, focusing on
the time and space trade-off, we avoided data augmentation and a deeper model.
Moreover, in the literature surveyed for AD research, most studies use T1-weighted
MRIs; the exception is [9], whose dataset is not publicly available. While T1-weighted
MRIs have their benefits, T2-weighted MRIs can distinguish normal from abnormal
tissue more easily, because they recognize abnormal fluid lesions and demonstrate
CSF better [18], and beta-amyloid plaques and tau tangles are CSF biomarkers of
AD [19]. Only after comparing different MRI modalities of the same subject can any
modality be said to be beneficial for a particular study. Hence, the paper presents a
model that uses a different input modality than other research studies.

6 Conclusion

As the output of this research work, a CNN model is proposed that can be used for
the early diagnosis of AD. While most existing research on AD diagnosis uses
AlexNet, the proposed CNN model is more time efficient than AlexNet. The model
is trained and tested on the ADNI dataset, which is used by the majority of
researchers in this area. A network that is deeper and more accurate, yet efficient in
terms of time and space, is required for early diagnosis of AD. Clinical assessment
data can be used along with image data for better results; after satisfactory results
there, pathological and genetic data can also be incorporated. As this area of research
is at an initial stage, many avenues remain open for researchers to contribute. Though
the presented model gives 71.13% accuracy on this dataset, AlexNet gives lower
accuracy on the same input. In the future, the proposed model can be improved to
increase accuracy and to make use of larger datasets.

References

1. Alzheimer's Association, 2019 Alzheimer's disease facts and figures. Alzheimer's Dementia 15(3), 321–387
(2019)
2. World Health Organization, The top 10 causes of death (2021). Accessed 15 Feb 2021. https://www.
who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
3. D. Manzak, G. Çetinel, A. Manzak, Automated classification of Alzheimer’s disease using
deep neural network (DNN) by random forest feature elimination, in 2019 14th International
Conference on Computer Science & Education (ICCSE). IEEE (2019), pp. 1050–1053
4. F. Ahmad, W. Dar, Classification of Alzheimer’s disease stages: an approach using PCA-based
algorithm, vol. 33 (2018), p. 153331751879003. https://doi.org/10.1177/1533317518790038
5. H.I. Suk, D. Shen, Deep learning-based feature representation for AD/MCI classification, in
International Conference on Medical Image Computing and Computer-Assisted Intervention
(Springer, Berlin, 2013), pp. 583–590
6. W. Lin, T. Tong, Q. Gao, D. Guo, X. Du, Y. Yang, G. Guo, M. Xiao, M. Du, X. Qu et al., Con-
volutional neural networks-based MRI image analysis for the Alzheimer’s disease prediction
from mild cognitive impairment. Front. Neurosci. 12, 777 (2018)
7. M. Maqsood, F. Nazir, U. Khan, F. Aadil, H. Jamal, I. Mehmood, O.Y. Song, Transfer learning
assisted classification and detection of Alzheimer’s disease stages using 3d MRI scans. Sensors
19(11), 2645 (2019)
8. M. Puranik, H. Shah, K. Shah, S. Bagul, Intelligent Alzheimer’s detector using deep learn-
ing, in 2018 Second International Conference on Intelligent Computing and Control Systems
(ICICCS). IEEE (2018), pp. 318–323
9. M.D. Chitradevi, P. Sathees, Analysis of brain sub regions using optimization techniques and
deep learning method in Alzheimer disease, vol. 86 (2019), p. 105857. https://doi.org/10.1016/
j.asoc.2019.105857
10. S. Afzal, M. Maqsood, F. Nazir, U. Khan, F. Aadil, K. Awan, I. Mehmood, O.Y. Song, A
data augmentation-based framework to handle class imbalance problem for Alzheimer’s stage
detection, vol. 7 (2019), pp. 1. https://doi.org/10.1109/ACCESS.2019.2932786
11. N.M. Khan, N. Abraham, M. Hon, Transfer learning with intelligent training data selection for
prediction of Alzheimer’s disease. IEEE Access 7, 72726–72735 (2019)
12. J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai
et al., Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
13. Access data and samples. Available at http://adni.loni.usc.edu/data-samples/access-data/
14. A secure online resource for sharing, visualizing, and exploring neuroscience data. Available
at https://ida.loni.usc.edu/login.jsp
15. Fmrib software library v6.0 Available at https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/
16. M. Jenkinson, C.F. Beckmann, T.E. Behrens, M.W. Woolrich, S.M. Smith, FSL. Neuroimage,
62, 782–90 (2012)
17. A.A. Laurence, NIfTI-Image-Converter (2021). Accessed 30 Jan 2021. https://alexlaurence.
github.io/NIfTI-Image-Converter/
18. MRI scans. Available at https://www.physio-pedia.com/MRI_Scans
19. T. Tapiola, I. Alafuzoff, S.K. Herukka, L. Parkkinen, P. Hartikainen, H. Soininen, T. Pirttilä,
Cerebrospinal fluid β-amyloid 42 and tau proteins as biomarkers of Alzheimer-type pathologic
changes in the brain. Archiv. Neurol. 66(3), 382–389 (2009)
A Study on Evaluating the Performance
of Robot Motion Using Gradient
Generalized Artificial Potential Fields
with Obstacles

Syed Muzamil Basha, Syed Thouheed Ahmed, and Naif K. Al-Shammari

Abstract Motion planning (MP) is a specialized version of the more general
Artificial Intelligence (AI) planning problem. The goal of a general-purpose planning
problem is to come up with a sequence of actions that accomplishes a given goal.
There are two approaches that utilize random samples: first, the Probabilistic Road
Map algorithm, which seeks to construct a road map of the free space; second, the
rapidly exploring random tree procedure, which constructs ever-evolving trees to
explore the free space and forge paths between the start and the goal. Both algorithms
have the pleasing property that they work quite well in practice, even on high-
dimensional configuration spaces. Finally, the behavior of generalized potential
fields with obstacles is discussed. The gradient of such an artificial potential field
helps to steer the robot through configuration space. A strength of these potential
field methods is that they are relatively simple to implement, and they can often
be applied directly based on sensory input. The finding of the present research
is a description of the challenges faced in robot motion planning and in controlling
robot speed using the gradient in generalized artificial potential fields with
obstacles.

Keywords Motion planning · Configuration space · Gradient · Artificial potential
fields

S. M. Basha (B) · N. K. Al-Shammari


School of Computer Science and Engineering, REVA University, Bengaluru, India
e-mail: muzamilbasha.s@reva.edu.in
N. K. Al-Shammari
e-mail: naif.alshammari@uoh.edu.sa
S. T. Ahmed
School of Computing and Information Technology, REVA University, Bengaluru, India
e-mail: syedthouheed.ahmed@reva.edu.in
N. K. Al-Shammari
Department of Mechanical Engineering, College of Engineering, University of Hail, Hail,
Saudi Arabia

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 113
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_9
114 S. M. Basha et al.

1 Introduction

The Motion Planning Problem is more specifically concerned with coming up with
plans to move a robot from one location to another while avoiding all the obstacles
in the environment. This basic approach can be applied to a wide variety of robotic
systems, from relatively simple robots that roll around on the ground to robotic
arms with multiple degrees of freedom.
In Fig. 1, the robot is constrained to move around on a grid of cells. The
limitations are that it cannot go outside the playing area or enter any of the black
grid cells; these correspond to obstacles in the real world. Here, our goal is to come up
with a sequence of steps that will take the robot from the starting location (green
cell) to the goal location (red cell). Typically, we end up with a mathematical structure
called a graph. In this context, a path is simply a sequence of consecutive edges that
lead from one node to another. There are many different paths that would solve the
above-mentioned problem.
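Such a grid can be encoded as an implicit graph in a few lines of Python (the grid layout below is illustrative, not the one in Fig. 1):

```python
# Encode a grid of cells as an implicit graph: nodes are free cells ('.'),
# edges connect 4-adjacent free cells; '#' cells act as obstacles.
GRID = [
    "....#",
    ".##.#",
    ".....",
]

def neighbors(cell):
    """Yield the free cells adjacent to `cell` — the graph's edge relation."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] == ".":
            yield (nr, nc)

print(sorted(neighbors((0, 0))))  # [(0, 1), (1, 0)]
```

The path-finding algorithms below only ever need this edge relation, never an explicit node/edge list.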

1.1 Grassfire Algorithm

In the Grassfire algorithm [1], we begin by marking the destination node with a distance
value of 0. Then, all the nodes one step away from the destination are found and
labeled 1. Next, all the nodes two steps away are labeled 2, and so on, until the
start node is encountered. For every cell in the grid, the distance value that it gets

Fig. 1 Problem with two degrees of freedom

Fig. 2 Example of the grassfire algorithm

marked with indicates the minimum number of steps it would take to go from
that point to the destination node, as shown in Fig. 2.
The red line indicates the shortest path, i.e., the list of nodes that must be visited to
reach the source node. The pseudocode of the Grassfire algorithm is presented in
Algorithm 1.

Algorithm 1: Grassfire Algorithm


Step 1: Start
Step 2: For node = 1 to n in the configuration field
2.1: The distance of the node is initialized to infinity
Step 3: Create an empty list
Step 4: Initialize the distance of the goal node to zero
4.1: add goal to list
Step 5: while list not empty
5.1: Let current=first node in list
5.2: Remove current from list
5.3: For each node n that is adjacent to current
5.3.1: if n.distance=infinity
5.3.1.1: n.distance=current.distance+1
5.3.1.2: add n to the back of the list
Step 6: Stop
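Algorithm 1 maps directly onto a breadth-first search. A runnable Python sketch on a small illustrative grid (the grid contents and goal cell are assumptions for demonstration):

```python
from collections import deque

def grassfire(grid, goal):
    """Label every free cell with its minimum number of steps to `goal`
    (BFS outward from the goal, as in Algorithm 1); unreachable cells stay None."""
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]   # None plays the role of infinity
    dist[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == "." \
                    and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

grid = ["..#.",
        ".#..",
        "...."]
d = grassfire(grid, goal=(0, 3))
print(d[2][0])  # 5 — fewest steps from cell (2, 0) to the goal
```

A cell that still holds `None` at the end is exactly the "no path exists" case the algorithm reports.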

Note that if two neighbors have the same distance value, one can be chosen
arbitrarily; this happens when the shortest path to the goal is not unique. The
grassfire algorithm has the following desirable properties. If a path exists between
the start and the destination node, it will find one with the fewest number of edges.
If no path exists, the algorithm will discover that fact and report it to the user. In this
sense, we say that the grassfire algorithm is complete. More formally, the amount of
computational effort that needs to be expended to run the grassfire algorithm on
a grid grows linearly with the number of nodes.

1.2 Dijkstra’s Algorithm

The pseudocode of Dijkstra's algorithm [2] is shown in Algorithm 2.

Algorithm 2: Dijkstra's Algorithm


Step 1: Start
Step 2: For each node n in the graph
2.1: n.distance = infinity
Step 3: Create an empty list
Step 4: start.distance=0
4.1: add start to list
Step 5: while list not empty
5.1: Let current= node in list with the smallest distance
5.2: Remove current from list
5.3: For each node n that is adjacent to current
5.3.1: if n.distance>current.distance + length of edge from n to
current
5.3.1.1: n.distance=current.distance+ length of edge from n
to current
5.3.1.2: n.parent=current
5.3.1.3: add n to the back of the list
Step 6: Stop
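A runnable Python sketch of Algorithm 2, using a heap-based priority queue in place of the "node in list with the smallest distance" scan (the example graph is illustrative):

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from `start` in a weighted graph given as
    {node: [(neighbor, edge_length), ...]}."""
    dist = {start: 0}
    parent = {}
    heap = [(0, start)]
    while heap:
        d, current = heapq.heappop(heap)
        if d > dist.get(current, float("inf")):
            continue                      # stale queue entry, already relaxed
        for n, length in graph.get(current, []):
            if dist.get(n, float("inf")) > d + length:
                dist[n] = d + length      # relax the edge, as in step 5.3.1
                parent[n] = current
                heapq.heappush(heap, (dist[n], n))
    return dist, parent

# Illustrative weighted graph (not from the paper)
graph = {"A": [("B", 1), ("C", 4)],
         "B": [("C", 2), ("D", 6)],
         "C": [("D", 3)],
         "D": []}
dist, parent = dijkstra(graph, "A")
print(dist["D"], parent["D"])  # 6 C
```

Following `parent` back from any node reconstructs the shortest path itself.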

By using a priority queue to store the list of nodes sorted by distance, the complexity
can be reduced as shown in Eq. (1),

O((|E| + |V|) log |V|) (1)

where |V| denotes the number of nodes in the graph and |E| denotes the number of
edges.

1.3 A* Algorithm

The A* Algorithm is a well-known algorithm that exemplifies this idea [3]. A* is
an example of a broader class of procedures called best-first search algorithms.
It explores a set of possibilities by using an approximate heuristic cost
function to sort the various alternatives and then inspecting the options in that order.

Heuristic functions are used to map every node in the graph to a non-negative value.
The criteria for the heuristic function are shown in Eq. (2):

H(goal) = 0
for any 2 adjacent nodes X and Y:
H(X) ≤ H(Y) + d(X, Y),
where d(X, Y) = weight/length of the edge from X to Y.

The above properties ensure that, for all nodes n,

H(n) ≤ length of shortest path from n to goal (2)

For path planning on a grid, the most commonly used heuristic functions, the Euclidean
and Manhattan distances, are expressed in Eqs. (3) and (4):

H(x_n, y_n) = √((x_n − x_g)² + (y_n − y_g)²) (3)

H(x_n, y_n) = |x_n − x_g| + |y_n − y_g| (4)
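Eqs. (3) and (4) in Python, together with a spot check of the consistency criteria of Eq. (2) for two adjacent grid cells (the goal coordinates are chosen only for illustration):

```python
import math

def euclidean(n, goal):
    """Eq. (3): straight-line distance heuristic."""
    return math.hypot(n[0] - goal[0], n[1] - goal[1])

def manhattan(n, goal):
    """Eq. (4): grid-walk distance heuristic."""
    return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

goal = (5, 5)
x, y = (0, 0), (0, 1)            # two adjacent cells; edge length d(x, y) = 1
print(euclidean(goal, goal))     # 0.0  (H(goal) = 0)
print(euclidean(x, goal) <= euclidean(y, goal) + 1)  # True (consistency)
print(manhattan(x, goal))        # 10
```

Either function can serve as H in Algorithm 3; Manhattan distance is the natural fit for a 4-connected grid.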

Algorithm 3: A* Algorithm
Step 1: Start
Step 2: For each node n in the graph
2.1: n.f = infinity
2.2: n.g = infinity
Step 3: Create an empty list
Step 4: start.g=0
4.1: start.f = H(Start)
4.2: add start to list
Step 5: while list not empty
5.1: Let current= node in list with the smallest f value
5.2: Remove current from list
Step 6: if (current == goal node) report success
Step 7: For each node n that is adjacent to current
7.1: if n.g > (current.g + cost of edge from n to current)
7.1.1: n.g = current.g + cost of edge from n to current
7.1.2: n.f = n.g + H(n)
7.1.3: n.parent = current
7.1.4: add n to the list if it isn't there already
Step 8: Stop
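A runnable Python sketch of Algorithm 3 on a 4-connected grid, using the Manhattan heuristic of Eq. (4) as H (the grid and endpoints are illustrative):

```python
import heapq

def a_star(grid, start, goal):
    """A* on a 4-connected grid: f = g + H with the Manhattan heuristic;
    returns the shortest path length, or None if no path exists."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
    rows, cols = len(grid), len(grid[0])
    g = {start: 0}
    heap = [(h(start), start)]            # entries are (f, node)
    while heap:
        f, current = heapq.heappop(heap)
        if current == goal:               # step 6: report success
            return g[current]
        r, c = current
        for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= n[0] < rows and 0 <= n[1] < cols and grid[n[0]][n[1]] == ".":
                if g.get(n, float("inf")) > g[current] + 1:   # step 7.1
                    g[n] = g[current] + 1
                    heapq.heappush(heap, (g[n] + h(n), n))
    return None

grid = ["..#.",
        ".#..",
        "...."]
print(a_star(grid, (2, 0), (0, 3)))  # 5
```

On the same grid as the grassfire sketch, A* returns the same shortest distance while typically expanding fewer nodes.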

In the present research work, we discussed planning paths for robots that move on
a grid. For grid-based problems, the Breadth-First Search or Grassfire Algorithm
searches for the shortest path by exploring outward from a start location. The A*
Algorithm is a way to speed up the search for a shortest path when an informative
heuristic is available to guide the search procedure. The objective of the present
research work is to evaluate the performance of robot motion using gradient
generalized artificial potential fields with obstacles, with the focus mainly on the
algorithms used. The methodology followed in the present research work consists
of the steps needed to understand the motion behavior of a robot.

2 Methodology

The graph-based algorithms are important, since they serve as a basis for a wide
range of path planning procedures. The notion of configuration space allows us to
think about the motion of the robot in terms of the motion of a point moving through
the configuration space while avoiding the configuration space obstacles.
In the context of configuration space, we discussed collision checking functions
that can be used to decide whether or not a given configuration would collide with
the workspace obstacles, thus providing an implicit description of the configuration
space obstacles and the complementary free space. Paths through continuous
configuration spaces are planned using methods such as the visibility graph, the
trapezoidal decomposition, and the grid-based approach. Each of these methods
represents a different way of capturing the structure of the continuous configuration
space with a discrete graph, so that standard graph-based techniques such as
Dijkstra's algorithm can be applied. To solve these path planning problems, another
important class of methods is based on the idea of random sampling, in which a
graph is built from randomly chosen samples in the configuration space, connected by
edges that represent collision-free trajectories. The Probabilistic Road Map algorithm
seeks to construct a road map or skeleton of the free space, while the rapidly
exploring random tree procedure constructs ever-evolving trees to explore the free
space and forge paths between the start and the goal state. Both algorithms have the
pleasing property that they work quite well in practice, even on high-dimensional
configuration spaces. The gradient of an artificial potential field helps to steer the
robot through configuration space. A strength of these potential field methods is that
they are relatively simple to implement, and they can often be applied directly based
on sensory input. The motion planning complexity increases with the degrees of
freedom of the system.
A Study on Evaluating the Performance of Robot Motion … 119

Fig. 3 Quantified positions of robot and obstacle in configuration space

The quantified positions of the robot can be traced with the help of a tuple
composed of two numbers, tx, ty, which denote the coordinates of a particular reference
point on the robot (red), with respect to a fixed coordinate frame of reference,
as shown in Fig. 3.
On the right-hand side of this figure, we plot the configuration space obstacle,
corresponding to the geometric obstacle shown in the left side of the figure. In this
case, the configuration space obstacle is defined by the Minkowski sum of the obstacle
and the robot shape [4]. The presence of multiple obstacles in space can be visualized
as the union of all of the configuration space obstacles. All of the geometry of the
robot and the obstacles is captured by the configuration space obstacles.
The configuration space of the robot can be represented with a tuple (tx, ty, θ),
where tx, ty still denote the position of a reference point in the plane, and θ denotes
the applied rotational angle in degrees. In this case, the configuration space has
three dimensions, and the configuration space obstacles can be thought of as three-
dimensional regions in this space. The vertical axis corresponds to the rotation θ,
while the other two horizontal axes correspond to the translational parameters tx, ty.

2.1 Visibility Graph

The visibility graph is a conceptual framework for thinking about a wide range of
motion planning problems framed on a continuous configuration space: the problem
is reformulated in terms of a graph, to which search algorithms like Grassfire,
Dijkstra, and A* can then be applied. In a visibility graph, the configuration
space obstacles are modeled as polygons [5], and a node is associated with
every configuration space obstacle vertex. The resulting problem can be readily solved
using the search algorithms discussed in the introduction section.
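Once the graph is built, path planning reduces to graph search. The sketch below is illustrative only (the node names and edge weights are invented): it runs Dijkstra's algorithm over a small visibility-style graph whose edges are collision-free straight segments weighted by their lengths.

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from `start` over a weighted adjacency dict."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already settled with a shorter path
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy visibility-style graph: nodes are obstacle vertices plus start/goal,
# edges are collision-free straight segments weighted by their lengths.
g = {
    "start": [("a", 2.0), ("b", 5.0)],
    "a": [("start", 2.0), ("goal", 4.0)],
    "b": [("start", 5.0), ("goal", 1.0)],
    "goal": [("a", 4.0), ("b", 1.0)],
}
print(dijkstra(g, "start")["goal"])  # 6.0
```

The same graph could be searched with Grassfire (unit edge weights) or A* (an added heuristic); Dijkstra is shown because the section names it as the standard technique.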
120 S. M. Basha et al.

2.2 Trapezoidal Decomposition

Another approach to path planning that is particularly effective in situations where
the configuration space obstacles can be modeled as polygons is termed cell
decomposition [6]. The goal is to divide the robot’s free space into a set of simpler
regions and then form a graph in which the nodes correspond to the regions and the
edges indicate which regions are adjacent to each other. In this case, the obstacle
vertices are sorted by their x-coordinates, and the free space is divided into regions
proceeding from left to right. As each cell in the configuration space is convex, the robot
can safely move in a straight line between any two points within a cell. Path
planning is then carried out by finding which cell contains the start location and
which contains the goal, and then planning a path through the graph.

2.3 Probabilistic Road Maps

The pseudo code of Probabilistic Road Maps (PRM) is described in Algorithm 4.

Algorithm 4: PRM Algorithm
Step 1: Start
Step 2: For each of the n desired nodes in the graph
    2.1: Generate a random point X in configuration space
Step 3: if (X is in free space)
    3.1: Find the k closest points in the roadmap to X according to the Dist function
    3.2: Connect the new random sample to each of the k neighbours using the Localplanner procedure
    3.3: Each successful connection forms a new edge in the graph
Step 4: Stop
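The construction loop of Algorithm 4 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: `sample_free` and `local_planner` are hypothetical stand-ins for the collision checker and the Localplanner procedure, and the example runs in an obstacle-free unit square so every connection succeeds.

```python
import math
import random

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def build_prm(n_samples, k, sample_free, local_planner):
    """PRM construction loop of Algorithm 4: sample, find the k nearest
    roadmap nodes, and keep every connection the local planner accepts."""
    nodes, edges = [], []
    for _ in range(n_samples):
        x = sample_free()  # a random collision-free configuration
        neighbours = sorted(nodes, key=lambda p: dist(p, x))[:k]
        for y in neighbours:
            if local_planner(x, y):
                edges.append((x, y))  # each successful connection is a new edge
        nodes.append(x)
    return nodes, edges

random.seed(0)
# Obstacle-free unit square, so every local-planner query succeeds.
nodes, edges = build_prm(
    n_samples=20, k=3,
    sample_free=lambda: (random.random(), random.random()),
    local_planner=lambda p, q: True,
)
print(len(nodes), len(edges))  # 20 54
```

With obstacles present, `sample_free` would reject colliding samples and `local_planner` would test the straight segment between two configurations for collisions.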

2.4 Dist Function

The PRM procedure relies upon a distance function, which is used to gauge the
distance between two points in configuration space. This function takes as input the
coordinates of the two points and returns a real number, as shown in Eq. (5).

Dist(x, y) ∈ ℝ    (5)

The common choices for distance functions are shown in Eq. (6):

Dist1(x, y) = Σi |xi − yi|,    Dist2(x, y) = √( Σi (xi − yi)² )    (6)
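The two metrics of Eq. (6) are the Manhattan and Euclidean distances; a minimal Python rendering (assuming, as is conventional, that Dist2 carries a square root lost in typesetting):

```python
import math

def dist1(x, y):
    """Eq. (6), first choice: Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def dist2(x, y):
    """Eq. (6), second choice: Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(dist1((0, 0), (3, 4)))  # 7
print(dist2((0, 0), (3, 4)))  # 5.0
```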

2.5 Handling Angular Displacement

There are often cases where some of the coordinates of the configuration space
correspond to angular rotations. In this situation, care must be taken to ensure that
the Dist function correctly reflects distances in the presence of wraparound, as shown
in Eq. (7):

Dist(θ1, θ2) = min( |θ1 − θ2|, 360 − |θ1 − θ2| )    (7)

where θ1 and θ2 are two angles between 0 and 360°.
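Equation (7) is a one-liner in code; the sketch below adds a modulo so it also tolerates inputs outside [0, 360):

```python
def angular_dist(theta1, theta2):
    """Eq. (7): shortest angular separation, in degrees, with wraparound."""
    d = abs(theta1 - theta2) % 360
    return min(d, 360 - d)

print(angular_dist(350, 10))  # 20
print(angular_dist(0, 180))   # 180
```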

2.6 Rapidly Exploring Random Trees

In the probabilistic roadmap procedure, the basic idea is to construct a roadmap
of the free space consisting of random samples and edges between them. Once that
has been constructed, the desired start and end points are connected to this graph and
a path is planned from one end to the other. In the first phase, a generic roadmap of the
entire free space is constructed without considering any particular pair of start and
end points. The advantage of this approach is that the roadmap can be reused over
and over again to answer multiple planning problems [7].
Here the red node depicts the new random configuration that the system generates,
while Y depicts the closest existing node in the tree. The same Localplanner
procedure is used to decide whether two points in configuration space can be linked
by a collision-free trajectory. It turns out that this procedure for generating random
samples is very effective at growing trees that explore and span the free space.

Algorithm 5: RRT Algorithm
Step 1: Start
Step 2: Add the start node to the tree
Step 3: Repeat n times
    3.1: Generate a random configuration X and check that X is in free space using the CollisionCheck function
    3.2: Find Y, the closest point in the tree to the random configuration X
    3.3: if (Dist(X, Y) > step size)      // X is too far from Y
        3.3.1: Find a configuration Z along the path from Y to X such that Dist(Z, Y) ≤ step size
        3.3.2: X = Z
    3.4: if (Localplanner(X, Y))          // check if you can get from X to Y
        3.4.1: Add X to the tree with Y as its parent
Step 4: Stop
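One iteration of Algorithm 5 can be sketched in Python as below. This is an illustrative sketch under simplifying assumptions: configurations are 2-D points, the tree is a child-to-parent dictionary, and `local_planner` is a hypothetical stand-in that accepts every connection (i.e., an obstacle-free unit square).

```python
import math
import random

def extend_rrt(tree, x_rand, step, local_planner):
    """One iteration of Algorithm 5: steer from the nearest tree node toward
    the random sample, clipping the step length, then attach the new node."""
    y = min(tree, key=lambda p: math.dist(p, x_rand))  # closest existing node
    d = math.dist(y, x_rand)
    if d > step:  # X is too far from Y: take a configuration Z along Y -> X
        x_rand = (y[0] + step * (x_rand[0] - y[0]) / d,
                  y[1] + step * (x_rand[1] - y[1]) / d)
    if local_planner(y, x_rand):
        tree[x_rand] = y  # add X to the tree with Y as its parent
    return tree

random.seed(1)
tree = {(0.0, 0.0): None}  # the start node is the root
for _ in range(50):
    sample = (random.uniform(0, 1), random.uniform(0, 1))
    extend_rrt(tree, sample, step=0.2, local_planner=lambda p, q: True)
print(len(tree))  # 51
```

The step-clipping in the middle is what makes the tree grow in small increments and span the free space rather than jumping directly to distant samples.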

Algorithm 6: Two-Tree RRT Algorithm
Step 1: Start
Step 2: while not done
    2.1: Extend Tree A by adding a new node, X
    2.2: Find Y, the closest node in Tree B to X
    2.3: if (Localplanner(X, Y))          // check if you can bridge the two trees
        2.3.1: Add an edge between X and Y; this completes a route between the root of Tree A and the root of Tree B
        2.3.2: Return this route
    2.4: else
        2.4.1: Swap Tree A and Tree B
Step 3: Stop

In Algorithm 6, the two-tree procedure is outlined in pseudocode. On every
iteration of the algorithm, the system generates a random sample and tries to grow
the current tree toward that random sample. On every round of this procedure, the
two trees are swapped, so that both trees grow toward each other at the same rate.
On the first round, a random sample is generated and the blue tree is grown; this
linking attempt does not succeed, because of an intervening obstacle. In the next
round, a new random sample is generated and the red tree is grown toward that point.
The procedure then turns around and tries to connect that point to the blue tree; when
this succeeds, a merge operation completes a path between the start and the goal
locations. In practice, the RRT algorithm is very effective at forging paths in
high-dimensional configuration spaces. Another important feature of the RRT approach
is that it can be used on systems that have dynamic constraints, which limit how
they can move.

2.7 Attractive and Repulsive Artificial Potential Fields (APF)

In this section, we discuss another approach to guiding robots through obstacle-filled
environments, based on artificial potential fields [8].
An attractive potential function, Fa(X), is constructed by considering the distance
between the current position of the robot, X = (x1, x2)ᵀ, and the desired goal
location, Xg = (x1ᵍ, x2ᵍ)ᵀ, as shown in Eq. (8):

Fa(X) = ξ ‖X − Xg‖²    (8)

Here ξ is a constant scaling parameter. Note that the function value is zero at the
goal and increases rapidly as the robot moves away.
A repulsive potential function in the plane, Fr(X), can be constructed based on a
function ρ(X) that returns the distance to the closest obstacle from a given point X
in configuration space, as shown in Eq. (9):

Fr(X) = η ( 1/ρ(X) − 1/d0 )²   if ρ(X) ≤ d0
Fr(X) = 0                      if ρ(X) > d0    (9)

Here η is a constant scaling parameter and d0 is a parameter that controls the
influence of the repulsive potential.
Gradient control strategy: while the robot position is not close enough to the goal,
select the direction of the robot velocity based on the gradient of the artificial
potential field, as shown in Eq. (10):

v ∝ −∇f(X) = −( ∂f(x)/∂x1 , ∂f(x)/∂x2 )ᵀ    (10)

Here v is an appropriate robot speed.
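The gradient control strategy of Eq. (10) can be sketched numerically. The example below is illustrative only: it uses a central-difference gradient and a purely attractive quadratic field (Eq. 8 with ξ = 1 and an invented goal at (1, 2)), so the descent converges to the goal; a full controller would add the repulsive term of Eq. (9).

```python
def grad_descend(f, x, alpha=0.01, steps=200, h=1e-6):
    """Follow -grad f (Eq. 10) using a central-difference numerical gradient."""
    for _ in range(steps):
        g = [(f(x[:i] + [x[i] + h] + x[i + 1:]) -
              f(x[:i] + [x[i] - h] + x[i + 1:])) / (2 * h)
             for i in range(len(x))]
        x = [xi - alpha * gi for xi, gi in zip(x, g)]  # v proportional to -grad f
    return x

# Purely attractive field Fa(X) = ||X - Xg||^2 (Eq. 8 with xi = 1),
# goal Xg = (1, 2); the gradient flow should converge to the goal.
goal = (1.0, 2.0)
fa = lambda x: (x[0] - goal[0]) ** 2 + (x[1] - goal[1]) ** 2
x = grad_descend(fa, [0.0, 0.0], alpha=0.1, steps=100)
print([round(v, 3) for v in x])  # [1.0, 2.0]
```

The numerical gradient is what makes such schemes easy to drive from sensory input: only function values (distances) are needed, not an analytic model of the field.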

3 Result and Discussion

A really attractive feature of these artificial potential field based schemes is that they
are relatively simple to implement. In fact, they can be incorporated into real-time control
schemes running at tens of hertz using local sensor data. However, a downside of these
methods is that it can be very difficult to ensure that they will always work. Ideally,
our artificial potential function would have only a single global minimum, located at
the desired configuration. In practice, there are situations where the attractive and
repulsive forces conspire to produce local minima at locations other than the desired
location. It turns out that these kinds of local minima are very hard to eliminate. One
way to view these artificial potential field based schemes is as a useful heuristic.
In many cases, they will successfully guide the robot to the desired configuration,
but they can get stuck in dead ends, in which case it is often necessary to use a
backtracking procedure to detect these situations and to switch to a different planning
strategy such as generalized potential fields (GPF) [9], in which a control law moves
the robot by considering the gradient of the potential field with respect to the
configuration space parameters, as shown in Eq. (11).
v ∝ −∇f(X) = −( ∂f(x)/∂x1 , … , ∂f(x)/∂xn )ᵀ    (11)

Artificial potential fields are generalized to more complicated robotic systems,
which can involve many degrees of freedom, by considering a set of control
points distributed over the surface of the robot. Imagine that each of these control
points is outfitted with a proximity sensor, which can be used to detect the range to
nearby obstacles. The artificial potential field uses this information to push each of
these control points away from obstacles while guiding them toward their desired
goals. The effect of all of these pushes and pulls is aggregated in the artificial
potential field, and the gradient information is used to decide how to move locally.

4 Conclusion

The graph-based algorithms are important, since they serve as a basis for a wide range
of path planning procedures. The notion of configuration space allowed us to think
about the motion of the robot in terms of the motion of a point moving through the
configuration space while avoiding the configuration space obstacles. In the context
of configuration space, we discussed collision checking functions that can be
used to decide whether or not a given configuration would collide with the workspace
obstacles, thus providing an implicit description of the configuration space obstacles
and the complementary free space. Paths through continuous configuration
spaces are planned using methods like the visibility graph, the trapezoidal decomposition,
and the grid-based approach. Each of these methods represents a different way of capturing
the structure of the continuous configuration space with a discrete graph so that standard
graph-based techniques like Dijkstra’s algorithm can be applied to solve these
path planning problems. Another important class of methods is based on the idea of
random sampling, in which a graph is built from randomly chosen samples in the
configuration space, connected by edges that represent collision-free trajectories. Two
approaches for utilizing random samples were described: the Probabilistic Road
Map algorithm, which seeks to construct a road map or skeleton of the free space,
and the rapidly exploring random tree procedure, which constructs ever-evolving
trees to explore the free space and forge paths between the start and the goal. Both
algorithms have the pleasing property that they work quite well in practice, even in
high-dimensional configuration spaces. Artificial potential fields help to steer the robot
through configuration space by considering the gradient of the potential field. A strength
of these potential field methods is that they are relatively simple to implement, and
they can often be carried out directly based on sensory input.

References

1. D. Sutherland, J.J. Sharples, K.A. Moinuddin, The effect of ignition protocol on grassfire
development. Int. J. Wildland Fire 29(1), 70–80 (2020)
2. A. Bozyiğit, G. Alankuş, E. Nasiboğlu, Public transport route planning: Modified dijkstra’s
algorithm, in 2017 International Conference on Computer Science and Engineering (UBMK)
(IEEE, 2017), pp. 502–505, 5 Oct 2017
3. H. Wang, Computer and cyber security: Principles, algorithm, applications, and perspectives.
https://doi.org/10.1201/9780429424878
4. N. Eckenstein, M. Yim, Modular robot connector area of acceptance from configuration space
obstacles, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
(IEEE, 2017), pp. 3550–3555, 24 Sep 2017
5. E. Taheri, M.H. Ferdowsi, M. Danesh, Fuzzy greedy RRT path planning algorithm in a complex
configuration space. Int. J. Control Autom. Syst. 16(6), 3026–3035 (2018)
6. G. Bitar, A.B. Martinsen, A.M. Lekkas, M. Breivik, Two-stage optimized trajectory planning
for ASVs under polygonal obstacle constraints: Theory and experiments. IEEE Access 2(8),
199953–199969 (2020)
7. J. Denny, R. Sandström, A. Bregger, N.M. Amato, Dynamic region-biased rapidly-exploring
random trees, in Algorithmic Foundations of Robotics XII 2020 (Springer, Cham, 2020), pp. 640–
655
8. F. Bounini, D. Gingras, H. Pollart, D. Gruyer, Modified artificial potential field method for
online path planning applications, in 2017 IEEE Intelligent Vehicles Symposium (IV) (IEEE,
2017), pp. 180–185, 11 Jun 2017
9. N.K. Al-Shammari, T.H. Syed, M.B. Syed, An edge–IoT framework and prototype based on
blockchain for smart healthcare applications. Eng. Technol. Appl. Sci. Res. 11(4), 7326–7331
(2021)
Exploratory Analysis of Kidney Disease
Data Set—A Comparative Study

Aniket Muley and Sagar Joshi

Abstract In this study, an applied and comparative approach using decision tree
and random forest methods is proposed. Secondary data associated with
kidney-related disease was employed. The study shows that random forest is more efficient
than decision trees. It is expected that, in the future, random forest
methods will be fruitful in classification tasks related to kidney patients.

Keywords Data mining · Chronic kidney disease · Decision tree · Random forest

1 Introduction

Nowadays, the number of kidney disease patients in India is large. The functioning of
the kidney and the causes associated with its disease are observed on a large scale in
the nearby area. The main interest is to identify the major parameters that can cause
chronic pain to patients. Therefore, developing a model to identify the disease can help
people take precautions by identifying symptoms and overcome the chronic disease at
an earlier stage.
Various researchers have worked on this problem from different perspectives. Some of
the reviews are: Ahmed et al. [1] discovered that chronic kidney disease (CKD) is
generally found in South Asian and black-skinned people as compared to the general
population. They observed that diabetic people in South Asia have the
maximum risk. Apart from that, other issues, viz. blood pressure, heart problems, and a
family history of the same disease, are more frequently observed in the age
group of 60 and above. Data mining can be defined as a process of digging out
previously unidentified, compelling, and actionable information from big data and then
using the information so derived to support vital tasks in industry and strategic judgment
[2].

A. Muley (B)
School of Mathematical Sciences, Swami Ramanand Teerth Marathwada University, Nanded,
M.S. 431606, India
S. Joshi
Nutan Maharashtra Institute of Engineering and Technology, Talegaon Dabhade, M.S. 410507b,
India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 127
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_10
128 A. Muley and S. Joshi

Priyadharshini et al. [3] used K-nearest neighbor (KNN) and logistic regression
models to diagnose CKD. One aim of their study was to estimate the missing
values in the data set. They employed six machine learning algorithms,
viz. logistic regression, random forest, support vector machine, KNN, naive
Bayes classifier, and neural networks, for the proposed models. Selvarathi et al.
[4] aimed to detect and diagnose CKDs, which mainly include kidney stones, cystic
kidney disease, and suspected renal carcinoma. Histogram of oriented gradient features
and the KNN algorithm were used to identify chronic kidney diseases. A
multi-layered convolutional neural network (CNN) architecture was applied for kidney
disease classification. Further, a batch prediction approach was tested for CKD
forecasting. [5–11] used an open-source data set which contained numerous health-
related characteristics associated with fitness and results of the analytical investigation.
They analyzed it with the help of a prediction model for creatinine and the
risk of CKD. Further, various machine learning algorithms were applied and evaluated
for model accuracy.
Rezapour [12] applied supervised data mining algorithms for the classification of
variables, helping to identify the influential factors. The decision tree algorithm was
the main focus and was applied to the input data. The study results reveal that the risk of
stroke in patients whose vascular access surgery was performed by catheterization
before fistula was 84.21%. Padmanaban and Parthiban [13] aimed at the detection of
patients through classification algorithms. Naïve Bayes and decision tree methods
were applied to obtain accurate results and give better performance in measuring
parameters and sensitivity. Chetty et al. [14] proposed classification models for the
prediction and categorization of CKD and non-CKD cases with high accuracy.
Comparative analysis was performed among the selected classification models. Sinha
and Sinha [15] performed a comparative study of CKD cases and classified
them with support vector machine (SVM) and KNN techniques, comparing accuracy,
precision, and computing time on the selected data set.
Chaurasia et al. [5], Pasadana et al. [8], and Senan et al. [10] forecasted kidney disease
issues by performing pre-processing operations on variables and discovering missing
values. Comparative analysis was performed among the selected models, and predictive
analytics models were assessed on the basis of forecasting accuracy [16]. Gharibdousti
et al. [17] applied several machine learning-based classification algorithms on
400 observations and 24 attributes. The data set was preprocessed; missing values were
filled with means for numerical features and with modes for categorical features;
further, the data set was normalized to have a unit scale for all data. Algorithms,
viz. decision tree, linear regression, SVM, naive Bayes, and neural network, were
compared on the basis of correlation matrix features. Tazin et al. [18] predicted the
presence of kidney-related issues through machine learning algorithms on a data set
collected from the UCI repository. These algorithms were analyzed on the basis of their
model precision measures and receiver operating characteristic curves with the Weka data
mining tool [19].
The major objectives of our study are: (1) to study and compare the best classification
technique for the CKD data set among decision tree and random forest, and (2) to identify
the most responsible factors that cause a person to suffer from CKD.
Exploratory Analysis of Kidney Disease Data … 129

2 Methodology

In this study, secondary data associated with CKD was taken from the Kaggle website
[20–22]. The parameters used are gathered in Table 1, and their respective features are
explored in Table 2. Further, data mining techniques and a neural network are implemented
on the data set. Microsoft Excel (2016) and the free and open-source R software (3.2.2)
with the rattle package are used to analyze the data.
Here, missing values are replaced by the arithmetic mean or mode of the respective
attributes. For further analysis, we have considered the mean value for missing observations,
and only numerical attributes have been used. Classification is an important issue
in the decision-making process; appropriate classification gives accuracy for multi-
stage decision-making processes. Here, decision tree and random forest methods
were used for the classification of the data. In a decision tree, the main part is the
root node, which divides the data into two or more homogeneous parts and splits the
data into sub-nodes. A decision node splits into subsequent nodes, and where the
splitting finally stops we have a leaf node.
Random forest is a famous ensemble method used to develop predictive models
for classification as well as regression problems. It reduces correlation
issues by selecting subsamples of the features at every split, which enhances
the classification prediction results. Further, both decision tree and random forest
results were compared by their area under the curve (AUC) for the training set, test set,
validation set, and the whole data set. A model with a consistently high AUC gives a
more precise classification.
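The workflow just described (mean imputation, then a decision tree versus random forest comparison on AUC) can be illustrated in Python with scikit-learn. Note that the paper itself used R with the rattle package, and the data below is a synthetic stand-in for the Kaggle CKD set, so the numbers printed are not the paper's results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 400-row, 24-attribute CKD data set.
X, y = make_classification(n_samples=400, n_features=24, random_state=0)

# Knock out ~5% of the entries, then mean-impute them as in the methodology.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan
X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc, 3))
```

On most splits the random forest's AUC matches or exceeds the single tree's, which is the pattern the study reports on the real CKD data.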

Table 1 Study parameters [20–22]

Age: Age | WC: White blood cell count
BP: Blood pressure | RC: Red blood cell count
SG: Specific gravity | HTN: Hypertension
AL: Albumin | DM: Diabetes mellitus
SU: Sugar | CAD: Coronary artery disease
RBC: Red blood cells | APPET: Appetite
PC: Pus cell | PE: Pedal edema
PCC: Pus cell clumps | ANE: Anemia
BA: Bacteria | HEMO: Hemoglobin
BGR: Blood glucose random | PCV: Packed cell volume
BU: Blood urea | Class: Class
SC: Serum creatinine | POT: Potassium
SOD: Sodium |

Numerical attributes: age, bp, bgr, bu, sc, sod, pot, hemo, pcv, wc, rc
Nominal attributes: sg, al, su, rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane, class

Table 2 Features of the parameters [20–22]

Data set : Multivariate
Attribute : Real
Associated tasks : Classification
Total instances : 400
Total attributes : 25
Non-available values : Yes
Area : N/A
Date donated : 2015-07-03
Number of web hits : 120,955

3 Result and Discussion

Here, we dealt with kidney disease data by decision tree and random forest
exploratory models in data mining. Initially, we have used 5 ways to find missing
observations of collected data. Further, it has been compared with others and
represented in Tables 3 and 4. The results are as follows:

3.1 Decision Tree

Table 3 shows that the decision tree with mean imputation gives the minimum error
for numerical attributes, so we choose it as the best. The results obtained from the
decision tree with mean imputation are as follows:

Table 3 Missing observation errors obtained through different ways for decision tree
Test Mean Mode Nominal R nominal R numerical
Validation 0 1.6 5 6.6 1.6
Training 3.6 2.9 2.5 3.6 2.1
Testing 8.3 8.4 6.7 6.6 8.3
Full 3.7 3.5 3.6 4.5 3

Table 4 Missing observation errors obtained through different ways for random forest
Mean Mode Nominal R nominal R numerical
Validation 0 0 1.6 1.6 1.6
Training 0 0 0 0.7 0
Testing 5 3.4 5 5 6.6
Full 0.7 0.5 1 1.5 1.3

Fig. 1 Decision tree

Figure 1 represents the decision tree, and it illustrates that hemoglobin is the
most responsible factor for causing CKD. When the hemoglobin level in blood is
observed to be less than 13, there is a 56% chance of CKD. Of the remaining 44% of
the patients, the 3% having blood glucose random (bgr) content ≥157 may suffer
from CKD; in the remaining 41% of the samples, those with serum creatinine (sc)
level ≥1.3 may suffer from CKD, and the other 37% do not have the CKD problem.

3.2 Random Forest

Table 4 compares the errors obtained through the various imputation approaches
in order to identify the most suitable way to get accurate results.
Figure 2 shows the variables and their relative importance in causing CKD. It
identifies the important variables that play a key role in the classification of
patients. Hemoglobin, bgr, sc, pcv, sod, and rc are the most important factors
causing CKD.
Figure 3 explores the error rates obtained through the random forest method for
CKD, non-CKD, and out-of-bag (OOB) bootstrap sample. With the CKD sample, it

Fig. 2 Variable importance (mean decrease in accuracy and mean decrease in Gini for the CKD and not-CKD classes)


Fig. 3 Error rates (OOB, CKD, and not-CKD classification error versus the number of trees)



Fig. 4 OOB ROC curve for random forest

gives the least error compared to the others, and it progressively helps to determine
the optimal number of trees. It clearly shows that 500 trees were generated. Figure 4
represents the OOB ROC curve for random forest. From this curve, we get the AUC
value for random forest, which is 0.992; hence, the accuracy of random forest is 99.2%.
This ROC plot is based on OOB predictions for each observation in the training data.
Table 5 represents the confusion matrix, the overall error during evaluation,
and the values obtained through receiver operating characteristic (ROC) curves for
the decision tree and random forest methodologies. We obtain the corresponding area
under the curve (AUC) values for the validation, training, and testing sets as well as
the full data. The results reveal that random forest gives the better result, with an
accuracy value of 100% for almost all data splits as well as the overall data. Also, its
overall error is less than that of the decision tree model. Hence, random forest gives
the most precise result, with an overall accuracy of 1, i.e., 100%.

4 Conclusions

In our study, we observed that hemoglobin, blood glucose random, serum creatinine,
packed cell volume, sodium, and red blood cell count were the attributes most
responsible for causing CKD. This can be useful in predicting the presence of CKD for
a patient, and it will help to reduce the number of other medical tests of blood

Table 5 Comparative results of decision tree and random forest


Method Data set Confusion matrix AUC
Actual Predicted Overall error
CKD Not-CKD
Decision tree Train CKD 171 8 5.3% 0.94
Not-CKD 7 94
Validate CKD 34 1 1.6% 0.99
Not-CKD 0 25
Test CKD 34 2 6.6% 0.93
Not-CKD 2 22
Full CKD 239 11 5% 0.95
Not-CKD 9 141
Random forest Train CKD 179 0 0% 1.00
Not-CKD 0 101
Validate CKD 35 0 0% 1.00
Not-CKD 0 25
Test CKD 35 1 3.4% 0.99
Not-CKD 1 23
Full CKD 249 1 0.5% 1.00

content. For the full data, the decision tree gives 97% accuracy and random forest
gives 100% accuracy. Random forest gives more accurate results than the decision tree,
with an OOB AUC of 99.2%. The results were compared with [5, 7–10, 16], and the most
suitable method is observed to be random forest. Most researchers used Weka as the
analytical tool for their studies, but we have preferred R software, which gives more
accuracy in classifying our results. The exploration of the results through visualization
with R software is a special feature that helps in understanding the analysis of the
data. In a nutshell, random forest gives us more accurate results; hence, in the future,
this model can be used for the classification of data sets of
chronic and non-chronic disease patients.

References

1. M. Ahmad, V. Tundjungsari, D. Widianti, P. Amalia, U.A. Rachmawati, Diagnostic decision
support system of chronic kidney disease using support vector machine, in 2017 Second
International Conference on Informatics and Computing (ICIC) (IEEE, 2017), pp. 1–4
2. M.H. Dunham, Data mining: Introductory and Advanced Topics. (Pearson Education India,
2006)
3. C. Priyadharshini, K. Sanjeev, M. Vignesh, N. Saravanan, M. Somu, KNN based detection and
diagnosis of chronic kidney disease. Ann. Rom. Soc. Cell Biol. 2870–2877 (2021)

4. C. Selvarathi, P. Devipriya, R. Indumathi, K. Kavipriya, A survey on detection and classification
of chronic kidney disease with a machine learning algorithm. Ann. Rom. Soc. Cell Biol.
3757–3769 (2021)
5. V. Chaurasia, S. Pal, B.B. Tiwari, Chronic kidney disease: A predictive model using decision
tree. Int. J. Eng. Res. Technol. 11(11), 1781–1794 (2018)
6. V. Kunwar, K. Chandel, A. S. Sabitha, A. Bansal, Chronic Kidney Disease analysis using data
Deepfakes for Video Conferencing Using
General Adversarial Networks (GANs)
and Multilingual Voice Cloning

Jayesh Shelar, Dipak Ghatole, Mayank Pachpande, Dhanashree Bhandari, and S. V. Shinde

Abstract The Covid-19 pandemic led to remote working and hence to more video
conferences across all sectors. Even important international conferences between
different nations are being conducted on online video conferencing platforms.
Hence, a methodology capable of performing real-time end-to-end speech translation
has become a necessity. In this paper, we propose a complete pipeline methodology
in which real-time video conferencing becomes interactive; it can also be used in
the educational sector for generating videos of instructors from just their images
and textual notes. We use automatic voice translation (AVT), text-to-stream machine
translation (MT), and a text-to-voice generator for voice cloning and translation
in real time. For video generation, we use generative adversarial networks (GANs),
encoder-decoders, and various other previously implemented generative models. The
proposed methodology has been implemented and tested on raw data and is quite
effective for the specified application.

Keywords General adversarial networks · SyncNet · Speech-to-speech translation · Voice cloning · Automatic voice translation

1 Introduction

Eventually, Covid-19 shifted us toward an online (remote) working mode, with a
drastic rise in remote meetings and educational events. There may be many language
barriers in these processes, so removing these barriers is also a key task. For
example, suppose a remote interaction is happening between two employees from
different countries. They need translators to understand each other's language,
which is a time-consuming process. A crucial step toward removing these translators
is translating the speaker's sentences into the listener's language and then
correcting the lipsync in the translated video.
The first attempts at creating a talking face with desired audio relied heavily on
lip landmarks for the speech representation of a single speaker, which took hours to

J. Shelar (B) · D. Ghatole · M. Pachpande · D. Bhandari · S. V. Shinde
Department of Computer Engineering, Pimpri Chinchwad College of Engineering, Nigdi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 137
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_11

train. These works are ideal for a single speaker, but for our problem statement
we need a more generic model that produces real-time lipsynced videos.
Exploring such speaker-independent approaches led us to the SyncNet discriminator
network, which performs quite well at syncing lips with audio [1]. This lipsync
expert was previously trained on various videos of news anchors, which is useful
for error correction while training our deepfakes generator network [1].
Generally, a sequential model is used for speech-to-text conversion, but since such
models are computationally intensive, we attempt to use deep convolutional neural
networks based on a spectrogram of the provided audio data. Recurrent neural networks
(RNNs) take a long time to train, whereas DCNNs are usually very fast and require
less computation power and time [2]. An alternative neural text-to-speech (TTS)
system based only on convolutional neural networks (CNNs) alleviates these economic
costs of training. Due to properties such as high parallelizability, the CNN-based
TTS model performs much better than RNN or gated-unit models [2].
We also concentrate on the recovery of the original signal from the noise introduced
by the system in the subsequent downsampling and upsampling process; the recovery
from the 15% downsampled signal is about 50% of the original audio.
Initially, the main requirement for generating spectrograms is the available data
and related datasets, so the first step is preprocessing. Generating natural speech
from text (TTS) is a computationally heavy task, even with powerful machines to
process the incoming data, owing to the complex time-series nature of the data [3].
Training the spectrogram-generating network is difficult because none of the
algorithms guarantees a globally optimal solution at low computational complexity.
By combining various models, we obtain better audio results.
Keeping in mind the future scope of this idea, these limitations can be mitigated
by working on a model of a multi-speaker acoustic space [4]. This enables the
generation of speakers' voices that have never been heard during the training
process. A reduction in the number of training samples can also significantly save
computing resources for other purposes [3]. Training the network with few samples
can therefore be helpful, but this is a tradeoff that comes at the cost of the
accuracy of the output generated by the model.

2 Literature Survey

A lot of work has been done in this area since the introduction of generative
adversarial networks, but before that, some successful experiments were also
carried out using encoder-decoder networks.

2.1 Generator Networks

One of the most interesting approaches to this problem is the Speech2Vid model [5],
which successfully generates a talking face from just an audio input and an image.
It consists of two encoders, namely an audio encoder and a face encoder, with an
image decoder and a deblurring module that refines the generated output [5].
The deblurring CNN module works similarly to those used for medical images, as
suggested in [6, 7]. The model uses skip connections in the network to preserve
identity. Since there is no discriminator network to aid in improving the encoders,
the learning of the model is not robust.
After the Speech2Vid model [5], many other approaches were made to accurately
lipsync videos, but many of them were trained on a limited set of vocabulary
and identities. This restricts their use in real-time applications such as the
live video conferencing we are discussing.
The introduction of generative adversarial networks (GANs) gave a new direction to
this research area, leading to one of the most effective models, Wav2Lip [8].
The advantage of having a SyncNet discriminator is that it helps the model in the
learning process, which is a big difference from the earlier Speech2Vid model [5].
Even when compared with the most effective models, Wav2Lip gave far better results
on standard benchmark datasets such as LRW, LRS2, and LRS3.

2.2 Discriminator Networks

The discriminator works as a classifier in a GAN setup; it is used to distinguish
real data from the data created by the generator.
Supasorn et al. [9] focused on the different stages of learning to create a
deepfake of Barack Obama: audio of Obama is passed to the model, and the model
creates a talking video of him. They trained their model for hours on videos of
Obama's weekly address footage. However, when it comes to making a deepfake of
another person, the approach falls short because the model needs to be trained
again for hours on the new person's videos and images. Recently [10], Prajwal K R
and Rudrabha Mukhopadhyay used a discriminator in a GAN setup, resulting in 56%
accuracy on the LRS dataset; they used a single frame for checking the lipsync
with the respective input audio.
Joon Son Chung and Andrew Zisserman [5] created their own dataset for checking
lip synchronization; it consists of several hundred hours of speech from the BBC
news channel. The SyncNet architecture takes 0.2-s clips of both the audio and
video and divides them into two streams, i.e., a video stream and an audio stream.
Both streams were trained simultaneously, which gave 92% accuracy on lipsync
error detection.
The most recent work on generating deepfakes used SyncNet for training the
generator network. This approach generates impressive deepfakes because the
previously well-trained SyncNet network helps the generator detect lipsync errors
more efficiently [8].

2.3 Translation

Google's Translatotron gives direct speech-to-speech translation using sequential
models [11]. Mapping from language La to Lb is done by training the neural network
to produce target spectrograms from the audio spectrograms of language La.

2.4 Generating Spectrogram Network (Voice Cloning)

Following our analysis, we discovered that good results can be achieved using an
end-to-end text-to-speech method called Tacotron [12], which directly estimates a
spectrogram from input text. In a text-to-speech system, mel spectrograms need to
be converted into audible representations. Following [13], we combine two kinds of
networks: Text2Mel, which creates a mel spectrogram from text input, and the
spectrogram super-resolution network (SSRN), which creates a full short-term
Fourier transform spectrogram from the mel spectrogram.
Sharma et al. [14] proposed a fast Griffin-Lim algorithm (FGLA) approach, which
uses a vocoder in the speech synthesis phase. The FGLA approach was tested on the
LJSpeech, Blizzard, and Tatoeba datasets, with performance evaluated on synthesis
delay and speech quality.

3 Proposed System

We propose an approach to generate end-to-end speech translation with voice
cloning and create real-time deepfakes, as shown in Fig. 1. This can be viewed
as a pipeline consisting of different modules, mainly the generator, the
discriminator, and the voice cloning module.

3.1 Generator

The generator is used to generate deepfakes of the user and has two main
components: an audio encoder and an identity encoder. Sometimes a deblurring
module can also be used to clean the generated video frames.

Fig. 1 Our approach to real-time deepfake generation with accurate lipsync and
multilingual voice cloning. The proposed system takes Person P1's image and a
sentence in language La. It extracts text from the given audio, followed by a
language translator that translates the text from language La into language Lb.
The translated sentence is then cloned into Person P1's voice using super-resolution
spectrograms. The speech-to-speech output is given to the deepfake generation module
along with Person P1's image/video, and a realistic real-time talking video of
Person P1 is created using a generator trained with the SyncNet model [1], which
was itself trained for several hours on recorded videos from the BBC news channel;
the same holds in the other direction for Person P2

3.1.1 Audio Encoder

A standard convolutional neural network (CNN) takes a mel-frequency cepstral
coefficient (MFCC) heatmap of size M × T × 1 and creates an audio embedding of
size h [8]. The encoder mostly uses deep convolutional neural networks (DCNNs)
implemented earlier, like AlexNet and ResNet [5]. The problem with a plain
feedforward neural network is that high-resolution images are not easy to train on.
AlexNet consists of 8 layers (5 convolution layers + 3 fully connected layers).
Features of AlexNet such as ReLU activations and overlapping pooling layers make
the training more robust and help avoid overfitting.

3.1.2 Identity Encoder

A VGG-M network with 112 × 112 × 3 inputs is used to extract features from images;
the encoded features can also be used for dimensionality reduction. As specified
in [5], an image deblurring module is integrated into the pipeline. It is a CNN
inspired by the very deep super-resolution network (VDSR), containing 20 weighted
layers, and is considered better than the super-resolution convolutional neural
network (SRCNN).

3.1.3 Face Encoder

The face encoder input now consists of pose information concatenated channel-wise
with the ground truth [8]. The final output from the face encoder is passed to
the discriminator network, which sends the reconstruction loss back as feedback
to the generator network; the generator then constructs the new facial frames of
dimensions H × H × 6.

3.2 Discriminator

After the research done by K R Prajwal and Rudrabha Mukhopadhyay [8], LipGAN's
discriminator is only 52% accurate at detecting lipsync errors. There are mainly
two reasons that affect the accuracy of LipGAN. First, LipGAN takes only a single
frame to check whether the video is synced with the audio, but temporal context
across frames is beneficial when detecting lipsync errors. Second, the
discriminator in LipGAN focuses on many visual artifacts [8], so training the
discriminator on these noisy images affects the accuracy of lipsync error detection.
The SyncNet model takes two different input streams: a video stream V of Tvid
consecutive face frames and a speech segment S of size Taud × D over the
corresponding time steps. Both the video and audio streams train the discriminator
to distinguish audio that is synced with the video frames. It takes a random sample
from an audio stream of size Taud × D and checks whether that stream is aligned to
the video stream with respect to different time steps.
The main objective of training is that the outputs for genuine (synced) audio-video
pairs should be close, while those for false pairs should be far apart. The
Euclidean distance between the video and audio streams is used to compute the
contrastive loss proposed for siamese networks [15]. Equation (1) shows the
contrastive loss calculated using the Euclidean distance of Eq. (2); this distance
metric is then used to improve the generator network's results, as discussed earlier.

E = (1 / 2N) Σ_{n=1}^{N} [ y_n d_n^2 + (1 − y_n) max(margin − d_n, 0)^2 ]        (1)

d_n = ||v_n − a_n||_2        (2)
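Equations (1) and (2) translate directly into numpy; the margin of 1.0 and the toy two-dimensional embeddings below are illustrative values, not parameters from the paper:

```python
import numpy as np

def contrastive_loss(v, a, y, margin=1.0):
    """Eq. (1): y_n = 1 for genuine (synced) pairs, 0 for false pairs.
    v, a: (N, d) arrays of video and audio embeddings."""
    d = np.linalg.norm(v - a, axis=1)                     # Eq. (2)
    genuine = y * d ** 2
    impostor = (1 - y) * np.maximum(margin - d, 0.0) ** 2
    return (genuine + impostor).sum() / (2 * len(y))

v = np.array([[0.0, 0.0], [0.5, 0.0]])
a = np.array([[0.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 0.0])   # first pair genuine, second pair false
print(contrastive_loss(v, a, y))  # 0.0625
```

The genuine pair (distance 0) contributes nothing, while the false pair at distance 0.5 is penalized for being closer than the margin.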

In Fig. 2, based on the contrastive loss between the video and audio streams, the
model decides whether a pair is valid or not.

Fig. 2 Obtaining valid and invalid audio-video pairs

If the distance between the streams is small, the loss is computed and helps the
model learn better; if the streams are far apart, the pair is simply treated as
false and ignored [1].
Training details for the lipsync discriminator [1]: 29 h (optimizer = Adam, batch
size = 64, Tv = 5) with an initial learning rate of 1e-3.

3.3 Generating Spectrogram Network

Text2Mel is one of the best techniques to generate spectrograms from text input,
which can in turn be converted into audio.

3.3.1 Text2Mel

In this network, the four core components of the architecture are the text encoder,
audio encoder, guided attention mechanism, and audio decoder. It synthesizes coarse
spectrograms from the input text.
• TextEnc: takes text as input, I = {i_1, …, i_N} ∈ Char^N, where N is the number
of characters, and produces two matrices K (key), V (value) ∈ R^(d × N). Therefore,
(K, V) = TextEnc(I).
• AudioEnc: encodes the coarse mel spectrogram, giving Q = AudioEnc(S_{1:F,1:T}),
where T is the length of the speech.
• Attention: the attention matrix A = softmax_{n-axis}(K^T Q / √d) evaluates how
strongly the n-th character i_n and the t-th mel spectrum are related.
• AudioDec: estimates the mel spectrogram from the seed matrix R′ = [R, Q], giving
Y_{1:F,2:T+1} = AudioDec(R′). The binary divergence D_bin can then be calculated
as shown in Eq. (3):
D_bin(Y|S) := E_ft[ −S_ft log(Y_ft / S_ft) − (1 − S_ft) log((1 − Y_ft) / (1 − S_ft)) ]        (3)

3.4 Spectrogram Super-Resolution Network

The purpose of the spectrogram super-resolution network (SSRN) is to turn the
coarse mel spectrograms produced by Text2Mel into full-resolution spectrograms;
its training data are prepared through a batch sampler of downsampling and
upsampling, which affects the quality of the training data. Unlike recurrent
approaches, our model is purely feedforward. In the second step, the SSRN model
is trained on pairs of low- and high-quality (coarse mel) spectrograms.
In the audio super-resolution using neural nets paper [13], generating the mel
spectrogram with a spectrogram super-resolution network (SSRN) creates coarse
spectrograms that are sharper than the original waveform representation.
Upsampling in time is achieved by quadrupling the length of the sequence from
T to 4T = T′ by applying two "deconvolution" layers of stride two.
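The T → 4T length quadrupling can be sketched as two stacked stride-2 "deconvolutions". The toy length-2 kernel below (all ones) reduces to nearest-neighbor repetition and is purely illustrative; a real SSRN learns these kernels:

```python
import numpy as np

def deconv_x2(x, w):
    """Toy 1-D 'deconvolution' with stride 2: each input step produces two
    output steps (w has shape (2,)), doubling the sequence length."""
    T = x.shape[-1]
    out = np.zeros(x.shape[:-1] + (2 * T,))
    out[..., 0::2] = w[0] * x
    out[..., 1::2] = w[1] * x
    return out

def ssrn_upsample(mel, w1=np.array([1.0, 1.0]), w2=np.array([1.0, 1.0])):
    """Quadruple the time axis: T -> 2T -> 4T, as two stacked deconvolutions."""
    return deconv_x2(deconv_x2(mel, w1), w2)

mel = np.arange(12.0).reshape(3, 4)        # (F, T) coarse mel, illustrative
up = ssrn_upsample(mel)
print(up.shape)  # (3, 16)
```

With the all-ones kernels each coarse frame is simply repeated four times; learned kernels would instead interpolate the intermediate frames.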

3.5 Voice Cloning Encoder, Vocoder, Synthesizer

The following subsections describe the voice cloning encoder, vocoder, and synthesizer.

3.5.1 Training Data Shape Modulation

Generally, an audio waveform is represented by a function s(t): [0, T] → R, where
T is the duration of the signal (in seconds) and s(t) is the amplitude at time
t [2]. Sampling must be done in a discretized space, so we first discretize any
continuous wave into samples x_1, x_2, …, x_{RT}, where R is the sampling rate of
x (in Hz), ranging from 4 to 44 kHz. To represent the pitch and modulation of one
wave relative to another, short-term Fourier transforms are used, along with the
inverse and baseline long-term Fourier transforms, which can further serve as
distinction metrics for the sample input training [2]. By using deep convolutional
neural networks (DCNNs), we can eliminate a disadvantage of RNNs regarding the
immediate synthesis of high-resolution data; this is achieved through staged
synthesis of the data, as opposed to the sequential generation of an RNN.
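The discretization step can be sketched as follows; a low 100 Hz rate and a 5 Hz tone are used only to keep the example small (real speech uses 4-44 kHz as noted above):

```python
import numpy as np

def discretize(s, T, R):
    """Sample a continuous waveform s(t), t in [0, T], at rate R Hz,
    giving the discrete sequence x_1, ..., x_{RT}."""
    t = np.arange(int(R * T)) / R
    return s(t)

# Illustrative: a 5 Hz sine over 2 seconds sampled at R = 100 Hz
x = discretize(lambda t: np.sin(2 * np.pi * 5 * t), T=2.0, R=100)
print(len(x))  # 200
```

The resulting length is R·T samples, which is the discrete sequence the spectrogram and encoder stages above operate on.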

3.5.2 Bottleneck Architecture

Each convolution, batch normalization, and ReLU non-linearity block in our model
performs a dilated convolution, dense backpropagation, and a ReLU activation at
the end layer. Due to frequent downsampling and reduction of the original data,
the fixed-dimensional spectrograms become specialized on different parts of the
attention module, which further helps in quickly training the model on a small
dataset by separating out the generated noise and using it as part of the
feature set.

3.5.3 Downsampling and Upsampling

During a downsampling operation, the spatial dimensions are halved and the beta
mask from the previous spectrogram attention size is multiplied in; this is
reversed during an upsampling step [13]. This bottleneck architecture was
influenced by auto-encoders and is considered to allow the model to learn a
hierarchy of features.

3.5.4 Generating Results

The input text is converted into a coarse mel spectrogram and fed to the deep
convolutional text-to-speech (DCTTS) model. The model then returns the output
waveform, which is time-series data that is further encoded as the audio of the
targeted user.

4 Results

This section presents some experimental results from a rough implementation of
our idea. The system takes an image of a speaker and then generates video frames
according to the audio passed as input, as shown in Figs. 3, 4 and 5; the results
are summarized in Tables 1 and 2.

Fig. 3 Generating video frames from a single image and audio spectrograms from language La

Fig. 4 Generating video frames from a single image and audio spectrograms from language La

Fig. 5 Generating deepfake video with accurate lipsync on language Lb using input video and
audio of language La

Table 1 [8] "Lip-Sync Error-Distance" (LSE-D, lower is better) and "Lip-Sync
Error-Confidence" (LSE-C) are two newly suggested metrics with which lipsync
accuracy in unconstrained videos can be accurately measured

                      LRS2 [16]                  LRS3 [17]
Approaches            LSE-D    LSE-C    FID      LSE-D    LSE-C    FID
SPEECH2VID [5]        14.23    1.587    12.32    13.97    1.681    11.91
LIPGAN [10]           10.33    3.199    4.861    10.65    3.193    4.732
WAV2LIP [8]           6.469    7.781    4.446    6.986    7.574    4.350
Real videos           6.736    7.838    –        6.956    7.592    –

Notice that we only train on the LRS2 train set [1], but we can generalize to any
dataset without difficulty

5 Conclusion

In this paper, we proposed an approach to generate end-to-end speech translation
with voice cloning and create real-time deepfakes. This has many applications,
such as international video conferences, teaching aids, regenerating dropped frames,

Table 2 [2] As the table shows, accuracy rises sharply until the break-even point
(somewhere after 40 h in this case), after which the attention module starts
adding noise to the whole sample and the overall accuracy keeps decreasing;
however, the variance of the sample also keeps decreasing, which suggests growing
confidence that the predicted speech notes are right

Time for training (TFT)   Tacotron (RNN based)   Wavenet (DCNN based)
                          MOS (95% CI)           MOS base confidence   Variance (%)
12 days (~270+ h)         207                    –                     15
2 h                       –                      174                   92
7 h                       –                      261                   84
15 h                      –                      271                   37
40 h                      –                      254                   41

and corporate meetings. We studied previously used techniques, like the Speech2Vid
model with encoders and decoders, and then Wav2Lip with GANs, which gives better
results. Deep convolutional neural networks (DCNNs) also give better results for
voice cloning than sequential models like LSTMs and RNNs with GRUs, with the help
of mel spectrograms [18]. We have created a pipeline from these existing effective
techniques and have analyzed our results on real-time data samples, which are very
promising. There is still considerable scope to improve these techniques further
and to reduce the required computational power, as the video conferencing
application discussed requires real-time, faster results.

References

1. M. Baldonado, C.-C.K. Chang, L. Gravano, A. Paepcke, The Stanford Digital Library Metadata
Architecture. Int. J. Digit. Libr. 1, 108–121 (1997)
2. H. Tachibana, K. Uenoyama, S. Aihara, Efficiently trainable text-to-speech system based on
deep convolutional networks with guided attention, in IEEE ICASSP (2018). arXiv:1710.08969
3. S.O. Arik, J. Chen, K. Peng, W. Ping, Y. Zhou, Neural voice cloning with a few samples
(2018). arXiv:1802.06006
4. G. Ruggiero, E. Zovato, L. Di Caro, V. Pollet, Voice cloning: a multi-speaker text-to-
speech synthesis approach based on transfer learning, in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (2021). arXiv:2102.05630
5. J.S. Chung, A. Jamaludin, A. Zisserman, You said that? arXiv preprint (2017). arXiv:1705.
02966
6. S. Shinde, U. Kulkarni, D. Mane, A. Sapkal, Deep learning-based medical image analysis using
transfer learning, in Health Informatics: A Computational Perspective in Healthcare. Studies in
Computational Intelligence, vol. 932, eds. by R. Patgiri, A. Biswas, P. Roy. Springer, Singapore
7. S.S. Mane, S.V. Shinde, Different techniques for skin cancer detection using dermoscopy
images. Int. J. Comput. Sci. Eng. (2017). ISSN 2394-5125
8. K.R. Prajwal, R. Mukhopadhyay, V.P. Namboodiri, C.V. Jawahar, A lip sync expert is all you
need for speech to lip generation in the wild, in Proceedings of the 28th ACM International
Conference on Multimedia (MM '20) (2020). arXiv:2008.10010

9. S. Suwajanakorn, S.M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing Obama: learning lip
sync from audio. ACM Trans. Graph. (TOG) 36, 4 (2017)
10. K.R. Prajwal, R. Mukhopadhyay, J. Philip, A. Jha, V. Namboodiri, C.V. Jawahar, Towards
automatic face-to-face translation, in Proceedings of the 27th ACM International Conference
on Multimedia (ACM, 2019), pp. 1428–1436
11. Y. Jia, R.J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, W. Yonghui, Direct speech-
to-speech translation with a sequence-to-sequence model (2019). arXiv:1904.06037
12. V. Kuleshov, S. Zayd Enam, S. Ermon, Audio super-resolution using neural nets (Cornell
University, 2017). arXiv:1708.00853
13. V. Kuleshov, S. Zayd Enam, S. Ermon, Audio super-resolution using neural nets, in Institute
of Electrical and Electronics Engineers (2015). arXiv:1708.00853
14. A. Sharma, P. Kumar, V. Maddukuri, N. Madamshetti, K.G. Kishore, S.S.S.K.B. Raman,
P.P. Roy (2020). arXiv:2007.05764 [eess.AS]
15. S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application
to face verification, in Proceeding of the CVPR. vol. 1 (IEEE, 2005)
16. T. Afouras, J.S. Chung, A. Senior, O. Vinyals, A. Zisserman, Deep audio-visual speech
recognition. arXiv:1809.02108 (2018)
17. T. Afouras, J.S. Chung, A. Zisserman, LRS3-TED: a large-scale dataset for visual speech
recognition, arXiv preprint arXiv:1809.00496 (2018)
18. M. Blaauw, J. Bonada, R. Daido, Data efficient voice cloning for neural singing synthesis,
in ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal
Processing(ICASSP), pp. 6840–6844. https://doi.org/10.1109/ICASSP.2019.8682656
Topic Evolution Model for Interactive
Information Search

Harshal Adhav and Vikram Singh

Abstract In the modern context, as data generation grows exponentially, finding
meaningful patterns in large datasets is an urgent need. A 'Topic Evolution Model'
can generate the evolutions related to a topic of user interest and assist in the
exploration of patterns. In a generic setting, the proposed 'topic evolution model'
assists researchers and domain experts in extracting relevant information on
scientific progress and technological innovations in a field or domain from large
archives. The evolution patterns uncover emerging, decaying/fading, peculiar, and
long-lasting research topics and subtopics. The performance evaluation on coherence
metrics asserts that the proposed model significantly reduces the domain expert's
effort in topic analysis, as evolving patterns easily reveal the underlying
statistical and machine learning details. The perplexity metric highlights the
capability of the topic model toward the cognitive view of the user, i.e., the
change of ideas and knowledge through a period of time, reducing citation bias.

Keywords Evolution graphs · Information search · LDA · Topic evolution · Topic modeling

1 Introduction

In recent years, there has been increasing demand from domain experts and research
fellows for tools that can help them extract information about scientific
discoveries and new innovations from large corpora of documents; this has led to
research on evolution patterns. Revealing these meaningful patterns of evolution
from corpora of documents, research papers, and scientific articles has many
applications and will help in synthesizing datasets across different domains, such
as scientific research, historical events, and works of literature [1]. While
searching, navigating, and seeking specific information in a document corpus of
research papers and articles, the ability to find and identify topics with their time
H. Adhav (B) · V. Singh
National Institute of Technology, Kurukshetra, Haryana 136119, India
e-mail: harshal_32013108@nitkkr.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 149
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_12

of emergence and see their evolution pattern over time could be of significant help
to the system user.
Example: Consider a scientific paper corpus and a researcher or domain expert who
starts research in a specific area or field. They would want to quickly overview
the area, determine how its topics have evolved, and find important ideas and the
papers that introduced them. After finding a specific concept in a paper, they
want to learn whether previous papers or articles discussed the same concept or
whether the topic is a new and emerging one.
An information search task is often caused or induced by changes happening in the
patterns of the information objects. Therefore, the goal of a topic evolution model
and evolution graph generation is to track changes in search-related topics, either
discovered or retrieved at different times, applying temporal similarity between
topics to align those discovered in different time periods [2].
The generated science evolution patterns will help philosophers and historians of
science test their theories against actual data patterns, and researchers can
direct their work toward emerging scientific areas [3]. Policy makers will be able
to support innovative ideas and obtain key indications for their decision-making
processes. Understanding how different topics, ideas, and innovations in the
scientific domain and its literature evolve, diversify, or integrate over a period
of time is a very interesting and important problem, leading to the generation of
very interesting evolution patterns. At present, there is also an increasing demand
from domain experts and research fellows for tools that can help them extract
information about scientific discoveries and new innovations from large document
corpora [2].
The evolution of scientific literature through the years has two views, i.e., a
cognitive view and a social view of evolution patterns. The cognitive view refers
to the change of ideas and shared knowledge, while the social view deals with
authorship and social interactions. In existing research works, researchers have
adopted both the document content as a bag of words and the citations in it, i.e.,
the impact of one author's work on another, as discussed in the cognitive and
social views. However, this philosophy has the limitation that the work of
individuals who are not affiliated with large institutes gets ignored, creating a
'social' bias. To reduce this social bias, the proposed algorithm adopts the
cognitive view, modeling the interactions between ideas independently rather than
social interactions.
Topic evolution graphs generated through the proposed models efficiently track the
evolution of scientific literature by identifying and analyzing evolution patterns,
e.g., the emergence, persistence, and fading of research topics through the ages,
or the split of one topic into several subtopics within a domain or across domains.
The aim of generating these topic evolution graphs is to outline complex temporal
changes through topic discovery and by finding similarities between different
topics across time epochs.

1.1 Research Questions and Contributions

An information search task is often induced or caused by changes happening in the
patterns of the information objects. Therefore, the goal of a topic evolution model
is to track changes in search-related topics, either discovered or retrieved (i.e.,
extracted) at different times, applying temporal similarity to align topics
discovered in different, equally spaced time periods. The design challenges faced
by this 'topic evolution pattern' generation process are non-trivial and need to
be considered critically so that a generic model can be designed effectively. The
following research questions (RQs) were conceptually designed to conduct the work:
RQ-I: How does the 'topic evolution model' steer the user's search on untraceable topics?
RQ-II: What are the inherent parameters of topic modeling that affect the
generation of the topic evolution model?
The key contribution is a generic framework to assist domain experts' and
researchers' information search and exploration over voluminous data archives.
Further, we investigate the evaluation metrics and criteria to identify the evolving,
fading and long-lasting topics across different time periods.

1.2 Materials and Methods

To tackle these challenges and achieve the goals proposed for this research contri-
bution, we considered several topic modeling methods: latent Dirichlet allocation
(LDA), latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA)
and the hierarchical Dirichlet process (HDP). All are powerful tools for topic
modeling, each with its advantages and disadvantages. LDA is an unsupervised
learning model for topic modeling. For our experimentation, we chose LDA as the
most reliable and suitable technique: it lets us control how many topics are generated,
so we can scale the size and pattern complexity of the topic evolution system.
The document corpus of research articles was extracted from different sources, e.g.,
Google Scholar [4] and arXiv.org [5]. The relevant metadata, e.g., abstract and publi-
cation year, were also acquired. The topics generated by LDA are aligned over the
different time periods using the Jaccard similarity measure, as it can be directly adapted
to LDA for model generation.
152 H. Adhav and V. Singh

2 Related Works

Topic evolution has attracted fast-growing interest in the information retrieval
community, with different evolution models of great efficiency. Existing research
efforts have significantly adapted both the document as a bag of terms or words
and author citations for the purpose of topic evolution.

2.1 Topic Modeling: A Fundamental Notion

Recent evolution network-based frameworks can be mainly differentiated from one
another by the chosen topic extraction and alignment methods. Li et al. [2] proposed
using latent Dirichlet allocation (LDA) to generate a set of topics and aligning them
across different time periods using cosine similarity. He et al. [6] also implemented
latent Dirichlet allocation, to model a citation network, and proposed the inheritance
topic model (ITM). Chavalarias et al. [7] came up with a method that enables a
bottom-up reconstruction of the dynamics of scientific fields: they generate topics
from word co-occurrence graphs and align inter-temporal topics by Jaccard similarity.
LDA [8], latent Dirichlet allocation, is an unsupervised machine learning technique
for topic modeling: a method for detecting and extracting topics from a large
document corpus. LDA considers all word tokens to have equal importance, though
it requires the 'number of topics' as a parameter in order to generate that many topics
from the corpus of documents. The coherence score assists in choosing an appropriate
number of topics. Further, the Jaccard distance [9] was proposed by Paul Jaccard in
1912. The Jaccard distance (dissimilarity) between two sets, the complement of the
Jaccard similarity (coefficient or index), is calculated by subtracting the Jaccard index
from 1. Equivalently, it is the number of terms in the union of the two sets minus
the number of terms in their intersection, divided by the number of terms in the union.
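Both formulations of the Jaccard distance translate directly into code; a generic sketch:

```python
def jaccard_similarity(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:  # convention: two empty sets are identical
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Equivalently (|A union B| - |A intersect B|) / |A union B| = 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)

# Example: two topic term sets sharing 2 of 4 distinct terms.
print(jaccard_similarity({"topic", "model", "lda"}, {"model", "lda", "hdp"}))  # -> 0.5
print(jaccard_distance({"topic", "model", "lda"}, {"model", "lda", "hdp"}))    # -> 0.5
```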
Topic modeling from unstructured datasets is the basis of most information
retrieval systems present today. It is one way to identify and extract the topics in
a large document corpus, under the assumption that documents are a mix of latent
topics to be discovered. Martin et al. [10] applied efficient topic modeling using LDA
with a noun-only approach to filter out unnecessary verbs, adjectives and adverbs
extracted by the LDA model, and checked the increase in the coherence score on
three variants of the dataset (the whole document, its lemmatized version, and the
lemmatized version with nouns only), which helped improve the coherence of the
generated topics.

2.2 Topic Evolution for Information Search

In recent work, Andrei et al. [11] generated topics with a hierarchical Dirichlet
process (HDP) model and used the 'Bhattacharyya similarity,' representing gradual
speciation and convergence similar to biological evolution, to identify topic alignments.
Jo et al. [12] built a model on the premise that the words relevant to a topic are
distributed over documents such that the distribution is correlated with the underlying
document network, such as a citation network.
Tong et al. [13] used LDA topic modeling and the Jensen–Shannon divergence for
text mining research whose main tasks are effectively searching terms, managing
text patterns and exploring retrieved text data. They applied this to Wikipedia articles
and users' tweets to identify the important topics, so that the system can be optimized
for relevant information search. Similarly, Salatino et al. [14] developed the idea of
using the 'Computer Science Ontology' to model research topics, introducing a new
approach for early detection of research topics using the Rexplore system. They
applied a clique percolation method (ACPM) developed for analyzing and evaluating
the dynamics, or changes, happening between the existing topics.
Topic evolution is an interesting field for information search in the era of data
deluge: topic evolution over a document corpus can optimize the information search.
He et al. [6] implemented a topic evolution model on scientific literature with the help
of citations and explained how it helps the system on the basis of the citation network;
they used this citation-aware approach to generate a new and unique inheritance topic
model. Chaudhuri et al. [15] created a research paper and article recommendation
system which, along with topic modeling, used hidden-feature identification methods
to further improve the search system. They extracted hidden features like keyword
diversity, sentence complexity, citation analysis and scientific quality measurements,
which helped in further improving the recommendation system.
Currently, no known systems exist with the capability to generate topic evolution
patterns that may help researchers and domain experts in their work. The main
objective of this paper is to demonstrate a system designed for these needs. We
assert that the proposed system will be a pivotal work enabling knowledge discovery
using the evolving, long-lasting and fading information patterns in topic evolution.

3 Topic Evolution Model for Information Search

Revealing or extracting meaningful topic evolution graphs or patterns from a large
document corpus has many applications: evolving scientific ideas and their potential
trails and relations may be uncovered over time. The key objective of the paper is
twofold: first, to design a topic evolution model, or graph, that assists the domain
expert in tracking the evolution of scientific ideas through patterns, e.g., emerging
and fading research topics; and second, to highlight the split of research topics
into related subtopics. Eventually, this model will steer the domain expert's topic
analysis without requiring them to delve into the underlying statistics and machine
learning.
The designed model depends solely on the cognitive view of scientific literature,
so it is not affected by social bias, i.e., citations, ensuring that the evolution pattern
depends only on the change of ideas and innovations over time. We have developed
a generic topic evolution model for information search whose pivot evolution graphs
can elaborate on the particular topics of interest to the seeker.

3.1 Conceptual Framework

The goal of the 'topic evolution model for information search' is to generate
topic evolution graphs and to represent and filter them so that researchers and
domain experts can use them as they like. The proposed generic framework enables
extraction of meaningful topic evolution patterns over periods of time and potential
co-topics. Each topic is divided into long-lasting, peculiar, evolving and fading
terms on the basis of topic labeling. Figure 1 illustrates the workflow of the proposed
'topic evolution model': a document corpus as input, first split into several time
periods, and evolution patterns as output. Before splitting, each document falling
under a specific duration is preprocessed and cleaned of irregularities.
The elementary processing is achieved using LDA [8], with the objective of
generating the topic sets, which are subsequently aligned by Jaccard similarity [9]
to produce the topic evolution graph Gλ for an alignment threshold λ. This threshold
λ divides the global evolution graph into subgraphs whose topic alignments exceed
the threshold. Further, these graphs can be split into pivot topic evolution graphs on
the basis of topics t1, t2, …, tn. The proposed topic evolution model for information
search consists of two major phases: topic extraction, and topic evolution and graph
generation.

Fig. 1 Topic evolution model for information search

3.1.1 Topic Extraction

Topic extraction is a prerequisite for topic modeling; Fig. 2 outlines the inherent
steps of the proposed topic extraction from a document corpus. The document corpus
is acquired from Google Scholar and other similar sources, and preprocessing is
applied to de-noise the data. We consider a corpus C of time-stamped documents, a
set of periods P and a set of terms V (the vocabulary or dictionary), and divide the
whole document corpus between the different time periods.
As the document corpus needs to be split into different sets of documents for
different time periods, the split is chosen on the basis of the frequency of documents
published in each period, by observing the distribution of documents per year using
bar graphs. Once the distribution is known, we can choose the split period such that
the documents per period are sufficient and evenly distributed, so the model can be
trained effectively and efficiently. As for selecting the period: a user who wants to
see small evolution changes in topics will select a small split period, say 2–3 years,
while a user who wants to explore larger evolution changes will select a larger split
period, say 5–6 years, because the larger the period, the more documents related to
a particular topic it contains, ensuring that the more dominant changes and topics
are extracted by the topic evolution model.
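A minimal sketch of this splitting step, assuming each document carries a publication year and that periods are equal-width and anchored at a start year:

```python
from collections import defaultdict

def split_by_period(docs, start_year, width):
    """Group (year, document) pairs into consecutive periods of `width` years.

    Returns {(first_year, last_year): [documents]}.
    """
    periods = defaultdict(list)
    for year, doc in docs:
        idx = (year - start_year) // width       # which period this year falls in
        first = start_year + idx * width
        periods[(first, first + width - 1)].append(doc)
    return dict(periods)

docs = [(1995, "d1"), (2000, "d2"), (2003, "d3"), (2006, "d4")]
print(split_by_period(docs, start_year=1995, width=6))
```

The `width` argument is exactly the user-chosen split period discussed above (2–3 years for fine-grained changes, 5–6 for coarse ones).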
The different topics are extracted from the document corpus by the topic extraction
method, LDA. The extracted topics are stored as weighted term vectors together with
their time periods: each (topic description, period) tuple is a weighted vector whose
topic description consists of terms extracted from corpus C.

Fig. 2 Overview of LDA for topic extraction

Fig. 3 Topic evolution graph generation overview

3.1.2 Topic Evolution and Graph Generation

We define a topic evolution function (TEF), similarity: T × T → [0, 1], to estimate
the similarity among the topics T of different time periods. Based on this topic
evolution function, a directed labeled multistage topic evolution graph Gλ = (T, E,
similarity, λ) over the topics T is created. Figure 3 outlines the process of topic
evolution and graph generation from the extracted topics. The edges E of the graph
connect all topics from consecutive periods whose similarity is greater than or equal
to the threshold λ. The threshold value λ highly influences the evolution graph
complexity: higher λ values generate 'linear' graphs with isolated topics, while lower
values generate more complex graphs containing a variety of potentially interesting
structures along with a few unnecessary and undesired subgraphs. The optimal value
of the threshold is a key element for the overall system accuracy and efficiency.
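The edge rule above can be sketched in plain Python, using Jaccard similarity as the TEF and representing each topic description as a term set (the toy topics below are invented for illustration):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def build_evolution_graph(topics_by_period, lam):
    """topics_by_period: list (one entry per period, in time order) of
    {topic_id: term_set}.  Returns directed edges
    ((period, id), (period + 1, id), similarity) for every
    consecutive-period topic pair with similarity >= lam."""
    edges = []
    for p in range(len(topics_by_period) - 1):
        for u, tu in topics_by_period[p].items():
            for v, tv in topics_by_period[p + 1].items():
                s = jaccard(tu, tv)
                if s >= lam:
                    edges.append(((p, u), (p + 1, v), s))
    return edges

periods = [
    {0: {"spectrum", "radio", "cognit"}, 1: {"cloud", "machin"}},
    {0: {"spectrum", "radio", "access"}, 1: {"cloud", "outsourc", "attack"}},
]
for edge in build_evolution_graph(periods, lam=0.15):
    print(edge)
```

Raising `lam` prunes the weaker alignments and linearizes the graph, as described above.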
Topic Labeling: For visualization, we assume that all topics t of the evolution graph
Gλ are labeled by the top-k weighted terms of the topic description t.d of each topic.
The topics are divided and labeled as long-lasting, peculiar, evolving and fading,
as described in Table 1. Let t.k be the top-k weighted terms in t.d, and let t.k_p ⊆ t.k
and t.k_f ⊆ t.k be the subsets of past and future terms which appear, respectively,
in the ancestor (parent) topics and in the descendant (child) topics of t.

Table 1 Conceptualization of the four categories of topics

Evolving (future) terms: topics that do not exist in the past but are relevant in future
searches, e.g., 'COVID-19' is an evolving term; denoted by t.k_ev = t.k_f − t.k_p
Fading (past) terms: topics that existed in the past but may not exist in the future,
e.g., 'Plague' has faded over time; denoted by t.k_fd = t.k_p − t.k_f
Peculiar terms: topics that neither existed in the past nor appear in future searches;
denoted by t.k_pe = t.k − (t.k_p ∪ t.k_f)
Long-lasting terms: topics that consistently exist in both past and future searches;
denoted by t.k_ll = t.k_p ∩ t.k_f
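The four categories of Table 1 are plain set operations, and can be sketched as:

```python
def label_terms(k, k_past, k_future):
    """Split a topic's top-k term set into the four Table 1 categories.
    k_past / k_future are the subsets of k seen in parent / child topics."""
    return {
        "evolving":     k_future - k_past,        # t.k_ev = t.k_f - t.k_p
        "fading":       k_past - k_future,        # t.k_fd = t.k_p - t.k_f
        "long_lasting": k_past & k_future,        # t.k_ll = t.k_p intersect t.k_f
        "peculiar":     k - (k_past | k_future),  # t.k_pe = t.k - (t.k_p union t.k_f)
    }

labels = label_terms(k={"a", "b", "c", "d"}, k_past={"a", "b"}, k_future={"b", "c"})
print(labels)  # evolving {'c'}, fading {'a'}, long_lasting {'b'}, peculiar {'d'}
```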
Each document within a duration passes through the topic extraction method to
identify potential topics, which are aligned into a single graph Gλ with threshold λ.
Threshold λ strongly influences the evolution graph complexity: higher λ values
generate linear graphs with isolated topics, and lower values result in complex
graphs with interesting topics.

4 Performance Assessment and Analysis

In order to demonstrate the effectiveness of the topic evolution growth patterns,
the experimental setup was implemented as follows.

4.1 Data and Environment Settings

The implementation environment is primarily built on Python 3 (ver. 3.7) and
Jupyter Notebook. The data objects, scientific publication articles, were extracted
from the Google Scholar API and arXiv.org for the duration 1995–2022. For the
experimentation, 10 pivot topics per duration (e.g., 1995–2001) were used, and a
pandas dataframe stores the data in memory. Further, the evolution graphs are
generated with the gensim library (a Python NLP library), and Graphviz's [16] 'dot'
renders the graph over the derived topics.
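As an illustration of the rendering step, the sketch below emits Graphviz 'dot' source for a few evolution-graph edges; the P&lt;period&gt;_T&lt;topic&gt; node-naming scheme is an assumption, and the returned string can be fed to the 'dot' tool.

```python
def to_dot(edges):
    """Emit Graphviz 'dot' source for edges of the form
    ((period, topic_id), (period, topic_id), similarity)."""
    lines = ["digraph evolution {", "  rankdir=LR;"]
    for (p1, t1), (p2, t2), sim in edges:
        # One edge per aligned topic pair, labeled with its similarity.
        lines.append(f'  "P{p1}_T{t1}" -> "P{p2}_T{t2}" [label="{sim:.2f}"];')
    lines.append("}")
    return "\n".join(lines)

src = to_dot([((0, 0), (1, 0), 0.5), ((0, 1), (1, 1), 0.25)])
print(src)
```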

4.2 Topics Generation and Visualization for User Search Interactions

The number of topics to generate is vital, as a large document corpus may yield
too many topics, duplicate topics, or too few topics. The coherence score helps
optimize the number of potential topics: optimizing a topic model for perplexity
may not generate human-interpretable topics, though perplexity served as the
motivation for topic coherence. The topic coherence score indicates the semantic
similarity between the highly weighted terms within a topic and the relative
inter-topic distance. A coherence measure of 0.9 or 1 means that the words are
identical or bigrams. In Fig. 4, the coherence score estimates for potential topics
on the published scientific documents (during 2008–2013) indicate two insights: a
first peak near 10 topics and a second close to 25 topics. A higher peak value often
leads to a set of duplicate topics; therefore, the optimal value of 10 was opted for
the experimentation.
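To make the coherence idea concrete, here is a simplified UMass-style coherence (for ranked top terms, sum log((D(wi, wj) + 1) / D(wi)) over pairs with wi ranked before wj, where D counts containing documents); this is a didactic sketch on an invented toy corpus, not the exact estimator gensim implements.

```python
import math
from itertools import combinations

def umass_coherence(top_terms, docs):
    """Simplified UMass coherence for one topic's ranked top terms.
    docs: iterable of token collections; D counts documents containing the word(s)."""
    def d(*words):
        return sum(all(w in doc for w in words) for doc in docs)
    score = 0.0
    for wi, wj in combinations(top_terms, 2):  # wi is ranked before wj
        score += math.log((d(wi, wj) + 1) / d(wi))
    return score

docs = [{"cat", "dog"}, {"cat", "dog"}, {"cat"}, {"quantum"}]
print(umass_coherence(["cat", "dog"], docs))      # terms co-occur: score 0.0
print(umass_coherence(["cat", "quantum"], docs))  # never co-occur: negative
```

Topics whose top terms co-occur in many documents score higher, which is why the peaks in Fig. 4 point at semantically tighter topic counts.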

Fig. 4 Coherence score versus number of pivot topics

Figure 5 outlines the topics coherent with each pivot topic, e.g., topic-0 is related
to subtopics of 'Internet of Things,' and topics 1, 2 and 3 are coherent with
'Computer Architecture & Security,' 'Cognitive Radio Networks' and 'Cloud and
Mobile Computing,' respectively.
Figure 6 illustrates the topic visualization for the duration 2002–2007. Here, a bubble
represents a pivot topic, and its size (%) emphasizes its importance. These topics are
generated by LDA and visualized in axis and list views. In the axis view, the farther
bubbles are from each other, the more different they are. In the list view, red and
blue bars indicate the presence of the topics in the corpus: red bars estimate the
coherent pivot topic frequency, whereas blue bars give the overall frequency of each
term. The frequent terms may be visualized even if the user submits nothing.
The similarity matrix is a key criterion for capturing the topic evolution and
modeling the alignment between a pair of durations with a threshold value. The
alignment threshold is chosen by optimality conditions on sparse and dense patterns
using the similarity matrix, as shown in Fig. 7: the matrix shows the similarity values
between topics so that users can decide the alignment threshold according to whether
they want to generate dense, medium or sparse patterns. The alignment threshold (λ)
for the model was chosen as 0.15. The more topics there are, the larger the matrix
will be; Fig. 7 illustrates the similarity matrix of 2008–2013 and 2014–2019 with
threshold value λ = 0.15. Thus, the threshold value must be chosen in an optimal way.
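The interplay between the similarity matrix and the choice of λ can be sketched as follows: build the pairwise Jaccard matrix for two periods' topic term sets (the term sets below are invented for illustration), then check what fraction of alignments each candidate λ retains, i.e., how dense or sparse the resulting pattern would be.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_matrix(topics_a, topics_b):
    """Row i, column j = similarity of topic i (earlier period) to topic j (later)."""
    return [[jaccard(a, b) for b in topics_b] for a in topics_a]

def retained_fraction(matrix, lam):
    """Fraction of topic pairs whose similarity clears the alignment threshold."""
    cells = [s for row in matrix for s in row]
    return sum(s >= lam for s in cells) / len(cells)

m = similarity_matrix(
    [{"spectrum", "radio"}, {"cloud"}],
    [{"radio", "access", "user"}, {"cloud", "attack"}],
)
print(retained_fraction(m, 0.15), retained_fraction(m, 0.40))  # denser vs sparser
```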

4.3 Assessment of Topic Evolution Graph Generation

The topic evolution graph represents topics as nodes and an edge as an inter-topic
similarity above or equal to the threshold; Fig. 8 shows topics of the years 2002–07,
2008–13 and 2014–19 in the first, second and third rows, respectively. Each topic is
colored to indicate its role: peculiar terms are red, fading terms are blue, evolving
terms are green and long-lasting terms are pink. Each instance of the evolution
graph reveals interesting patterns and grabs the user's search interest. A user query
transforms the current graph as it evolves: unnecessary edges and nodes are
discarded.
For a user query on 'spectrum,' the topic evolution graph is transformed as shown
in Fig. 9: all nodes containing 'spectrum' and the relevant edges satisfying the
threshold are retained. This further assists the user in exploring the emerging,
decaying/fading, peculiar and long-lasting subtopic terms in the graphs.

Fig. 7 Similarity matrix, {pivot topics of period 2008–13 and 2014–19}

4.4 Accuracy Performance

The overall retrieval performance is strictly related to two fundamental metrics:
perplexity and coherence. The designed system is evaluated on the perplexity
metric, as it captures 'how surprised the topic model is by newly identified data
that it has never seen' and is usually measured as a normalized log-likelihood. A
lower perplexity value characterizes a good model; the topic models for the durations
1995–2001 and 2019–2025 look optimal, as shown in Fig. 10. Additionally, the
coherence score characterizes the optimal number of topics required and the
inherent quality of the derived topics for the specified duration. The suitable number
may be decided by the user based on the coherence score level; we adopted 10
pivot topics (10 as the optimal value) for the evolution graph, as shown in Fig. 10.
Further, the accuracy of placing an unseen topic in the model is based on the
probability values. For example, 'Cognitive radio network is a subset of the MANET'
has the highest probability of belonging to topic-0 for the duration 2008–13; our
model estimated a probability of 0.6999, as in Fig. 11. A higher probability value
indicates the appropriate class of topics in the topic evolution model; Topic-0, for
instance, has terms like 'spectrum,' 'cognit,' 'radio,' 'future,' 'technolog.' The
designed model is capable of placing unseen data into the appropriate topic class.

Fig. 8 Instance of generated 'Topic Evolution Graph' {years: 2002–07, 2008–13, 2014–19}

Fig. 9 Modified pivot topic evolution graph, for a query 'Spectrum'

Fig. 10 Perplexity and Coherence score (for each duration 1995–2001, 2001–07, etc.)

Fig. 11 Classification probability {of a document-term to a pivot topic}

4.5 Analysis

The topic generation cohesively affects the overall system performance, as a small
number of topics may be generated in polynomial time, while a large number of
topics makes it an NP-hard problem. LDA for topic extraction has O(mnt + t³) time
complexity [17] and requires O(mn + mt + nt) space, where m denotes the number
of tokens or samples to be processed, n is the number of topics to be extracted (the
features) and t = min(m, n). For y periods, the time complexity is therefore
O(y(mnt + t³)). Dividing the terms into the four subcategories takes O(4n²) = O(n²)
time per period, i.e., O(y·n²) for the total period. After topic modeling, the topics
of consecutive periods must be aligned; each alignment takes O(n²) time, again
giving O(y·n²) overall, which is used for graph generation. Therefore, the overall
time complexity of the work is O(y(mnt + t³)) for LDA topic modeling and O(y·n²)
for evolution graph and topic graph generation, as all the steps are evaluated
procedurally one after another.

5 Conclusion

The topic evolution model steers a user's information search and adapts to the
user's cognitive perception of the information stored in the corpus. The proposed
model depends on the cognitive view of scientific literature and adapts to temporal
changes. The topic evolution graph enhances the information seeker's capability to
search within topics and subtopics by submitting pivot topics and showing similarity
scores. The metrics, e.g., similarity score, coherence score and perplexity, assert
that the designed topic evolution model is an adaptive framework for interactive
information search that balances exploitation and exploration during the search.
In the future, a similar topic evolution pattern may be adapted for information
retrieval and Web search platforms; more specifically, scholarly search platforms
are the key application area. For the alignment of temporal changes in topic
evolution, other document corpora, e.g., Wiley and Web of Science, may be
integrated. For topic extraction, advanced techniques like deep learning and
artificial neural networks can be used, which will help increase the automation of
topic evolution implementations.

References

1. D. Shahaf, C. Guestrin, E. Horvitz, J. Leskovec, Information cartography. Commun. ACM
58(11), 62–73 (2015). https://doi.org/10.1145/2735624
2. K. Li, H. Naacke, B. Amann, EPIQUE: extracting meaningful science evolution patterns from
large document archives. in International Conference on Extending Database Technology
(EDBT) (2020)
3. T.S. Kuhn, O. Neurath, The structure of scientific revolutions (2nd ed., enlarged ed.), in Number
ed.-in-chief: Otto Neurath; Vol. 2 No. 2 in International Encyclopedia of Unified Science
Foundations of the Unity of Science (Chicago University Press, Chicago, Ill, 1994). https://doi.
org/10.1515/9781400831296-024
4. Google Scholar, https://scholar.google.com. Last accessed 20 Sept 2021
5. ArXiv.org, https://arxiv.org. Last accessed 20 Sept 2021
6. Q. He, B. Chen, J. Pei, B. Qiu, P. Mitra, L. Giles, Detecting topic evolution in scientific
literature: how can citations help?, in Proceedings of the 18th ACM conference on Information
and knowledge management (2009), pp. 957–966. https://doi.org/10.1145/1645953.1646076
7. D. Chavalarias, J.P. Cointet, Phylomemetic patterns in science evolution—The rise and fall of
scientific fields. PLoS ONE 8(2), e54847 (2013). https://doi.org/10.1371/journal.pone.0054847
8. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022
(2003)
9. P. Jaccard, The distribution of the flora in the alpine zone. 1. New Phytol. 11(2), 37–50 (1912).
https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
10. F. Martin, M. Johnson, More efficient topic modelling through a noun only approach, in
Proceedings of the Australasian language technology association workshop 2015 (2015)
11. V. Andrei, O. Arandjelović, Complex temporal topic evolution modelling using the Kullback-
Leibler divergence and the Bhattacharyya distance. EURASIP J. Bioinf. Syst. Biol. 1, 1–11
(2016). https://doi.org/10.1186/s13637-016-0050-0
12. Y. Jo, J.E. Hopcroft, C. Lagoze, The web of topics: discovering the topology of topic evolution
in a corpus, in Proceedings of the 20th International Conference on World Wide Web (2011),
pp. 257–266. https://doi.org/10.1145/1963405.1963444

13. Z. Tong, H. Zhang, A text mining research based on LDA topic modelling, in International
Conference on Computer Science, Engineering and Information Technology (2016), pp. 201–
210. https://doi.org/10.5121/csit.2016.60616
14. A. Salatino, F. Osborne, E. Motta, AUGUR: Forecasting the “Emergence of New Research
Topics”, in ACM/IEEE on Joint Conference on Digital Libraries (ACM, New York, 2018), pp
303–312. https://doi.org/10.1145/3197026.3197052
15. A. Chaudhuri, N. Sinhababu, M. Sarma, D. Samanta, Hidden features identification for
designing an efficient research article recommendation system. Int. J. Digital Libr. 1–17 (2021),
https://doi.org/10.1007/s00799-021-00301-2
16. Graphviz Homepage, https://graphviz.org. Last accessed 15 Sept 2021
17. D. Cai, X. He, J. Han, Training linear discriminant analysis in linear time, in 2008 IEEE 24th
International Conference on Data Engineering (2008), pp. 209–217. https://doi.org/10.1109/
ICDE.2008.4497429
A Novel Automated Human Face
Recognition and Temperature Detection
System Using Deep Neural
Networks—FRTDS

Varalatchoumy M and Pranav Durai

Abstract This paper proposes a novel FRTDS (Face Recognition and Tempera-
ture Detection System) that is contactless and performs real-time face recognition.
The system has proved to be fast, low-cost to build, and efficient in user authen-
tication. FRTDS comprises numerous algorithms and techniques implemented to
improve the performance of the entire system with the help of deep neural networks.
FRTDS can capture images from a video stream and can detect faces 70–90 cm
away from the camera. An interactive front-end recognizes and displays the identity
of the person. FRTDS also includes a temperature sensor to monitor the health of a
person before they enter any premises. The recognized face, along with temperature
data, is stored at the back-end with the current time and date. This paper also presents
a novel customized tool that eases the process of dataset creation and augmentation,
and a novel prediction-discrepancy elimination algorithm.

Keywords Face recognition · Temperature detection · Novel system · Non-contact
system

1 Introduction

In any organization, the login and logout credentials of all employees need to be
maintained. Traditionally, attendance is tracked using biometric fingerprint scanners,
which are quick and easy to use. The COVID-19 pandemic demands a contactless
system that aids automatic attendance tracking; due to the cost incurred in developing
such systems, most organizations reverted to manual attendance, which is very
time-consuming. Face recognition is a computer vision task that aims to identify
faces its system has previously been trained with. It is widely used

V. M (B) · P. Durai
Cambridge Institute of Technology, Bangalore 560036, India
e-mail: varalatchoumy.cse@cambridge.edu.in
P. Durai
e-mail: pranav.19cs033@cambridge.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 165
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_13

in smartphones, authentication systems, and human–computer interaction, such as
robotics. It involves the computation of a human being's physical face characteristics,
so such a system can also be categorized as a biometric system. Tremendous amounts
of data are required to build an efficient face recognition system; the scale of the
system is directly proportional to the amount of data necessary. While creating a
database with face images of hundreds of employees, there was a need for a dedicated
tool. The novel dataset creation and augmentation tool proposed in this paper aims
to solve this problem.
In addition to face recognition, the body temperature of the person can also be read.
The temperature sensing module in the proposed system helps keep track of
the temperature of all employees entering the premises and raises an alarm if the
temperature is detected to be above a threshold value (97.7–99.5 °F), as referred to
by the World Health Organization. Data from the temperature sensor [1] is stored in
the back-end system to track the employees' health on a daily basis. Stored
data can be retrieved and managed by anyone with the right authority to access the
organization's private and sensitive data.
The novel FRTDS (Face Recognition and Temperature Detection System)
proposed in this paper aims to provide a contactless system that recognizes human
faces and detects their body temperature while being low-cost and easy to build.

2 Literature Survey

Some of the findings from the survey are represented in Table 1.

3 Proposed Methodology

Face recognition and temperature sensing are the two major modules of the novel
FRTDS. The tasks involved are pre-processing of the input image, face detection,
transformation of face data, feature extraction using a deep neural network, and
classification. With respect to temperature sensing, the tasks involved are detection
of the hand, reading the temperature, and pushing the metrics to the back-end
database. These two main modules can be developed individually and integrated
later, before deployment. Figure 1 depicts the overall design of the proposed
integrated system.

Table 1 Comparison of findings and results from other systems


Researchers Findings Results
Liao and Li [2] Facial features detected Higher facial feature
automatically detection accuracy
Zhao et al. [3] PCA with LDA More features for face
Chinimilli et al. [4] Eigenface comparison in Better performance
DB and more storage
Chang-jun et al. [5] Deniable non-target FR Comparatively better
efficiency
Senthilkumar and Gnanamurthy [6] Linear and non-linear SVM Better classification
classifier rate
Guo et al. [7] Linear SVM classifier Binary tree
classification
Nasr [8] Bag of features Better classification
Ding and Tao [9] Sequential methods Person-specific facial
dynamics
Wang et al. [10] Convolutional neural Better locally
network connected weight
training
Montoro et al. [11] Haar-cascade and eigen Both day and night
face method face recognition
Sameem et al. [12] Viola-jones algorithm Single person SURF
features
Jain et al. [13] Digital forensics Problems and solutions

3.1 Pre-Processing

One of the main aspects of face recognition is the ability to construct a dataset that
contains the face data of all enrolled people. A large volume of face data is vital for
the system to perform well: the amount of data ingested into each class is directly
proportional to the accuracy and performance of the recognition model. A
diagrammatic representation of the novel Dataset Creation and Augmentation
(DCA) tool is depicted in Fig. 2. Capturing, collecting, and organizing face images
of hundreds of people is very time-consuming, and considerable manual work is
needed to construct a solid dataset for training the system. This paper therefore
proposes the novel DCA tool, a custom-written Python command-line application
that assists in the tedious process of dataset creation. It accepts a video of the
person’s face as input and requires the user to provide a person identification
number. The ID differentiates one person from another without the need to
manually manage names, which also solves the name redundancy problem. A
unique person ID is assigned to each face class in the dataset, which eases the data
management task.
168 V. M and P. Durai

Fig. 1 Design diagram of proposed novel FRTDS

Haar-cascade classifier is used to detect faces, and only the frames that have
a face in them are selected and extracted. Face frames are re-sized to maintain a
uniform size for all face classes in the dataset. Extracted frames are passed through
a series of image processing filters such as grayscale conversion and histogram
equalization. This pre-processing step elevates the clarity of the features in the face
images. The grayscale filter removes color information from the images in each face
class, eliminating variation due to face color. Similarly, the histogram equalization
filter equalizes and normalizes the bright and dark regions of the image.
To increase the amount of face data in each class, a data augmentation technique
was used: it applies random left and right rotations to the face frames. Another
advantage of this technique is that the system learns how a person’s face looks when
it is tilted.
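The filtering and augmentation steps above can be sketched in plain NumPy (the actual DCA tool uses OpenCV's Haar cascade for detection; the function names and the ±10° angles here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale: removes color so face color no longer varies."""
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def equalize_histogram(gray):
    """Histogram equalization: normalizes the bright and dark regions of the face."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * 255.0
    return cdf[gray.astype(np.intp)].astype(np.uint8)

def rotate(gray, degrees):
    """Nearest-neighbour rotation about the image centre, used for augmentation."""
    h, w = gray.shape
    t = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    sx = np.round(np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx).astype(int)
    sy = np.round(-np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy).astype(int)
    out = np.zeros_like(gray)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[ok] = gray[sy[ok], sx[ok]]
    return out

def augment(face_rgb, angles=(-10, 10)):
    """Grayscale -> equalize -> add left/right tilted copies of the face frame."""
    g = equalize_histogram(to_grayscale(face_rgb))
    return [g] + [rotate(g, a) for a in angles]
```

Each extracted face frame thus yields the filtered frame plus tilted variants, which is what lets the trained model tolerate tilted faces.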
Finally, the face frames are stored in a folder with the previously entered personal
identification number as its name. All the folders will be saved under the main
working directory. Each face class would have 30 face frames. The number of faces
per class is dynamic and can vary. This was done to drastically decrease the size of
the dataset, while not compromising on the amount of data required.

Fig. 2 Block diagram of novel DCA tool

Light affects a
person’s face in all directions. The trajectory of light rays introduces highlights and
shadows on various parts of the face. This makes the face seem different in terms
of appearance. Representation of how the direction of light affects the facade of a
person’s face in real-time data is depicted in Fig. 3. It becomes a challenge for the
Deep Neural Network (DNN) to extract facial features from the face with varying
lighting conditions. For this experiment, a 30-s clip of each employee’s face, with
variation in lighting, was collected. The clips were then processed using the novel
Dataset Creation and Augmentation tool to extract face frames.
The DNN was trained with face images of different employees under varying
lighting conditions. Figure 4 depicts the massive improvement in prediction
performance after the DNN was successfully trained with face images that had
variation in lighting.

3.2 Face Detection

Before the face can be recognized and validated, it needs to be detected in the image
frame. Multiple techniques facilitate this process; effective detection of various
objects using Haar feature-based cascade classifiers has been demonstrated [2].
This is mainly a machine-learning-based approach in which a cascade function is
trained and tested with positive and negative image data. Feature representation
with the Haar-cascade-based classifier is illustrated in Fig. 5. Any image can be
represented with three feature patterns: edge features, line features, and
four-rectangle features. These aid the proposed system in accurately tracking faces
in real time from the extracted frames using the Haar-based classifier [2].

Fig. 3 Effects of illumination

Fig. 4 Prediction accuracy metrics with respect to illumination

Fig. 5 Haar-cascade-based feature extraction
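Rectangle features like these are evaluated efficiently with an integral image (summed-area table). The sketch below is a simplified illustration of that mechanism, not the trained cascade itself, showing a two-rectangle edge feature:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row/column for easy rectangle sums."""
    return np.pad(np.asarray(img, dtype=np.int64).cumsum(0).cumsum(1),
                  ((1, 0), (1, 0)))

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in the half-open window [r0, r1) x [c0, c1) in O(1)."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def edge_feature(ii, r0, c0, r1, c1):
    """Two-rectangle (edge) Haar feature: left half minus right half."""
    cm = (c0 + c1) // 2
    return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)
```

On a window whose left half is bright and right half dark, the feature value is large; the trained cascade thresholds many such features to accept or reject a face candidate.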

3.3 Transformation of Face Data

Detected faces are transformed into a uniform template where the eyes and bottom
lips are aligned at the same location in all the images. This is done to maintain
uniformity across all images. Any slight changes in the orientation of the images
are corrected in this module. As depicted in Fig. 6, the picture on the left shows the
face image before the orientation transformation, and the picture on the right shows
it after the orientation corrections. It can be observed that after the transformation,
the image is straight.
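One common way to perform this alignment (a sketch, assuming eye landmark coordinates are available from a landmark detector) is to rotate the image by the angle of the line joining the eyes:

```python
import math

def alignment_angle(left_eye, right_eye):
    """Angle (degrees) of the eye line; rotating back by this makes the eyes level."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotation_matrix(center, degrees):
    """2x3 affine matrix rotating by -degrees about `center`: p' = R(p - c) + c."""
    t = math.radians(-degrees)
    cx, cy = center
    cos_t, sin_t = math.cos(t), math.sin(t)
    return [[cos_t, -sin_t, cx - cos_t * cx + sin_t * cy],
            [sin_t,  cos_t, cy - sin_t * cx - cos_t * cy]]
```

Applying this affine matrix to every detected face (e.g. with an image library's warp function) puts the eyes and lips at the same locations across all images, as the module above requires.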
Fig. 6 Image orientation transformation on faces

3.4 Deep Neural Networks

A Deep Neural Network (DNN) is an Artificial Neural Network (ANN) with multiple
layers between the input and output layers. A DNN is trained to recognize the features
in a face from the given images and calculate the probability that the features match
a new image that is given to it. The user can review the results and select which
probabilities the network should display (above a certain threshold) and return the
proposed label. Each mathematical manipulation is considered a layer, and complex
DNNs have many layers, hence the name “Deep” networks.
Transformed images are fed into a dense, multi-layer Deep Neural Network.
Novel FRTDS mainly takes advantage of the Keras implementation of OpenFace.
The DNN consists of feature maps with multiple convolution layers, a pooling layer,
and linear classification layers. In the convolution layers, dominant features of the
person’s face are extracted. The features are then passed on to the pooling layers,
where the representation’s spatial size is gradually reduced, and further convolutions
are applied to the generated feature maps. The proposed neural network is illustrated
diagrammatically in Fig. 7.
A person’s face can be represented as a point on a 128-D unit hypersphere. The
points in this hypersphere representation define the face embeddings of a particular
person’s facial features. These facial feature embeddings are the output of the deep
neural network illustrated in Fig. 7. Furthermore, the illustration in Fig. 8 highlights
the parts of the human face and the map of its various facial features.
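Concretely, each face maps to a 128-dimensional vector that is L2-normalized so it lies on the unit hypersphere; the distance between two embeddings then indicates whether they belong to the same person. A sketch with made-up vectors (the 0.9 threshold is an illustrative assumption):

```python
import numpy as np

def to_embedding(raw):
    """Project a raw 128-D network output onto the unit hypersphere."""
    v = np.asarray(raw, dtype=np.float64)
    return v / np.linalg.norm(v)

def same_person(e1, e2, threshold=0.9):
    """Squared Euclidean distance between unit embeddings; small = same face."""
    return float(np.sum((e1 - e2) ** 2)) < threshold

rng = np.random.default_rng(0)
a = to_embedding(rng.normal(size=128))            # stand-in for one face's embedding
b = to_embedding(a + 0.05 * rng.normal(size=128))  # slightly perturbed: same "face"
```

Two images of the same person land close together on the hypersphere, while independent faces land far apart, which is what the classifier in the next section exploits.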

Fig. 7 Proposed neural network design


Fig. 8 128-D facial feature representation

3.5 Classification

Embeddings, in general, are just points in a hypersphere that represent a person’s
face. The problem is that once the embeddings of multiple faces are extracted with
the help of the DNN, they may overlap, so one person’s face embeddings can
interfere with another person’s. To tackle this, the labeled embeddings of individual
people are fed into a classification algorithm. Because the method is supervised,
each face class can be labeled and given to the algorithm, which then learns from
these classes, separates each class from the others, and increases the gap between
them. Finally, it can classify across all the classes it has learned. Novel FRTDS uses
the highly efficient Support Vector Machine (SVM) algorithm for the classification
task. An SVM [6] works by using a nonlinear mapping to transform the original
training data into a higher dimension.
As far as the parameters of the SVM classifier are concerned, the kernel is set to a
polynomial hyperplane, and the C-parameter is set to 1 to curb the problem of
overfitting. For the degree parameter, as the polynomial degree increases, the
training time increases linearly with it.
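The classifier setup described above can be sketched with scikit-learn (an assumption; the paper does not name its SVM library), using synthetic stand-ins for the 128-D face embeddings:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-ins for 128-D embeddings of two enrolled people (30 frames each).
person_a = rng.normal(loc=0.5, scale=0.1, size=(30, 128))
person_b = rng.normal(loc=-0.5, scale=0.1, size=(30, 128))
X = np.vstack([person_a, person_b])
y = np.array([0] * 30 + [1] * 30)  # labels are the person IDs

# Polynomial kernel and C = 1, mirroring the parameters described in the text;
# the degree value 3 is an illustrative assumption.
clf = SVC(kernel="poly", degree=3, C=1.0)
clf.fit(X, y)
```

At recognition time, the embedding of a newly detected face is passed to `clf.predict`, which returns the learned person ID.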

3.6 Face Recognition

The trained SVM-based classifier is used to classify and then recognize the face
when the person comes and stands in front of its camera. The name of the person
is also printed on the feed as an alert. Even with tons of training data, no machine
learning model can return 100% prediction accuracy with high performance. In
highly complex problems such as face recognition, the system should be as
fault-free as possible when it comes to prediction performance.
This paper presents a novel Prediction Discrepancy Elimination Algorithm, a fairly
lightweight, custom-designed algorithm that aims to remove errors while
prediction takes place. Two classes “A” and “B” have been taken to illustrate the
working of the novel algorithm as depicted in Fig. 9. A trained SVM classifier is
used to make multiple predictions as soon as a person arrives in front of the camera
of the system. These predictions are made in real time, and the most frequently
occurring person ID is stored as the final prediction.

Fig. 9 Novel prediction discrepancy elimination algorithm

By implementing this algorithm, the chances of one person being identified as
another are eliminated. During the test run, prediction took approximately 1 s.
When the live feed detects a face in the frame, the system extracts the face and
sends it for classification. If the IR proximity sensor detects a wrist (obstacle)
within a 10 cm range, it triggers the IR temperature sensor to take its readings. This
preserves the fully contactless aspect of the novel FRTDS system.
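The discrepancy-elimination idea reduces to a majority vote over a burst of per-frame predictions. A minimal sketch (the function name and window contents are illustrative):

```python
from collections import Counter

def eliminate_discrepancy(frame_predictions):
    """Return the most frequently predicted person ID over a burst of frames.

    A single misclassified frame (e.g. one vote for "B" among many for "A")
    is outvoted, so a person is not mistakenly identified as someone else.
    """
    if not frame_predictions:
        return None  # no face seen in the window
    winner, _ = Counter(frame_predictions).most_common(1)[0]
    return winner
```

For example, a burst of predictions `["A", "A", "B", "A", "A"]` resolves to `"A"` despite the stray `"B"` frame.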

4 Temperature Detection

The entire flow of the processes involved in the temperature detection system is
represented diagrammatically in Fig. 10. When the IR proximity sensor detects a
hand in front of it, the IR temperature sensor takes a reading. The temperature value
is routed to the database, which is then refreshed to reflect the new sensor values.
The front-end application turns green if the person is safe to enter the premises; in
the unfortunate case that the system detects a person with fever, the front end turns
red. After either of these processes completes, the system goes back to detecting
the hand. If no hand is present, it continues to return a null value to the system.
The sensor modules of the system can be put into sleep mode when not in use.

Fig. 10 Temperature detection algorithm

5 Integrated System—FRTDS

Parameters such as the name, ID, current date, time, and temperature are stored in
a back-end database and can be reviewed at any time. The preferred database is
MongoDB, a NoSQL database that integrates very well with Python. The
temperature data can be used to keep track of the health of employees in any given
month. A web application was designed and developed that is completely
automated and non-intrusive, thus avoiding any physical contact. When a face is
detected, the camera view opens in the middle of the screen and displays the face
being recognized, with text above the detected face showing the prediction. The
real-time implementation of the product and the result obtained for an employee
are depicted in Fig. 11.
We wanted to always keep the security and privacy of the users in mind while
building this system. For this reason, the camera is triggered only when motion is
detected in the frame.

Fig. 11 a FRTDS product developed in-house. b Output obtained

Recognition takes around 1–2 s, and the system immediately
displays the details of the recognized face as an alert. As soon as the employee’s
face is recognized, the name of the person, in-time and out-time, date, and their wrist
temperature data are sent to the database, with the help of the database schema.
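The record pushed per recognition event might look like the following; the field names form an illustrative schema, not the paper's exact one, and with pymongo a call such as `collection.insert_one(record)` would store it:

```python
from datetime import datetime, timezone

def build_attendance_record(person_id, name, temperature_c, event="in"):
    """Assemble one attendance/health document for the back-end database."""
    now = datetime.now(timezone.utc)
    return {
        "person_id": person_id,          # unique ID assigned by the DCA tool
        "name": name,
        "date": now.date().isoformat(),
        "time": now.strftime("%H:%M:%S"),
        "event": event,                  # "in" or "out" time marker
        "temperature_c": temperature_c,  # wrist temperature reading
    }
```

Querying such documents by `person_id` and date range is what enables the monthly employee health tracking described above.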

6 Experimental Results

To benchmark the proposed FRTDS system, various tests were conducted.

6.1 System Performance Analysis

After the complete integrated system was achieved, a series of experiments was
conducted to measure the performance metrics. Based on the detailed literature
survey, the proposed novel FRTDS system was evaluated on a dataset with 20
classes containing labeled face data. Each face class consisted of 30 images of that

Fig. 12 ROC curves on dataset with 20 classes

Fig. 13 Final prediction performance of novel FRTDS

particular person, in different lighting conditions. Novel FRTDS was able to effec-
tively recognize all 20 people that took the test, with a prediction accuracy of 84.54%
as depicted in Fig. 12. Novel FRTDS is developed as a dynamic and real-time face
recognition system.
Prediction accuracy was checked during its development and is depicted in Fig. 13.
In each iteration, weights were tweaked and experimented with until an optimal
performance metric was observed. As an end result, we were able to achieve a
maximum prediction accuracy of 84.54%.

6.2 General Comparison with Existing Systems

In Table 2, the proposed FRTDS system has been compared with existing systems.

Table 2 Proposed system versus existing systems from survey

Parameter | Proposed system (FRTDS) | Existing systems
Form factor | Portable and deployable enclosure | No enclosure; not deployable in the real world
Integration | Performs face recognition and temperature detection in parallel | Cannot do both together in real time
Response time | Less than 2 s | Around 3–4 s
Database | NoSQL/MongoDB-based attendance capability | Basic attendance capability
Accuracy | Temperature reading accurate to ±0.5 °C | Varies from one manufacturer to another
Cost to build | Cost-effective; easy to build | Costly components; cannot be built easily

7 Conclusion

At present, the number of COVID-19 cases is constantly increasing across all
countries. A system such as this one is needed to help flatten the curve of the
spread as much as possible in places with public gatherings, in educational
institutions, and at workplaces. There is demand for the system proposed in this
paper because there are many places where it can be accommodated.

Novel FRTDS can be built and deployed on walls outside the entrances of
buildings to effectively monitor the temperature of the people entering and leaving
a particular facility. In hospitals, patients can have periodic temperature checks,
and only authorized personnel can be allowed into restricted rooms or labs. In
educational institutions, both the attendance and the body temperature of students
can be monitored as they enter and leave the premises. In organizations, employees
can be monitored for their work hours. At airports, authorities can keep track of all
the passengers and their temperatures. The temperature detection module of this
system is effective not only for the novel coronavirus but also for malaria, since
body temperature plays a major role in the diagnosis of malaria in a person. Hence,
the very motive of this project is satisfied. As part of future research on novel
FRTDS, the collected face and temperature data can be routed to a separate
database deployed on the cloud, where data analytics can be performed to observe
patterns. This could give researchers and doctors new insights into the
characteristics of the virus and reveal underlying details, if any exist. Developers
can also leverage novel FRTDS and implement it on handheld devices such as
smartwatches.

References

1. M. Cheung, L. Chan, I. Lauder, C. Kumana, Detection of body temperature with infrared thermography: accuracy in detection of fever. Hong Kong Med. J. = Xianggang Yi Xue za Zhi/Hong Kong Acad. Med. 18(Suppl 3), 31–34 (2012)
2. R. Liao, S.Z. Li, Face recognition based on multiple facial features, in Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580) (2000), pp. 239–244. https://doi.org/10.1109/AFGR.2000.840641
3. W. Zhao, R. Chellappa, A. Krishnaswamy, Discriminant analysis of principal components for face recognition, in Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition (1998), pp. 336–341. https://doi.org/10.1109/AFGR.1998.670971
4. B.T. Chinimilli, T. Anjali, A. Kotturi, V.R. Kaipu, J.V. Mandapati, Face recognition based attendance system using Haar cascade and local binary pattern histogram algorithm, in 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (2020), pp. 701–704. https://doi.org/10.1109/ICOEI48184.2020.9143046
5. C. Chen, Y. Zhan, C. Wen, Hierarchical face recognition based on SVDD and SVM, in 2009 International Conference on Environmental Science and Information Application Technology (2009), pp. 692–695. https://doi.org/10.1109/ESIAT.2009.139
6. R. Senthilkumar, R.K. Gnanamurthy, Performance improvement in classification rate of appearance based statistical face recognition methods using SVM classifier, in 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS) (2017), pp. 1–7. https://doi.org/10.1109/ICACCS.2017.8014584
7. G. Guo, S.Z. Li, K. Chan, Face recognition by support vector machines, in Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580) (2000), pp. 196–201. https://doi.org/10.1109/AFGR.2000.840634
8. S. Nasr, K. Bouallegue, M. Shoaib, H. Mekki, Face recognition system using bag of features and multi-class SVM for robot applications, in 2017 International Conference on Control, Automation and Diagnosis (ICCAD) (2017), pp. 263–268. https://doi.org/10.1109/CADIAG.2017.8075668
9. C. Ding, D. Tao, Trunk-branch ensemble convolutional neural networks for video-based face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 1002–1014 (2018). https://doi.org/10.1109/TPAMI.2017.2700390
10. D. Wang, H. Yu, D. Wang, G. Li, Face recognition system based on CNN, in 2020 International Conference on Computer Information and Big Data Applications (CIBDA) (2020), pp. 470–473. https://doi.org/10.1109/CIBDA50819.2020.00111
11. T. Mantoro, M.A. Ayu, Suhendi, Multi-faces recognition process using Haar cascades and eigenface methods, in 2018 6th International Conference on Multimedia Computing and Systems (ICMCS) (2018), pp. 1–5. https://doi.org/10.1109/ICMCS.2018.8525935
12. M.S.I. Sameem, T. Qasim, K. Bakhat, Real time recognition of human faces, in 2016 International Conference on Open Source Systems and Technologies (ICOSST) (2016), pp. 62–65. https://doi.org/10.1109/ICOSST.2016.7838578
13. A.K. Jain, B. Klare, U. Park, Face recognition: some challenges in forensics, in 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG) (2011), pp. 726–733. https://doi.org/10.1109/FG.2011.5771338
A Novel BFS and CCDS-Based Efficient
Sleep Scheduling Algorithm for WSN

B. Srinivasa Rao

Abstract The main aim of the present research is to propose a novel BFS and CCDS-
based efficient sleep scheduling algorithm for WSN using two popular search tech-
niques, Breadth First Search (BFS) and Color Connected Dominated Set (CCDS), for
reducing energy consumption and delay when the message is broadcasted in WSN.
Breadth First Search (BFS) is implemented to find the minimum distance path from
a sensor node and reduce the delay in transmitting the message. Color Connected
Dominated Set is used to transmit messages to all nodes without collision and hence
minimize the energy consumption. The analysis is made between two algorithms
with the same set of nodes.

Keywords WSN · Sleep scheduling · BFS · CCDS · Energy efficiency

1 Introduction

The most important aspect of wireless sensor networks (WSN) is that sensor node
batteries must keep working for long periods without charging while monitoring
critical events, which affects the efficiency of the WSN. Hence, intelligent
techniques are required for effective conservation of the energy of the power
sources. Based upon the sources of energy waste in WSN, various types of methods like data
reduction, control reduction, energy-efficient routing, duty cycling, topology control
have been reported in the literature [1]. All these techniques have their own
advantages and disadvantages [2]. It has been observed that it is essential to design
energy-efficient scheduling algorithms to enhance the lifetime of the power source
and, in turn, of the sensor nodes [3]. In that context,
sleep scheduling algorithms significantly reduce the energy consumption and time
delay of wireless sensor networks [4]. In general, sleep scheduling algorithms are used
in the form of synchronous, semi-synchronous, and asynchronous mechanisms. In a
synchronous scheduling mechanism, all sleeping nodes wake-up for communication

B. S. Rao (B)
Gokaraju Rangaraju Institute of Engineering and Technology, Bachupally, Hyderabad 500090,
India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 181
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_14
182 B. S. Rao

and require additional control traffic. In a semi-synchronous mechanism, the nodes
form into clusters, and the sleeping and wake-up occur within the clusters. But all
these clusters are not in synchronization. In asynchronous sleep scheduling, each
node contains its scheduling as per requirement [5]. Most of the sleep scheduling
algorithms fall into one of these mechanisms. Recently, optimization techniques
and heuristic approaches have also become very popular for efficient sleep
scheduling mechanisms [6–8]. The main aim of the present research is to propose
and implement a novel efficient energy and delay balance ensemble scheduling algorithm for
wireless sensor networks using two popular search techniques, Breadth First Search
and Color Connected Dominated Set, for reducing energy consumption and delay
when a message is broadcasted in WSN. The novelty in the present research problem
is an ensemble of Breadth First Search algorithm and Color Connected Dominated
Set algorithm, which earlier researchers did not consider. Breadth First Search is
implemented to find the minimum distance path from a sensor node and reduce the
delay in transmitting the message. CCDS is used to transmit messages to all nodes
without collision and hence minimize the energy consumption. In the present paper,
the concepts of BFS and CCDS scheduling are given briefly in Sect. 2. The related
work has been presented in Sect. 3. The proposed novel sleep scheduling model has
been described in Sect. 4. Methodology and experimental parts are given in Sects. 5
and 6, respectively. Section 7 deals with results, discussion parts. Finally, the work
is concluded in Sect. 8.

2 BFS and CCDS Scheduling

This section describes the working of BFS and CCDS scheduling algorithms.

2.1 BFS Scheduling

BFS scheduling is designed on the lines of the Breadth First Search algorithm from
graph theory. The procedure begins by visiting a sensor node and all of its
neighboring sensors. In the subsequent step, the nearest neighboring sensors of
those nodes are visited, and the same procedure continues in the following steps.
The algorithm visits all the adjacent sensor nodes of every sensor node in the
network and ensures that each sensor is visited exactly once [9]. By implementing
the BFS algorithm, a BFS tree that describes the uplink path can be constructed in
the following steps.
Step 1. A sensor node is selected as the central node.
Step 2. All the sensor nodes are categorized into node levels L1, L2, L3, etc.
Step 3. Each level is depicted by a different color.
A Novel BFS and CCDS-Based Efficient Sleep Scheduling … 183

Table 1 Node representation

S. No | Source node | Neighbor nodes
1 | 1 | 2, 3, 4
2 | 2 | 1, 6, 5
3 | 3 | 1, 4, 5, 8
4 | 4 | 3, 1, 7
5 | 8 | 3, 7

Step 4. In this process, the neighboring sensors of each sensor are computed, as
shown in Table 1.
From Table 1, the BFS tree is constructed for each node as shown in Fig. 1. Then a
routing table is built for each node in the WSN, as shown in Table 2. Each route in
the BFS tree routing table defines the uplink path for that particular sensor node.

Fig. 1 BFS tree structure

Table 2 Routing table and uplink paths

S. No | Source node | Uplink path
1 | 1 | 1 -> 1
2 | 2 | 2 -> 1
3 | 3 | 3 -> 1
4 | 4 | 4 -> 1
5 | 5 | 5 -> 2 -> 1 or 5 -> 3 -> 1
6 | 6 | 6 -> 2 -> 1
7 | 7 | 7 -> 4 -> 1
8 | 8 | 8 -> 3 -> 1
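The BFS tree and the uplink routes of Table 2 follow directly from the adjacency in Table 1. A sketch with node 1 as the central node:

```python
from collections import deque

# Undirected links from Table 1 (each link listed once; both directions implied).
links = {1: [2, 3, 4], 2: [6, 5], 3: [4, 5, 8], 4: [7], 8: [7]}
adj = {}
for u, vs in links.items():
    for v in vs:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

def bfs_parents(center):
    """Level-by-level BFS: each node is visited exactly once, O(N) overall."""
    parent, queue = {center: None}, deque([center])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def uplink_path(node, parent):
    """Route from a sensing node back to the central node, e.g. 8 -> 3 -> 1."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path
```

Because BFS discovers nodes level by level, each uplink path is a minimum-hop route, which is what keeps the message delay low.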

2.2 Color Connected Dominated Set (CCDS) Scheduling

The CCDS scheduling is used to construct a downlink path in the WSN when a
critical event occurs. The design is inspired by the CCDS concept proposed for
WSN, which implements self-regulation among the nodes [1]. In WSNs, downlink
path construction differs from uplink path construction: the communication is
between a long-distance node and the central node, multiple paths are possible, and
an optimal one must be selected. This process requires internal self-regulation
among the sensor nodes. In the present CCDS scheduling, different algorithms are
combined to construct a Connected Dominating Set that serves as a backbone for
the WSN. The main objectives of the backbone are the reduction of communication
overhead, enhancement of bandwidth efficiency, minimization of overall energy
consumption, and improvement of the effective network lifetime in a WSN [1].

2.3 Construction

For the construction of CCDS, we have followed the earlier methods proposed in
[10, 11]. The construction process involves (i) Maximum Independent Set (MIS) in
G, (ii) Connected Dominated Set (CDS), and (iii) Internal Model Controlling (IMC)
algorithm [12].
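A minimal sketch of steps (i) and (ii) on the Table 1 topology, using simple greedy heuristics for illustration (the paper's actual construction follows [10–12] and additionally colors the set):

```python
# Same topology as Table 1 / Fig. 1 (node: neighbours)
adj = {1: {2, 3, 4}, 2: {1, 5, 6}, 3: {1, 4, 5, 8}, 4: {1, 3, 7},
       5: {2, 3}, 6: {2}, 7: {4, 8}, 8: {3, 7}}

def greedy_mis(adj):
    """Maximal independent set: pick a node, ban its neighbours, repeat."""
    mis, banned = set(), set()
    for u in sorted(adj):
        if u not in banned:
            mis.add(u)
            banned.update(adj[u])
    return mis

def connect_dominators(adj, mis):
    """Grow the MIS toward a connected dominating set by adding bridge nodes:
    any node adjacent to two or more dominators becomes a connector."""
    cds = set(mis)
    for u in sorted(adj):
        if u not in cds and len(adj[u] & mis) >= 2:
            cds.add(u)
    return cds
```

Every node is then either in the resulting set or one hop from it, so the set can act as the routing backbone described above.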

3 Related Work

Different methods of sleep scheduling for extension of the lifetime of the WSN have
been recently reviewed [1–8]. In general, most of the proposed sleep scheduling
mechanisms involve the components like target prediction, reduction of awakened
sensors, and control active time of the sensor. The energy-efficient TDMA sleep
scheduling algorithm has the advantages of maximization of the lifetime of WSN,
but this mechanism has disadvantages like delay, data overlap, and reduction in
channel utilization [13]. The Balanced-Energy Scheduling used WSN sensor redun-
dancy to increase the lifetime and network load balance to improve the efficiency
[14], but it has a disadvantage of long-distance communication in the WSN. The
DESS algorithm reduces energy utilization and communication delay but has the
issues of communication delay and latency [15]. Recently, a new energy-efficient
reference node selection (EERS) algorithm has been proposed [16] for time synchro-
nization in industrial WSNs. EERS achieves large savings in energy consumption
but is applicable only to networks with many nodes at the industrial level. A multilevel sleep
scheduling algorithm was developed, adopting the clustering concept for wireless
sensor networks [17]. Even though the model increases the network lifetime, the
lack of synchronization among the clusters is an issue to consider. An efficient sleep scheduling
mechanism was developed for WSN using similarity measures [18]. This model
reduces the energy consumption by scheduling the nodes into active or sleeping
modes. But this mechanism is not effective in the case of sparse distribution of sensor
nodes. A heuristic-based delay tolerance and energy-saving model was developed for
WSN [19], which gave a better performance to other models but is confined to only
a mobile base station scenario. A Sensor Node Scheduling Algorithm was proposed
for heterogeneous WSN [20] to improve network lifetime and regional coverage rate.
However, this model is more suitable for only static nodes WSN and may not be effec-
tive for mobile nodes. Recently, Mhatre and Khot [21] proposed an energy-saving
opportunistic routing sleep scheduling algorithm that effectively reduces energy
dissipation but is confined to only one-dimensional topology networks. Recently,
Sinde et al. [22] proposed energy-efficient scheduling using deep reinforcement
learning that increases the lifetime and reduces the network delay and has shown
better performance than previous models. But the main issue with the model is with
the complexity of deep reinforcement learning. Ant colony optimization algorithm
was used for energy optimization of WSN with better energy efficiency [6]. Another
scheduling algorithm was proposed by Manikandan [23] using Game Theory and
Wake-up approach for energy efficiency in WSN. The main disadvantage with this
model is many approximations that make the model unrealistic. Recently, a metric
routing protocol was proposed for the evolution and stability of IoT networks [7]. A
heuristic approach-based ant colony optimization multipath routing algorithm was
proposed for virtual ad hoc networks to optimize the relay bus and route selection
issues [8]. The above two models give an idea for considering some new approaches
for efficient scheduling mechanisms for WSN. Also, minimizing energy consumption
is also a good security-providing aspect in WSN [24]. From the above discussion,
we infer that the above-mentioned scheduling algorithms are strong in some
respects and have drawbacks in others, which stresses the need for further research
into designing new and novel efficient algorithms to enhance the lifetime of the
sensors. In all the above algorithms
and models, no appropriate attempt has been made to balance energy consumption
and delay time simultaneously. The main advantage of Breadth First Search in WSN
is that during its level-by-level traversal of the tree, it classifies tree edges and
cross edges, and its time complexity is O(N). This property is very
important for efficient routing in wireless sensor networks [9]. Also, Breadth First
Search (BFS) is implemented to find the minimum distance path from a sensor node
and reduce the delay in transmitting the message. On the other hand, the Connected
Dominating Set (CDS)-based routing is one kind of hierarchical method that has
received more attention in reducing routing overhead. Color Connected Dominated
Set is used to transmit messages to all nodes without collision and hence minimize
the energy consumption [1]. The ensemble of these two techniques balances both
energy consumption and delay time in the wireless sensor network. In the present
research work, we propose an efficient sleep scheduling procedure for WSN using
two popular search techniques Breadth First Search and Color Connected Dominated

Set, for reducing energy consumption and delay when the message is broadcasted in
WSN. In the next section, the proposed model has been presented.

4 The Proposed Novel Sleep Scheduling Model

In the present paper, we propose an efficient sleep scheduling procedure for wireless
sensor networks using two popular search techniques, Breadth First Search and Color
Connected Dominated Set, for reducing energy consumption and delay when the
message is broadcasted in WSN. The proposed algorithm is considered in two phases
(i) the uplink phase and (ii) the downlink phase. The uplink phase is scheduled
by BFS scheduling, inspired by graph theory's Breadth First Search algorithm,
while the downlink phase is scheduled by Color Connected Dominated Set scheduling.
Finally, the combination of BFS scheduling and CCDS scheduling forms the proposed
novel efficient sleep scheduling algorithm to reduce energy consumption and delay
when a message is broadcast in WSN. The proposed
new scheduling is briefly described as follows. A model wireless sensor network that
has been deployed for the detection of any critical event is shown in Fig. 2. It consists
of a central node (black shaded) also known as center node that has the capability
of communication with all the network nodes. In the case of detecting a critical or
disaster event by any node of the network (denoted by gray shaded node), the gray
node sends an alarm to the central node and thus constructs the uplink path. For the
construction of the uplink path, BFS scheduling is implemented. In constructing this
uplink path, the shortest path from any node to the central node is computed by BFS
scheduling.
Now the central node transmits the received alert from the gray node during
the uplink phase to all the other sensors in the network. For the construction of
the downlink path, the Color Connected Dominated Set scheduling is implemented.
For the construction of the downlink path, CCDS is constructed using the Internal

Fig. 2 Construction of
uplink path
A Novel BFS and CCDS-Based Efficient Sleep Scheduling … 187

Fig. 3 Construction of
downlink path

Model Control (IMC) algorithm [1, 12]. The Internal Model Control algorithm is a
self-regulating process and characterizes the downlink path while transmitting the
alert from the central node to all other sensor nodes of WSN, as shown in Fig. 3.

5 Methodology

This section discusses the methodology for both uplink and downlink phases of the
present scheduling algorithm. Initially, deployment of nodes is performed. In order
to have communication in the WSN, route discovery is done using route table entries
of all the sensor nodes in WSN. A node initiates route discovery by sending a request
to its neighboring first hop nodes to know whether they are located in its path and
waits for a route reply from the first hop neighbors. Based on the first hop nodes’
reply messages, the broadcasting node updates its routing table entry destination ID.
The sequence number and battery status are updated for the latest information about
the nodes’ fresh routes and energy levels. After identifying the neighbor nodes, the
BSF algorithm is implemented to construct the BSF tree that divides the nodes into
different levels. Using these levels, a CCDS is constructed. Also, MIS is constructed
for independent nodes of each level. For comparative study, both scheduling algo-
rithms will be inputted with the same set of nodes. The methodology diagram for the
proposed work is shown in Fig. 4.
The pseudocode for phases of BSF scheduling and CCDS construction has been
presented below. The general notations are followed in the pseudocodes.

Fig. 4 Methodology diagram for proposed work

_____________________________________________________________________
Pseudocode-1: BFS-Scheduling Procedure for WSN denoted by V [9] (For notations
refer [9])
_____________________________________________________________________
begin
  for each node n ∈ V do
    Distance[n] = infinity; Predicate[n] = -1;
    Color[n] = White;
  Distance[s] = 0; Color[s] = Gray;
  Q = EmptyQueue; Enqueue(Q, s);
  while (Q is not empty) do
    u = Head(Q);
    for each neighbor n of u do
      if (Color[n] is White) then
        Distance[n] = Distance[u] + 1;
        Predicate[n] = u;
        Color[n] = Gray;
        Enqueue(Q, n);
    Dequeue(Q);
    Color[u] = Black;
end;
____________________________________________________________________________
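As a sketch of Pseudocode-1, the BFS scheduling logic can be written in plain Python. The node labels follow Table 3, but the adjacency list here is only an illustrative fragment of the network, and the helper names (`bfs_schedule`, `uplink_path`) are ours, not from [9]:

```python
from collections import deque

def bfs_schedule(adj, source):
    """Level-order traversal: returns (distance, predecessor) maps.

    The White/Gray/Black coloring of Pseudocode-1 is implicit here: a node
    is 'White' until discovered, 'Gray' while queued, 'Black' after dequeue.
    """
    distance = {source: 0}
    predecessor = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()            # Dequeue(Q); u turns Black
        for n in adj.get(u, []):
            if n not in distance:      # Color[n] is White
                distance[n] = distance[u] + 1
                predecessor[n] = u     # n turns Gray
                queue.append(n)
    return distance, predecessor

def uplink_path(predecessor, node):
    """Shortest uplink path from a sensor node back to the central node."""
    path = [node]
    while predecessor[path[-1]] is not None:
        path.append(predecessor[path[-1]])
    return path

# Illustrative fragment of the network rooted at the central node S1
adj = {"S1": ["S2", "S3", "S4", "S5", "S6", "S7"],
       "S2": ["S8", "S9", "S11", "S3"],
       "S4": ["S15", "S14", "S13"]}
dist, pred = bfs_schedule(adj, "S1")
print(uplink_path(pred, "S13"))   # ['S13', 'S4', 'S1']
```

The recovered path S13-S4-S1 matches the BFS column of Table 6.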

____________________________________________________________________________
Pseudocode-2: Construction of Maximum Independent Set [10] (For notations refer [10])
____________________________________________________________________________
Function MIS(W)
begin
  if (!connected(W)) then
  begin
    X = SCC(W);
    if (|X| <= 2) P = 1 else P = MIS(X);
    return (MIS(W - X) + P);
  end;
  if (|W| <= 1) then return (|W|);
  Select Y, Z of W such that
    (i) d(Y, W) is minimal and
    (ii) (Y, Z) is an edge of W and d(Z, W) is maximal for all neighbors with degree d(Y, W);
  if (d(Y, W) = 1) then return 1 + MIS(W - M(Y));
  if (d(Y, W) = 2) then
  begin
    Z' := M(Y) - Z;
    if (Edge(Z, Z')) then return (1 + MIS(W - M(Y)));
    return Maximum(2 + MIS(W - M(Z) - M(Z')), 1 + MIS2(W - M(Y), M2(Y)));
  end;
  if (d(Y, W) = 3) then return Maximum(MIS2(W - Y, M(Y)), 1 + MIS(W - M(Y)));
  if Y dominates Z then return MIS(W - Z);
  return Maximum(MIS(W - Z), 1 + MIS(W - M(Z)))
end;
____________________________________________________________________
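Robson's recursion above is intricate; as a hedged illustration of what a Maximum Independent Set is (not of Robson's algorithm itself), a brute-force search on a tiny graph can serve as a reference implementation:

```python
from itertools import combinations

def is_independent(nodes, edges):
    """True if no edge connects two nodes of the candidate set."""
    return all(not (u in nodes and v in nodes) for u, v in edges)

def max_independent_set(vertices, edges):
    """Exhaustive search from the largest size down; exponential time,
    so suitable only as a check on tiny graphs."""
    for size in range(len(vertices), 0, -1):
        for cand in combinations(vertices, size):
            if is_independent(set(cand), edges):
                return set(cand)
    return set()

# Toy 5-node path graph a-b-c-d-e: the MIS is {a, c, e}
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(max_independent_set(vertices, edges))   # {'a', 'c', 'e'} (in some order)
```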

6 Experiment

The software requirements are: (a) Backend: Python 3; (b) Frontend: HTML,
CSS and Bootstrap 4. The hardware requirements are: RAM: >32 GB advised;
Graphics card: Nvidia GTX 1071; Processor: Intel Core i7-8750H;
Storage: 512 GB SSD. Simulator: network simulator.

7 Results and Discussion

A novel efficient sleep scheduling algorithm for WSN was proposed and tested
on a network simulator in the present research work. The information of the
source node and its neighbors used as input is given in Table 3. The implemented
results of the BFS scheduling algorithm are presented in Table 4. Similarly, the exper-
imental results of the CCDS scheduling algorithm are given in Table 5. Table 6 shows
the paths from any sensor node to the central node in WSN. From Table 4, it can be
observed that during BFS scheduling, some parent nodes have children, and others

Table 3 Input neighbors list


S. No Nodes Adjacent list
1 S1 S2, S3, S4, S5, S6, S7
2 S2 S8, S9, S11, S3
3 S6 S7, S18, S19, S17
4 S8 S9, S20
5 S7 S8, S19, S20
6 S4 S15, S14, S13
7 S3 S12, S4, S13
8 S12 S13
9 S13 S14

Table 4 Output of BFS


S. No Parent node Child node
1 S1 S6, S5, S4, S7, S2, S3
2 S6 S19, S17, S18
3 S7 S20, S8
4 S8 –
5 S9 S22, S21, S23
6 S13 –
7 S4 S13, S14
8 S14 –
9 S17 S16
10 S5 S15
11 S15 –
12 S18 –
13 S3 S12
14 S12 –
15 S2 S11, S9

do not, as they are leaf nodes. It is well understood that no energy is released
from leaf nodes.
Similarly, from Table 5, it is understood that during CCDS scheduling, the parent
nodes not having children would not be involved in message transmission. Comparing
Tables 4 and 5, it can be seen that the numbers of child nodes of the parent nodes
under BFS scheduling and CCDS scheduling are not the same. Also, CCDS has more
childless nodes. Therefore, it can finally be understood that the childless parents
of BFS are not involved in the dissipation of energy as they are leaf nodes, and the
childless nodes of CCDS are not involved in message transmission and in turn
conserve energy, which is the main objective of the present research work.

Table 5 Output of CCDS


S. No Parent node Child node
1 S1 S6, S5, S4, S7, S2, S3
2 S6 S17
3 S7 S20, S19
4 S8 –
5 S9 S21, S22, S23
6 S13 S12, S14
7 S4 S13, S15
8 S14 –
9 S17 S16
10 S5 –
11 S15 –
12 S18 –
13 S3 –
14 S12 –
15 S2 S11, S9

Table 6 Paths representation


Node to center node BFS path to center CCDS path to
node center
S15 -S1 S15-S5-S1 S15-S4-S1
S13-S1 S13-S4-S1 S13-S4-S1
S12-S1 S12-S3-S1 S12-S13-S4-S1
S14-S1 S14-S4-S1 S14-S13-S4-S1
S8-S1 S8-S7-S1 S8-S20-S7-S1
S9-S1 S9-S2-S1 S9-S2-S1
S17-S1 S17-S6-S1 S17-S6-S1

Table 6 shows the paths from any sensor node to the central node in WSN. The
bold lettered nodes in the table reveal the difference in the number of hops. It can
be observed that the sensor nodes S12, S14 and S8 have a smaller number of hops in
their paths to the central node S1 under BFS scheduling in comparison with CCDS
scheduling. At the same time, the nodes S2, S3, S4, S5, S6 and S7 are at one hop
distance from the center node S1. It is obvious that the hop count varies with the
levels of the nodes in the WSN.
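The hop-count comparison can be checked directly from the path strings of Table 6; `hop_count` is a small helper we introduce for illustration:

```python
def hop_count(path):
    """Hops = number of edges along a 'Sx-Sy-...-S1' path string."""
    return len(path.split("-")) - 1

# (BFS path, CCDS path) pairs as reported in Table 6 for three of the nodes
table6 = {"S12": ("S12-S3-S1", "S12-S13-S4-S1"),
          "S14": ("S14-S4-S1", "S14-S13-S4-S1"),
          "S8":  ("S8-S7-S1",  "S8-S20-S7-S1")}
for node, (bfs_path, ccds_path) in table6.items():
    print(node, hop_count(bfs_path), hop_count(ccds_path))
```

For all three nodes, the BFS path uses 2 hops while the CCDS path uses 3, matching the observation above.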

8 Conclusion

In the present paper, the proposed efficient sleep scheduling algorithm has been
successfully designed and implemented. From the experimental results, it can be
concluded that, to achieve balanced and better energy saving together with fast
transmission of a message in WSN, the present algorithm successfully combines BFS
and CCDS scheduling. Further research is required to improve the model. In the
future, we will extend this model to both homogeneous and heterogeneous networks of
sufficiently large size and compare it with other models numerically and analytically.

Acknowledgements The author is thankful to the management of GRIET for their encouragement.

References

1. Z. Liu, B. Wang, L. Guo, A survey on connected dominating set construction algorithm for
wireless sensor networks. Inf. Technol. J. 9(6) 1081–1092 (2010)
2. R. Soua, P. Minet, A survey on energy efficient techniques in wireless sensor networks. in 2011
4th Joint IFIP Wireless and Mobile Networking Conference (WMNC 2011) (2011), pp. 1–9,
https://doi.org/10.1109/WMNC.2011.6097244
3. A.R. Pagar, D.C. Mehetre, A survey of energy efficient sleep scheduling in WSN.
Semanticscholar.org, Corpus ID 212548604 (2015)
4. M. Karthihadevi, S. Pavalarajan, Sleep scheduling strategies in wireless sensor network. Adv.
Nat. Appl. Sci. 11(7), 635–641 (2017)
5. Z. Zhang, L. Shu, C. Zhu, M. Mukherjee, A short review on sleep scheduling mechanism in
wireless sensor networks, in QShine 2017 eds. by L. Wang et al. (LNICST 234, 2018), p. 66
6. J. I. Z. Chen, K. Lai, Machine learning based energy management at internet of things network
nodes. J. Trends Comput. Sci. Smart Technol. 2(3), 127–133 (2020)
7. S. Smys, C. Vijesh Joe, Metric routing protocol for detecting untrustworthy nodes for packet
transmission. J. Inf. Technol. 3(2), 67 (2021)
8. R. Dhaya, Kanthavel R., Bus-based VANET using ACO multipath routing algorithm. J. Trends
Comput. Sci. Smart Technol. (TCSST) 3(1), 40 (2021)
9. V.K. Akram, O. Dagdeviren, Breadth-first search-based single-phase algorithms for bridge
detection in wireless sensor networks. Sensors (Basel, Switz.) 13(7), 8786–8813 (2013). https://
doi.org/10.3390/s130708786
10. J.M. Robson, Algorithms for Maximum independent Sets. J. Algorithms 7, 425–440 (1986)
11. G. Peng, J. Tao, Z. Qian, Z. Kui, Sleep scheduling for critical event monitoring in wireless
sensor networks. IEEE Trans. Parallel Distrib. Syst. 23(2), Feb (2012)
12. D. Rivera, M. Morari, S. Skogestad, Internal model control—PID controller design. Ind Eng.
Chem Process Des Dev. 25, 252–265 (1986)
13. P. Laxman, P. Rajeev, Comparative analysis of TDMA scheduling algorithms in wireless sensor
networks. https://www.semanticscholar.org/ Corpus ID: 61778529 (2014)
14. J. Feng, H. Zhao, Energy balanced multisensory scheduling for target tracking in WSN.
Sensors (Basel) 18(10), 3585 (2018)
15. S. Soumyadip, D. Swagatam, M. Nasir, V.V. Athanasios, P. Witold, An evolutionary multiob-
jective sleep-scheduling scheme for differentiated coverage in wireless sensor networks. IEEE
Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(6) (2012)
16. E. Mahmoud, M.A. Abd El-Gawad, K. Haneul, P. Sangheon, EERS: Energy-efficient reference
node selection algorithm for synchronization in industrial wireless sensor networks. Sensors
20, 4095 (2020) https://doi.org/10.3390/s20154095
17. S. Hassan, M.S. Nisar, H. Jiang, Multilevel sleep scheduling for heterogeneous wireless sensor
networks. Comput. Sci. Technol. Appl. 227 (2016)
18. R. Wan, N. Xiong, N.T. Loc, An energy efficient sleep scheduling mechanism with similarity
measure for wireless sensor networks. Hum. Cent. Comput. Inf. Sci. 8, 18 (2018)
19. O. Jerew, N. Bassan, Delay tolerance and energy saving in WSN in mobile base station. Wirel.
Commun. Mob. Comput. 2019 (2019) Article ID 3929876
20. Z. Wang, Y. Chen, B. Liu, H. Yang, Z. Su, Y. Zhu, A sensor node scheduling algorithm for
heterogeneous wireless sensor networks. Int. J. Distrib. Sens. Netw. 15, 1 (2019)
21. K.P. Mhatre, U.P. Khot, Wireless Personal Communications, 112(1243) (2020)
22. R. Sinde, F. Begum, K. Njau, S. Kaijage, Refining network life time of wireless sensor networks
using energy efficient clustering and DRL based sleep scheduling. Sensors, 20(5), 1540 (2020)
23. K.B. Manikandan, Game theory and wake up approach scheduling in WSN for energy efficiency.
Turk. J. Comput. Math. Educ. 12(10), 2922 (2021)
24. R.B. Gudivada, R.C. Hansdah, Energy efficient secure communication in wireless sensor
networks, in 2018 IEEE 32nd International Conference on Advanced Information Networking
and Applications (AINA) (2018), pp. 311–319. https://doi.org/10.1109/AINA.2018.00055
Face Recognition: A Review and Analysis

Amit Verma, Aarti Goyal, Nitish Kumar, and Hitesh Tekchandani

Abstract Face recognition is the process of identifying and verifying a person from
image or video. A significant amount of contributions have already been made by
the researchers in this field of face identity techniques and recognition. In this paper,
we further explored and investigated the evolution of face recognition methods from
low-level features extracted from global features such as PCA, eigen face and SVM-
based methods to high-level features extracted from deep learning models such as
DeepFace and VGGFace. We also discussed the challenges such as illumination and
pose variation and available standard data sets such as LFW and Yale data set in the
field of face recognition.

Keywords Face recognition · Global methods · Deep learning methods · PCA ·
SVM · DeepFace · VGGFace

1 Introduction

The field of computer vision has produced various tremendous subfields such as face
detection [40, 59], activity recognition [72, 80, 84, 85] and medical image processing
[7, 18]. Facial recognition is one of the crucial subfields which plays an important

A. Verma (B) · H. Tekchandani
Department of Electronics and Communication Engineering (ECE), Koneru Lakshmaiah
Education Foundation (KLEF), Hyderabad, India
e-mail: amit.verma@klh.edu.in
H. Tekchandani
e-mail: hitesh@klh.edu.in
A. Goyal · N. Kumar
Department of Electronics and Communication Engineering (ECE), National Institute of
Technology, Raipur, India
e-mail: agoyal.phd2016.etc@nitrr.ac.in
N. Kumar
e-mail: nkumar.phd2018.etc@nitrr.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 195
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_15
196 A. Verma et al.

role in biometric identification [3, 81]. A facial recognition system matches a
human face from an image or a video against a database of faces in order to verify
the user. Facial recognition systems find application
in various fields such as for security and surveillance—finding missing children,
kinship verification, tracking criminals, etc., for health care—patient medication,
detecting genetic diseases, etc., and for banking and retail—customer verification,
KYC, mobile users, etc.
Global features normally attempt to combine low-level structural and geometric
statistics of the complete objects or entire region of interest as a whole for face
recognition. Few examples of global features-based methods are linear subspace
[10, 26, 50], manifold [25, 36, 90] and sparse representation [22, 23, 88, 94].
However, these features become sensitive to noise. Also, uncontrolled behaviour of
facial changes cannot be handled by these global methods. Later, in early 2000s,
local feature-based methods have been introduced. Local representations describe
the extraction and collection of local features specifically in the spatial–temporal
domain and are obtained in a bottom-up structure. Different feature detector such as
Gabor [45] and LBP [4] and their extensions [16, 24, 95] obtained improved results
in the task of facial recognition. Later, the concept of bag-of-visual words (BOW)
is utilized for various visual classification applications like texture classification,
object/scene retrieval, image categorization and object localization, respectively. For
facial recognition, this codebook representation was used by researchers in [14,
15, 44] which provides better distinctiveness. Both global and local representation
provided significant progress. However, these low-level features need a lot of manual
labour. Furthermore, the generalization capability due to complex nonlinear facial
appearance variations is limited.
A deep convolutional neural network (DCNN) is able to learn robust and high-level
feature representations of an image or video. The supervised deep learning network
of Krizhevsky et al. [43] achieved a remarkable performance enhancement on the
large-scale visual classification data set ImageNet [60] using a DCNN [33]. Recently,
many deep architectures such as VGG [68], ResNet [35] and GoogleNet [77] have
been proposed with improved classification performance on various data sets.
The hierarchical features extracted from various layers enabled the network to tackle
variations due to face pose and expression changes [15, 74, 78]. The initial
convolution layers extracted features similar to Gabor and SIFT, whereas the higher
layers obtained more complex features to learn facial recognition. The state-of-the-art deep
learning model developed was DeepFace [78]. It achieved great recognition accuracy
on LFW benchmark [37] close to human vision. After this, many deep learning-based
approaches have been introduced in the field of facial recognition such as DeepID
series [73–76], VGGFace [54], FaceNet [64] and VGGFace2 [13].
A significant amount of contributions have already been made by researchers
in this field of face identification and recognition techniques, and comprehensive
reviews of the related work can be found in [2, 12, 96]. In this paper, we further explore
and investigate the evolution of face recognition methods from low-level features
extracted from global features-based methods to high-level features extracted from
deep learning models.
Face Recognition: A Review and Analysis 197

2 Basics of Face Recognition System

Face recognition operates on images characterized by notable symmetric structures,
which are generally used in numerous computer vision systems. Perception applications
use symmetric images, and detection algorithms locate faces and focus on
facial images that incorporate the features of the face. This makes face recognition
more difficult and complex than a single detection or recognition algorithm.
The basic workflow of a face recognition system is shown in Fig. 1.
1. The initial phase in the face recognition process is to capture an image from
the camera. In the case of a video, multiple frames are captured, and an additional
face tracking step is added to the basic workflow.
2. In the next step, face detection takes place. Based on the requirement, it may
include some preprocessing steps such as noise reduction and background extrac-
tion.
3. In the next step, face alignment is performed. Face alignment is one of the
important aspects for an accurate face recognition system. It deals with accu-
rate localization and normalization of facial parts such as eyes, nose, eyelids
and mouth. The geometrical relationship among various facial parts contributes
towards higher accuracy of facial recognition.
4. After that, feature extraction takes place. Feature extraction is a process of
collecting various effective facial features or information which can be utilized as
the representation of a specific user in the database. For global and local feature-based
methods, we extract low- and middle-level features, whereas for deep-learning-
based approaches, we extract high-level and complex features.
5. This collected feature information is later used in feature matching stage to match
a query face from the user database.
6. Finally, based on the feature matching results, we assign a user label to the query
face.
Further, as per the requirement, facial recognition can be divided into two types of
matching. First, we can perform one-to-one matching, which generates only one match.
Second, we can generate a list of suspects matching the query face, which is
called one-to-many matching.
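The two matching modes can be sketched as follows; the feature extractor is stubbed out, and the threshold and toy feature vectors are illustrative assumptions, not values from any cited system:

```python
def extract_features(face_image):
    """Stub for steps 1-4: a real system would run face detection,
    alignment and a feature extractor (e.g. a DCNN embedding) here."""
    return face_image  # the 'image' is already a feature vector in this sketch

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def one_to_one(query, enrolled, threshold=0.5):
    """Verification: does the query match one claimed identity?"""
    return distance(extract_features(query), enrolled) <= threshold

def one_to_many(query, database, threshold=0.5):
    """Identification: candidate identities sorted by distance."""
    q = extract_features(query)
    hits = [(distance(q, feat), label) for label, feat in database.items()]
    hits.sort()
    return [label for d, label in hits if d <= threshold]

database = {"alice": [0.1, 0.9], "bob": [0.8, 0.2]}
print(one_to_one([0.12, 0.88], database["alice"]))   # True
print(one_to_many([0.15, 0.85], database))           # ['alice']
```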
For facial identification, the face needs to be verified to recognize the people in
the facial images. After face detection, normalized images need to be produced with
some preprocessing methods. Once the normalized images are made, the faces can be
passed to the recognition algorithm. There are two classes of techniques for this:
2D techniques and 3D techniques. In 2D techniques, 2D images are used as input, and
various learning/training strategies are applied to recognize a person's identity.
In 3D techniques, 3D data are used as the recognition input, and different strategies
are applied for recognition, e.g. using a corresponding point scale, a half-face
estimation and a 3D geometric scale.

Fig. 1 Basic workflow of face recognition system

3 Face Recognition Methods

Since the development of one of the first facial recognition systems by Sakai et al.
[61], many algorithms have been developed for the same. In this section, we discuss
some global and local feature-based methods. Further, we also discuss some recent
deep learning models for facial recognition.

3.1 Eigen Face

The eigen face method [83] is considered to be an important facial recognition tech-
nique. It computes the variance of faces to represent a face image as an eigen
vector [67]. The technique primarily calculates eigen vectors and eigen values from
the covariance matrix. Further, we use principal component analysis (PCA) [65, 70,
82] to project the higher-dimensional space to a lower-dimensional subspace of
principal components. This enables us to operate with a small set of features for a
huge database, which also reduces the computational complexity. It was first
introduced by Sirovich et al. [69]. This method computes the difference between
features corresponding to different parts of the face. Many researchers have also
used this method with non-frontal faces [79].
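A minimal eigenface-style computation, with random vectors standing in for real face images (the data, dimensions and number of components are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 20 'face images' flattened to 64-dimensional vectors
faces = rng.normal(size=(20, 64))

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Eigen decomposition of the covariance matrix via SVD (numerically stabler):
# the rows of Vt are the eigen vectors (eigenfaces) of the covariance matrix
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:5]              # top-5 principal components

# Project a face into the 5-dim eigenface subspace and reconstruct it
weights = (faces[0] - mean_face) @ eigenfaces.T
reconstruction = mean_face + weights @ eigenfaces
print(weights.shape)             # (5,)
```

A query face is then matched by comparing its weight vector against the stored weight vectors, which is far cheaper than comparing 64-dimensional images directly.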

3.2 Gabor Wavelet

Gabor functions were first introduced by Dennis Gabor [9] as a tool for signal detec-
tion in noise. Later, these Gabor filters were redefined as 2D functions called as
Gabor wavelets by Daugman [21]. In order to perform facial recognition, the Gabor
wavelets utilize local features [20] computed at different facial parts, which are
termed Gabor features [66]. These local features are computed at different scales.
Hence, there lies a high probability of redundancy in the features. Researchers have
utilized many feature reduction algorithms [42, 91] to minimize this redundancy. As a
wavelet function, Gabor features can extract both spatial and frequency features.
Hence, a face can be represented as a combination of both [86, 97].
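A 2D Gabor kernel of the usual form (a Gaussian envelope times a sinusoidal carrier) can be sketched as follows; the parameter values are illustrative, not taken from the cited works:

```python
import numpy as np

def gabor_kernel(size=15, theta=0.0, sigma=3.0, lam=6.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter. Parameter names follow the common
    convention: theta = orientation, sigma = envelope width, lam = wavelength,
    gamma = aspect ratio, psi = phase offset."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)      # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

# A small filter bank at four orientations, as used for local Gabor features
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
print(len(bank), bank[0].shape)
```

Convolving a face patch with such a bank at several scales yields the multi-scale Gabor features described above.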

3.3 Artificial Neural Network (ANN)

In the meantime, ANN was gaining significant attention and being utilized in many
classification tasks like age and gender classification. Several face recognition
methods also apply ANN for face recognition [5]. The multi-layer architecture of ANN
enables the network to learn from past experience and also eliminates the distur-
bances due to illumination and face pose variation. It is to be noted that the initial
features were still extracted by local feature detectors, and ANN was used for
classification only. In [87], the authors utilized ANN with frontal faces for face
recognition. In other works [42, 91], the authors utilized a self-organizing map
neural network (SOM) for face recognition. Although the use of ANN provided a
significant boost in face recognition accuracy, the amount of training time was very
high due to the multiple layers of the network.

y = Σi (wi ∗ xi) + b (1)

where wi and b represent the weights and biases and xi and y represent the input and
output.
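Eq. (1) amounts to a single weighted sum; a minimal sketch:

```python
def neuron(x, w, b):
    """Weighted sum of Eq. (1); a real ANN layer would apply an
    activation function (e.g. a sigmoid) to this value."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print(neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2))
```

With these inputs the sum evaluates 0.5*1 - 0.25*2 + 0.1*3 + 0.2 = 0.5; a multi-layer ANN stacks many such units.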

3.4 Hidden Markov Models (HMM)

Hidden Markov model (HMM) [62] is one of the statistical models used for face
recognition. It consists of two processes: first, a Markov chain consisting of a set
of states, and second, a probability density function for each of these states. It
has been utilized in a variety of pattern recognition applications. In [63], the
authors used HMM for face recognition. However, it was limited to one-dimensional data.
Later, five-state HMMs [52] were developed for facial recognition problems. These
five states correspond to nose, eyes, chin, mouth and forehead. Later, HMM was
bundled with other algorithms like wavelets [11] and also used for facial recognition
from temporal data [19]. Further, researchers utilized advanced HMMs such as the
structural hidden Markov model (SHMM) [53] and the adaptive hidden Markov model
(AHMM) [48] for face recognition.
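A minimal sketch of the HMM forward algorithm on a hypothetical two-state model (the cited five-state models would use the five facial regions as states; all probabilities here are made up for illustration):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observation sequence
    under a discrete HMM (summing over all hidden state paths)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit_p[s][o] * sum(prev[p] * trans_p[p][s]
                                            for p in states)
                      for s in states})
    return sum(alpha[-1].values())

# Hypothetical two-state model over coarse observation symbols
states = ["eyes", "mouth"]
start_p = {"eyes": 0.6, "mouth": 0.4}
trans_p = {"eyes": {"eyes": 0.7, "mouth": 0.3},
           "mouth": {"eyes": 0.4, "mouth": 0.6}}
emit_p = {"eyes": {"dark": 0.8, "light": 0.2},
          "mouth": {"dark": 0.3, "light": 0.7}}
print(forward(["dark", "light"], states, start_p, trans_p, emit_p))
```

Recognition then amounts to scoring the observation sequence of a query face under each enrolled person's HMM and picking the highest probability.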

3.5 Support Vector Machine (SVM)

Similar to neural networks, the support vector machine (SVM) [41] has been utilized
for many classification tasks. SVM is an important learning-based method, which can
be effectively utilized to design a classifier, as shown in Eq. 2, for facial
recognition problems. The primary features are extracted by a local feature detector
[6]. Based on the feature properties, an SVM classifier is trained using the face
features. This kind of hybrid approach was utilized by many researchers, such as ICA
with SVM [27] and binary tree with SVM [32]. In order to perform feature extraction
and reduction, various algorithms such as PCA and LDA [51, 71] were utilized before
classification with SVM. In comparison with ANN, SVM provides faster training
and less computation. However, the accuracy is comparatively lower.

wT xi + b ≥ +1 for ∀i such that yi = +1
(2)
wT xi + b ≤ −1 for ∀i such that yi = −1

where feature vectors xi ∈ R n and output labels yi ∈ {+1, −1}. w and b represent the
parameters of the hyperplane.
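The constraints of Eq. (2) can be checked numerically; the points and the hyperplane below are illustrative assumptions, not data from the cited works:

```python
def satisfies_margin(w, b, X, y):
    """Check the hard-margin constraints of Eq. (2):
    y_i = +1 -> w.x_i + b >= +1, and y_i = -1 -> w.x_i + b <= -1,
    written compactly as y_i * (w.x_i + b) >= 1 for every sample."""
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        if yi * score < 1:
            return False
    return True

# Toy linearly separable points with a hypothetical hyperplane w = (2, 0), b = 0
X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
y = [+1, +1, -1, -1]
print(satisfies_margin([2.0, 0.0], 0.0, X, y))   # True
```

SVM training searches for the (w, b) satisfying these constraints with the largest margin, i.e. the smallest ||w||.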

3.6 Deep Convolutional Neural Networks (DCNN)

DCNNs enable the extraction of a wide range of features from images and videos.
Krizhevsky et al. [43] achieved a remarkable performance enhancement on the
large-scale ImageNet data set [60] in 2012 using a DCNN [33]. Recently, many state-of-
the-art deep architectures such as VGG [68], ResNet [35] and GoogleNet [77] have
been proposed with improved classification performance on various object recogni-
tion data sets.
Significant progress has also been made by researchers in facial recognition.
They utilized the transfer learning approach to apply state-of-the-art DCNN
models to facial recognition. DeepFace [78] was proposed in 2014 using the AlexNet
architecture. It achieved a very high accuracy on the LFW data set. DeepID [75, 76] is
a series of systems (e.g. DeepID, DeepID2, etc.) developed for both identification
and verification. VGGFace [54] and FaceNet [64] were proposed in 2015 utilizing
the VGGNet and GoogleNet architectures, respectively. Both models utilized a
triplet loss function and surpassed the accuracy of the DeepFace model. Later, in
2017, SphereFace [47] was proposed using the ResNet architecture. The timeline for
the progress of facial recognition models is shown in Fig. 2.

Fig. 2 Timeline of growth of DCNN models for object and face recognition
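The triplet loss used by FaceNet-style models can be sketched as follows; the margin value and the toy embeddings are illustrative assumptions:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: push the anchor-positive distance below the
    anchor-negative distance by at least `margin` (squared Euclidean)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

# Anchor and positive are embeddings of the same person, negative of another
a, p, n = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
print(triplet_loss(a, p, n))   # 0.0: already separated beyond the margin
```

Minimizing this loss over many (anchor, positive, negative) triples is what pulls same-identity embeddings together and pushes different identities apart.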

These deep architectures perform significantly well. However, their computational
requirements are very high due to the large size of the networks. It is difficult
to fit these networks into small embedded devices. Hence, a set of small architec-
tures has been developed, such as MobiFace [29] and SqueezeNet [39]. The manual
selection of layers and tuning of parameters are still time consuming and error
prone. Hence, it is desirable to have an adaptive network architecture model.
Recently, neural architecture search (NAS) [101] has performed outstandingly well
in object classification. It was used for face recognition in [100] to achieve an
optimum architecture.
It is also important to have an end-to-end learning model which is able to
perform all the tasks required in a face recognition system, i.e. detection, alignment
and recognition [17, 34, 89, 98]. These models represent more robust and optimal
architectures for face recognition. In [34], the authors were able to register and
represent faces at the same time. In [89], the CNN model performed the alignment and
verification tasks in the same model. Apart from the alignment issue, there are other
factors, such as illumination and variation of poses, which affect the perfor-
mance of face recognition. The researchers understood this problem, and various
models have been proposed which can perform multitasking [55, 58, 93]. In [57], the
authors presented a task-specific network.
In order to extract a variety of features from a single image, image augmentation
has been utilized. To process each of the features, individual networks are assembled
together to form multi-input networks [28, 46, 99]. These multi-input networks

Table 1 Different face recognition methods along with data sets


Method Data set Year References
Wavelet coding AT&T 2003 [11]
ICA, SVM Yale and AR 2003 [27]
SVM and LDA VTSFace 2006 [71]
Incremental LDA Yale 2007 [5]
DCT and HMM BANCAFace 2007 [62]
Maximum confidence HMM FERET 2008 [19]
Haar and Gabor wavelet AT&T 2008 [53]
Probabilistic neural network BioID 2009 [87]
Gabor features Yale, FERET and AR 2010 [91]
PCA eigen faces AT&T 2012 [70]
Eigen face FRAVFace 2013 [82]
Local Gabor binary pattern Yale 2014 [20]
DeepFace (AlexNet) LFW 2014 [78]
DeepID LFW 2014 [75]
VGGFace (VGGNet) LFW 2015 [54]
FaceNet (GoogleNet) LFW 2015 [64]
SphereFace (ResNet) LFW 2017 [47]

utilize differently cropped images at different scales to learn a DCNN for the
recognition task. Later, for image augmentation, some generative models have been
utilized, such as the autoencoder [92] and the generative adversarial network (GAN)
[38]. GAN has gained a lot of popularity in facial recognition problems due to its
effective generation property; it was first proposed by Goodfellow et al. [31]. Many
researchers have utilized GAN for face processing applications [8, 36]. It also
mitigates the issue of limited data set availability for faces required by deep
learning architectures. A few of the local feature and deep learning face
recognition models are shown in Table 1.
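A toy version of such augmentation, producing several cropped and flipped views from one image (real systems use library routines on pixel arrays; this list-of-lists sketch is only illustrative):

```python
def augment(image):
    """Simple augmentation: the original image, a horizontal flip, and
    four overlapping crops. `image` is a list of pixel rows."""
    h, w = len(image), len(image[0])
    flip = [row[::-1] for row in image]
    crops = [[row[c:c + w - 1] for row in image[r:r + h - 1]]
             for r in (0, 1) for c in (0, 1)]
    return [image, flip] + crops

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
views = augment(img)
print(len(views))   # 6 differently cropped/flipped views
```

Each of the six views would feed one branch of a multi-input network, as described above.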

4 Data Sets

In this section, we will discuss some important and commonly used data sets for
facial recognition problems.

4.1 FERET Data Set

It is one of the benchmark data sets for face recognition [56]. It contains 14,126
images from 1199 participants for both training and validation. It was recorded in 15
different sessions. It consists of some duplicate images from the same user as well
in order to understand facial change with respect to time.

4.2 AT&T Face Data Set

It is another benchmark data set for face recognition [1]. It contains 400 images
from 40 participants for both training and validation. The images are recorded in the
span of two years. Compared to other data set, this is relatively less challenging and
considered to be used by beginners.

4.3 Yale Face Data Set

It is another benchmark data set for face recognition [30]. It contains 2414 images
from 38 participants for both training and validation. It consists of two sets. The data
set is comprised of various challenging environments such as illumination changes,
different expressions and occlusion.

4.4 AR Data Set

It is another standard data set for face recognition [49]. It contains 4000 images from
126 participants for both training and validation. The images were collected in a controlled
environment with slight variations in illumination and expression.

4.5 LFW Data Set

The LFW data set [37] has been one of the most popular data sets among deep learning
researchers. Many of the standard deep face recognition architectures [47, 54, 64, 78]
were trained on it. It consists of 13,233 images from 5749 users.

5 Challenges in Face Recognition

There has been significant progress in the field of facial recognition. However, it
remains a very challenging task in computer vision. Many factors affect the accuracy
of face recognition, such as age, illumination, changes in expression and pose variation.
204 A. Verma et al.

5.1 Illumination

The illumination effect is also called the lighting effect. Illumination varies with the
environment and background, such as day or night. It may cause extra light or dark patches
in the region of interest, which can significantly affect the accuracy of facial
recognition.

5.2 Pose Variation

Any slight change in head pose affects the accuracy of a face recognition system.
The head pose has three degrees of freedom, i.e. roll, pitch and yaw, and head movement
combines these three motions. Hence, changes in view angle and gaze cause significant
variation in face pose, which ultimately results in poor accuracy.
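The three degrees of freedom can be made concrete with standard rotation matrices; the sketch below (our own illustration with assumed axis conventions, not part of any surveyed system) shows how even a modest yaw rotation displaces a frontal landmark sideways in the image plane:

```python
import numpy as np

def rotation_matrix(roll, pitch, yaw):
    """Compose a 3D head rotation from roll (z-axis), pitch (x-axis) and yaw (y-axis) angles in radians."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    return Rz @ Rx @ Ry

# A landmark on the nose tip, 10 cm in front of the head centre.
nose = np.array([0.0, 0.0, 0.1])
turned = rotation_matrix(0.0, 0.0, np.radians(30)) @ nose
print(turned)  # the x-coordinate is no longer zero: the landmark shifts sideways
```

A 30-degree yaw moves the nose tip by half the head-to-landmark distance, which is why pose normalization or pose-robust features are needed before matching.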

5.3 Expression Change

Facial parts such as the mouth, nose, eyes and chin, and their internal relationships,
constitute important features for facial recognition. Hence, any change in expression
during registration or verification of a human face may directly affect the performance
of the face verification system.

5.4 Age Variation

Age is an important factor when building a good-quality face recognition system.
The face is a combination of skin, tissues and muscles, and with ageing, the facial skin
and muscles change for each person. Hence, it is important to keep the database of each
user updated in a timely manner.

5.5 Occlusion

As mentioned earlier, facial parts such as the mouth, nose, eyes and chin, and their
internal relationships, constitute important features for facial recognition. If any of
these parts is occluded by another object or is not clearly visible, the accuracy of the
facial recognition system can suffer.

5.6 Model Complexity and Parameters

Recent face recognition systems, especially deep learning models, are trained as large
networks that require high computational resources. Hence, it is important to have
sufficient computational infrastructure to operate these models in a real-time interface.
Also, different models are trained on different data sets and tuned with different
parameters. Hence, a good system should incorporate a variety of data sets and be trained
sufficiently to avoid underfitting or overfitting.
Due to these challenges, it is difficult to build an ideal or 100% accurate face
recognition system. However, we can expect a system that minimizes the effect of these
challenges to a significant level.

6 Conclusion

In this paper, we presented a detailed survey of the field of facial recognition. At
first, a basic facial recognition system was presented with a block diagram. Further,
we discussed a few global and local feature-based methods such as eigenface, HMM
and SVM. We also highlighted the evolution of the face recognition field through
deep learning architectures. Furthermore, we identified and discussed the challenges
in the field of face recognition.

References

1. AT&T Laboratories Cambridge, The Database of Faces (2002). http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
2. A.F. Abate, M. Nappi, D. Riccio, G. Sabatino, 2d and 3d face recognition: a sur-
vey. Pattern Recogn. Lett. 28(14), 1885–1906 (2007). Image: Information and Control.
https://doi.org/10.1016/j.patrec.2006.12.018, https://www.sciencedirect.com/science/article/
pii/S0167865507000189
3. R. Abinaya, L. Maguluri, S. Narayana, M. Syamala, A novel biometric approach for facial
image recognition using deep learning techniques. Int. J. Adv. Trends Comput. Sci. Eng. 9(5),
8874–8879 (2020). https://doi.org/10.30534/ijatcse/2020/283952020
4. T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application
to face recognition. IEEE Trans. Pattern Anal. Mach. Intel. 28(12), 2037–2041 (2006). https://
doi.org/10.1109/TPAMI.2006.244
5. Y. Aliyari, H. Moghaddam, A Face Recognition System Using Neural Networks with Incre-
mental Learning Ability (2007), pp. 291–296. https://doi.org/10.1109/CIRA.2007.382904
6. A. Azeem, M. Sharif, J. Shah, M. Raza, Hexagonal scale invariant feature transform (h-SIFT)
for facial feature extraction. J. Appl. Res. Technol. 13(3), 402–408 (2015). https://doi.org/10.
1016/j.jart.2015.07.006
7. K. Babu, K. Sony, N. Indira, K.V. Prasad, S. Shameem, An effective brain tumor detection from
t1w MR images using active contour segmentation techniques. J. Phys. Conf. Ser. 1804(1),
012174 (2021). https://doi.org/10.1088/1742-6596/1804/1/012174

8. J. Bao, D. Chen, F. Wen, H. Li, G. Hua, Cvae-gan: fine-grained image generation through
asymmetric training, in 2017 IEEE International Conference on Computer Vision (ICCV)
(2017), pp. 2764–2773. https://doi.org/10.1109/ICCV.2017.299
9. T. Barbu, Gabor filter-based face recognition technique, Proceedings of the Romanian
Academy-Series A: Mathematics, Physics, Technical Sciences, Information Science 11,
277–283 (2010)
10. P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class
specific linear projection. IEEE Trans. Pattern Anal. Mach. Intel. 19(7), 711–720 (1997).
https://doi.org/10.1109/34.598228
11. M. Bicego, U. Castellani, V. Murino, Using hidden markov models and wavelets for face
recognition, in 12th International Conference on Image Analysis and Processing, 2003 Pro-
ceedings (2003), pp. 52–56. https://doi.org/10.1109/ICIAP.2003.1234024
12. K. Bowyer, J.K. Chang, P. Flynn, A survey of approaches and challenges in 3d and multi-
modal 3d+2d face recognition. Comput. Vis. Image Underst. 101, 1–15 (2006). https://doi.
org/10.1016/j.cviu.2005.05.005
13. Q. Cao, L. Shen, W. Xie, O.M. Parkhi, A. Zisserman, Vggface2: a dataset for recognising
faces across pose and age, in 2018 13th IEEE International Conference on Automatic Face
Gesture Recognition (FG 2018) (2018), pp. 67–74. https://doi.org/10.1109/FG.2018.00020
14. Z. Cao, Q. Yin, X. Tang, J. Sun, Face recognition with learning-based descriptor, in 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2010), pp.
2707–2714. https://doi.org/10.1109/CVPR.2010.5539992
15. T.H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, Pcanet: a simple deep learning baseline for
image classification? IEEE Trans. Image Proces. 24(12), 5017–5032 (2015). https://doi.org/
10.1109/TIP.2015.2475625
16. D. Chen, X. Cao, F. Wen, J. Sun, Blessing of dimensionality: high-dimensional feature and
its efficient compression for face verification, in 2013 IEEE Conference on Computer Vision
and Pattern Recognition (2013), pp. 3025–3032. https://doi.org/10.1109/CVPR.2013.389
17. J.C. Chen, R. Ranjan, A. Kumar, C.H. Chen, V.M. Patel, R. Chellappa, An end-to-end sys-
tem for unconstrained face verification with deep convolutional neural networks, in 2015
IEEE International Conference on Computer Vision Workshop (ICCVW) (2015), pp. 360–
368. https://doi.org/10.1109/ICCVW.2015.55
18. N. Cherukuri, N.R. Bethapudi, V.S.K. Thotakura, P. Chitturi, C.Z. Basha, R.M. Mummidi,
Deep learning for lung cancer prediction using nscls patients ct information, in 2021 Interna-
tional Conference on Artificial Intelligence and Smart Systems (ICAIS) (2021), pp. 325–330.
https://doi.org/10.1109/ICAIS50930.2021.9395934
19. J.T. Chien, C.P. Liao, Maximum confidence hidden markov modeling for face recogni-
tion. IEEE Trans. Pattern Anal. Mach. Intel. 30(4), 606–616 (2008). https://doi.org/10.1109/
TPAMI.2007.70715
20. H. Cho, R. Roberts, B. Jung, O. Choi, S. Moon, An efficient hybrid face recognition algorithm
using pca and gabor wavelets. Int. J. Adv. Rob. Syst. 11 (2014)
21. J. Daugman, Two-dimensional spectral analysis of cortical receptive field profiles. Vis. Res.
20, 847–856 (1980)
22. W. Deng, J. Hu, J. Guo, Extended SRC: undersampled face recognition via intraclass variant
dictionary. IEEE Trans. Pattern Anal. Mach. Intel. 34(9), 1864–1870 (2012). https://doi.org/
10.1109/TPAMI.2012.30
23. W. Deng, J. Hu, J. Guo, Face recognition via collaborative representation: its discriminant
nature and superposed representation. IEEE Trans. Pattern Anal. Mach. Intel. 40(10), 2513–
2521 (2018). https://doi.org/10.1109/TPAMI.2017.2757923
24. W. Deng, J. Hu, J. Guo, Compressive binary patterns: Designing a robust binary face descriptor
with random-field eigenfilters. IEEE Trans. Pattern Anal. Mach. Intel. 41(3), 758–767 (2019).
https://doi.org/10.1109/TPAMI.2018.2800008
25. W. Deng, J. Hu, J. Guo, H. Zhang, C. Zhang, Comments on globally maximizing, locally min-
imizing: Unsupervised discriminant projection with application to face and palm biometrics.
IEEE Trans. Pattern Anal. Mach. Intel. 30(8), 1503–1504 (2008). https://doi.org/10.1109/
TPAMI.2007.70783

26. W. Deng, J. Hu, J. Lu, J. Guo, Transform-invariant pca: a unified approach to fully automatic
facealignment, representation, and recognition. IEEE Trans. Pattern Anal. Mach. Intel. 36(6),
1275–1284 (2014). https://doi.org/10.1109/TPAMI.2013.194
27. O. Déniz, M. Castrillón, M. Hernández, Face recognition using independent component anal-
ysis and support vector machines. Pattern Recogn. Lett. 24(13), 2153–2157 (2003). https://
doi.org/10.1016/s0167-8655(03)00081-3
28. C. Ding, D. Tao, Robust face recognition via multimodal deep face representation. IEEE
Trans. Multimedia 17(11), 2049–2058 (2015). https://doi.org/10.1109/TMM.2015.2477042
29. C.N. Duong, K.G. Quach, I. Jalata, N. Le, K. Luu, Mobiface: a lightweight deep learning
face recognition on mobile devices, in 2019 IEEE 10th International Conference on Bio-
metrics Theory, Applications and Systems (BTAS) (2019), pp. 1–6. https://doi.org/10.1109/
BTAS46853.2019.9185981
30. A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models
for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intel.
23(6), 643–660 (2001)
31. I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial nets, in Proceedings of the 27th International Conference
on Neural Information Processing Systems, vol. 2, pp. 2672–2680 (NIPS’14, MIT Press,
Cambridge, 2014)
32. G. Guo, S. Li, K. Chan, Face recognition by support vector machines, in Proceedings
Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat.
No. PR00580) (2000), pp. 196–201. https://doi.org/10.1109/AFGR.2000.840634
33. Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual under-
standing. Neurocomputing 187(C), 27–48 (2016). https://doi.org/10.1016/j.neucom.2015.09.
116
34. M. Hayat, S.H. Khan, N. Werghi, R. Goecke, Joint registration and representation learning for
unconstrained face identification, in 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017), pp. 1551–1560. https://doi.org/10.1109/CVPR.2017.169
35. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778.
https://doi.org/10.1109/CVPR.2016.90
36. X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using laplacianfaces. IEEE
Trans. Pattern Anal. Mach. Intel. 27(3), 328–340 (2005). https://doi.org/10.1109/TPAMI.
2005.55
37. G.B. Huang, M. Mattar, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for
studying face recognition in unconstrained environments, in Workshop on Faces in ’Real-Life’
Images: Detection, Alignment, and Recognition. Erik Learned-Miller and Andras Ferencz and
Frédéric Jurie, Marseille, France (2008), https://hal.inria.fr/inria-00321923
38. R. Huang, S. Zhang, T. Li, R. He, Beyond face rotation: global and local perception gan for
photorealistic and identity preserving frontal view synthesis, in 2017 IEEE International Con-
ference on Computer Vision (ICCV) (2017), pp. 2458–2467. https://doi.org/10.1109/ICCV.
2017.267
39. F.N. Iandola, M. Moskewicz, K. Ashraf, S. Han, W. Dally, K. Keutzer, Squeezenet: Alexnet-
level accuracy with 50x fewer parameters and <1mb model size. ArXiv abs/1602.07360
(2016)
40. M. Jaya Bhaskar, Y. Venkatesh, R. Sai Bhaskar Pranai, M. Rohith, Face recognition for
attendance management. Int. J. Emerg. Trends Eng. Res. 8(4), 964–968 (2020). https://doi.
org/10.30534/ijeter/2020/04842020
41. K. Jonsson, J. Kittler, Y.P Li, J. Matas, Support vector machines for face authentication. Image
Vis. Comput. 20(5–6), 369–375 (2002). https://doi.org/10.1016/s0262-8856(02)00009-4
42. T. Kathirvalavakumar, Jebakumari, J. Beulah Vasanthi, Face representation using combined
method of gabor filters, wavelet transformation and DCV and recognition using RBF. J. Intel.
Learn. Syst. Appl. 04(04), 266–273 (2012). https://doi.org/10.4236/jilsa.2012.44027

43. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386
44. Z. Lei, M. Pietikainen, S.Z. Li, Learning discriminant face descriptor. IEEE Trans. Pattern
Anal. Mach. Intel. 36(2), 289–302 (2014). https://doi.org/10.1109/TPAMI.2013.112
45. C. Liu, H. Wechsler, Gabor feature based classification using the enhanced fisher linear dis-
criminant model for face recognition. IEEE Trans. Image Proces. 11(4), 467–476 (2002).
https://doi.org/10.1109/TIP.2002.999679
46. J. Liu, Y. Deng, T. Bai, C. Huang, Targeting ultimate accuracy: face recognition via deep
embedding. CoRR abs/1506.07310 (2015), http://arxiv.org/abs/1506.07310
47. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, Sphereface: deep hypersphere embedding
for face recognition, in 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2017), pp. 6738–6746. https://doi.org/10.1109/CVPR.2017.713
48. X. Liu, T. Cheng, Video-based face recognition using adaptive hidden markov models, in
2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003
Proceedings, vol. 1 (2003), pp. I–I. https://doi.org/10.1109/CVPR.2003.1211373
49. A. Martinez, R. Benavente, The ar face database. Tech. Rep. 24 CVC Technical Report (1998)
50. B. Moghaddam, W. Wahid, A. Pentland, Beyond eigenfaces: probabilistic matching for face
recognition, in Proceedings Third IEEE International Conference on Automatic Face and
Gesture Recognition (1998), pp. 30–35. https://doi.org/10.1109/AFGR.1998.670921
51. M. Murtaza, M. Sharif, M. Raza, J. Shah, Face recognition using adaptive margin fisher's
criterion and linear discriminant analysis. Int. Arab J. Inform. Technol. 11, 149–158 (2014)
52. A. Nefian, M. Hayes, Face detection and recognition using hidden markov models 1, 141–145
(1998). https://doi.org/10.1109/ICIP.1998.723445
53. P. Nicholl, A. Amira, D. Bouchaffra, R.H. Perrott, A statistical multiresolution approach for
face recognition using structural hidden markov models. EURASIP J. Adv. Signal Process
2008 (2008). https://doi.org/10.1155/2008/675787
54. O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in Proceedings of the British
Machine Vision Conference (BMVC), ed. by M.W.J. Xianghua Xie, G.K.L. Tam (BMVA
Press, 2015), pp. 41.1–41.12. https://doi.org/10.5244/C.29.41
55. X. Peng, X. Yu, K. Sohn, D.N. Metaxas, M. Chandraker, Reconstruction-based disentangle-
ment for pose-invariant face recognition, in 2017 IEEE International Conference on Computer
Vision (ICCV) (2017), pp. 1632–1641. https://doi.org/10.1109/ICCV.2017.180
56. P.J. Phillips, H. Wechsler, J. Huang, P.J. Rauss, The feret database and evaluation procedure
for face-recognition algorithms. Image Vis. Comput. 16(5), 295–306 (1998), http://dblp.uni-
trier.de/db/journals/ivc/ivc16.html
57. Y. Qian, W. Deng, J. Hu, Task specific networks for identity and face variation, in 2018 13th
IEEE International Conference on Automatic Face Gesture Recognition (FG 2018) (2018),
pp. 271–277. https://doi.org/10.1109/FG.2018.00047
58. R. Ranjan, S. Sankaranarayanan, C.D. Castillo, R. Chellappa, An all-in-one convolutional
neural network for face analysis, in 2017 12th IEEE International Conference on Automatic
Face Gesture Recognition (FG 2017) (2017), pp. 17–24. https://doi.org/10.1109/FG.2017.
137
59. L. Rao, C. Harshitha, C.Z. Basha, N. Parveen, Hybrid computerized face recognition system
using bag of visual words and mlp-based bpnn, in 2020 4th International Conference on
Electronics, Communication and Aerospace Technology (ICECA) (2020), pp. 1113–1117.
https://doi.org/10.1109/ICECA49313.2020.9297499
60. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A.
Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge.
Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
61. T. Sakai, T. Kanade, M. Nagao, Y. Ichi Ohta, Picture processing system using a computer
complex. Comput. Gr. Image Proces. 2(3–4), 207–215 (1973). https://doi.org/10.1016/0146-
664x(73)90002-6
62. A. Salah, M. Bicego, L. Akarun, E. Grosso, M. Tistarelli, Hidden markov model-based face
recognition using selective attention - art. no. 649214, in Proceedings of SPIE—The Interna-
tional Society for Optical Engineering (2007). https://doi.org/10.1117/12.707333

63. F. Samaria, S. Young, Hmm-based architecture for face identification. Image Vis. Comput.
12(8), 537–543 (1994). https://doi.org/10.1016/0262-8856(94)90007-8
64. F. Schroff, D. Kalenichenko, J. Philbin, Facenet: a unified embedding for face recognition and
clustering, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015), pp. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
65. F. Shamrat, P. Ghosh, Z. Tasnim, A. Khan, M. Uddin, T. Chowdhury, Human face recognition
using eigenface, surf methods (2021). https://doi.org/10.1109/ICPCSN.2021.0908305
66. M. Sharif, S. Mohsin, M. Jamal, M. Javed, M. Raza, Face recognition for disguised variations
using gabor feature extraction. Aust. J. Basic Appl. Sci. 5, 1648–1656 (2011)
67. M. Sharif, S. Mohsin, M.Y. Javed, A survey: face recognition techniques. Res. J. Appl. Sci.
Eng. Technol. 4(23), 4979–4990 (2012)
68. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recog-
nition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556
69. L. Sirovich, M. Kirby, Low-dimensional procedure for the characterization of human faces.
J. Opt. Soc. Am. A, Opt. Image Sci. 4, 519–24 (1987). https://doi.org/10.1364/JOSAA.4.
000519
70. M. Slavkovic, D. Jevtic, Face recognition using eigenface approach. Serb. J. Electric. Eng. 9,
121–130 (2012). https://doi.org/10.2298/SJEE1201121S
71. R. Smith, J. Kittler, M. Hamouz, J. Illingworth, Face recognition using angular lda and svm
ensembles, in 18th International Conference on Pattern Recognition (ICPR’06), vol. 3 (2006),
pp. 1008–1012. https://doi.org/10.1109/ICPR.2006.529
72. D. Srihari, P. Kishore, K. Eepuri, D. Anil Kumar, T. Maddala, M. Prasad, R. Prasad, A four-
stream convnet based on spatial and depth flow for human action classification using rgb-d
data. Multimed. Tools Appl. 79 (2020). https://doi.org/10.1007/s11042-019-08588-9
73. Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: face recognition with very deep neural networks.
ArXiv abs/1502.00873 (2015)
74. Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-
verification, in Proceedings of the 27th International Conference on Neural Information Pro-
cessing Systems, vol. 2. NIPS 14 (MIT Press, Cambridge, 2014), pp. 1988–1996
75. Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in
2014 IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1891–1898.
https://doi.org/10.1109/CVPR.2014.244
76. Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust,
in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp.
2892–2900. https://doi.org/10.1109/CVPR.2015.7298907
77. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A.
Rabinovich, Going deeper with convolutions, in Computer Vision and Pattern Recognition
(CVPR) (2015), http://arxiv.org/abs/1409.4842
78. Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: closing the gap to human-level per-
formance in face verification, in 2014 IEEE Conference on Computer Vision and Pattern
Recognition (2014), pp. 1701–1708. https://doi.org/10.1109/CVPR.2014.220
79. Y. Tayal, P. Pandey, D.B.V. Singh, Face recognition using eigenface. Int. J. Emerg. Technol.
Comput. Appl. Sci. (IJETCAS) 3, 50–55 (2013)
80. M. Teja Kiran Kumar, P. Kishore, M. Prasad, Cnn-lstm hybrid model based human action
recognition with skeletal representation using joint movements based energy maps. Int. J.
Emerg. Trend. Eng. Res. 8(7), 3502–3508 (2020). https://doi.org/10.30534/ijeter/2020/100872020
81. P. Tumuluru, L. Burra, D. Bhavanidasari, C. Saibaba, B. Revathi, B. Venkateswarlu, A
Novel Privacy Preserving Biometric Authentication Scheme Using Polynomial Time Key
Algorithm in Cloud Computing (2021), pp. 1330–1335. https://doi.org/10.1109/ICAIS50930.
2021.9395964
82. M. Turk, A. Pentland, Face recognition using eigenfaces, in Proceedings of 1991 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition (1991), pp. 586–591.
https://doi.org/10.1109/CVPR.1991.139758
83. M. Turk, Eigenfaces and Beyond (Advanced Modeling and Methods, Face Processing, 2005)

84. A. Verma, T. Meenpal, B. Acharya, Human interaction recognition in videos with body pose
traversal analysis and pairwise interaction framework. IETE J. Res. 1–13 (2020). https://doi.
org/10.1080/03772063.2020.1802355
85. A. Verma, T. Meenpal, B. Acharya, Multiperson interaction recognition in images: a body
keypoint based feature image analysis. Comput. Intel. 37(1), 461–483 (2021). https://doi.org/
10.1111/coin.12419
86. A. Vinay, S. Vinay, K.N. Balasubramanya, S. Natarajan, Face recognition using gabor wavelet
features with PCA and KPCA–a comparative study. Procedia Comput. Sci. 57, 650–659
(2015). https://doi.org/10.1016/j.procs.2015.07.434
87. K.V. Vinitha, G.S. Kumar, Face recognition using probabilistic neural networks, in 2009
World Congress on Nature Biologically Inspired Computing (NaBIC) (2009), pp. 1388–1393.
https://doi.org/10.1109/NABIC.2009.5393716
88. J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse
representation. IEEE Trans. Pattern Anal. Mach. Intel. 31(2), 210–227 (2009). https://doi.
org/10.1109/TPAMI.2008.79
89. W. Wu, M. Kan, X. Liu, Y. Yang, S. Shan, X. Chen, Recursive spatial transformer (rest) for
alignment-free face recognition, in 2017 IEEE International Conference on Computer Vision
(ICCV) (2017), pp. 3792–3800. https://doi.org/10.1109/ICCV.2017.407
90. S. Yan, D. Xu, B. Zhang, H.J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions:
a general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intel.
29(1), 40–51 (2007). https://doi.org/10.1109/TPAMI.2007.250598
91. M. Yang, L. Zhang, Gabor feature based sparse representation for face recognition with gabor
occlusion dictionary. 6316, 448–461 (2010)
92. J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, J. Kim, Rotating your face using multi-task
deep neural network, in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2015), pp. 676–684. https://doi.org/10.1109/CVPR.2015.7298667
93. X. Yin, X. Liu, Multi-task convolutional neural network for pose-invariant face recogni-
tion. IEEE Trans. Image Proces. 27(2), 964–975 (2018). https://doi.org/10.1109/TIP.2017.
2765830
94. L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which
helps face recognition?, in 2011 International Conference on Computer Vision (2011), pp.
471–478. https://doi.org/10.1109/ICCV.2011.6126277
95. W. Zhang, S. Shan, W. Gao, X. Chen, H. Zhang, Local gabor binary pattern histogram sequence
(lgbphs): a novel non-statistical model for face representation and recognition, in Tenth IEEE
International Conference on Computer Vision (ICCV’05), vol. 1 (2005), pp. 786–791. https://
doi.org/10.1109/ICCV.2005.147
96. W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, Face recognition: a literature survey. ACM
Comput. Surv. 35(4), 399–458 (2003). https://doi.org/10.1145/954339.954342
97. Z. Zheng, J. Zhao, J. Yang, Gabor feature based face recognition using supervised locality
preserving projection, in Advanced Concepts for Intelligent Vision Systems (Springer, Berlin,
2006), pp. 644–653
98. Y. Zhong, J. Chen, B. Huang, Toward end-to-end face recognition through alignment learn-
ing. IEEE Signal Proces. Lett. 24(8), 1213–1217 (2017). https://doi.org/10.1109/LSP.2017.
2715076
99. E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of LFW benchmark
or not? CoRR abs/1501.04690 (2015). http://arxiv.org/abs/1501.04690
100. N. Zhu, Z. Yu, C. Kou, A new deep neural architecture search pipeline for face recognition.
IEEE Access 8, 91303–91310 (2020). https://doi.org/10.1109/ACCESS.2020.2994207
101. B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning. CoRR
abs/1611.01578 (2016). http://arxiv.org/abs/1611.01578
COVID-19 Time Series Prediction
and Lockdown Effectiveness

Rajdeep Biswas and Soumi Dutta

Abstract The COVID-19 pandemic reportedly originated at the wet market of Wuhan,
China, with a person's consumption of a wild animal that was already infected with
the disease. Since then, the virus has spread worldwide like wildfire and poses a
major threat to the entirety of the human species. Coronavirus causes respiratory
tract infections that can range from mild to lethal. This paper discusses the use of
data analysis and machine learning to draw on the growth patterns of previous
pandemics in general and on projects that specifically predict future scenarios of
COVID-19. It also compares and measures some of the present pandemic's short- and
long-span predictions against the corresponding real-world data observed during and
after the predicted span. It further attempts to analyze how effective the lockdown
has been across various countries and what India specifically must do to prevent a
catastrophic outcome.

Keywords COVID-19 · Coronavirus · Time series · Prediction · Lockdown

1 Introduction

Coronaviruses, from a broader perspective, are a group of related ribonucleic acid
(RNA) viruses. These viruses cause diseases in mammals and birds. In human beings,
they can cause respiratory tract infections that range from mild to lethal. At this
point, the way India has been affected by the second wave of the pandemic has made
it evident that there could have been much better ways to tackle it.
Primarily, the virus spreads between people in close contact. Small droplets produced
by coughing, sneezing, and talking are most often the carrier [1]. Generally, the
droplets fall onto the ground or surfaces rather than remaining airborne over long
distances. Transmission might also occur by smaller droplets that can stay suspended

R. Biswas
SAP Labs, Bengaluru, India
S. Dutta (B)
Institute of Engineering & Management, Kolkata, India
e-mail: soumi.dutta@iemcal.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 211
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_16
212 R. Biswas and S. Dutta

in the air for a longer duration. A person may also get infected by touching a contam-
inated surface and then touching their face [2]. The virus is most contagious during
the first three days after the onset of symptoms. Spread is also possible before
symptoms appear and from people who never show any symptoms; these are known as
asymptomatic carriers.
Many factors can influence the way a pandemic grows and evolves, including but
not limited to the various ways the virus itself mutates over time and geographical
area and how various populations interact with and spread the virus. Since these
factors are not exhaustive in nature, it is really hard to determine which of them
might be useful features for a machine learning model.
This paper aims to highlight how it is possible to train a univariate model and
still be able to accurately assess the short- and long-term outcomes of a pandemic of
this nature. The study was commenced in early 2020, taking records of the infection
rates from the beginning of the pandemic; the model was finalized and trained in the
month of June 2020, and the accuracy assessment continued well into the following
year, up until the submission of this paper in August 2021.

2 Historic Pandemic Patterns

A pandemic is defined as an epidemic of an infectious disease that has spread across a
very large area, such as several continents or even worldwide, affecting a considerable
number of people. There have been multiple pandemics of diseases such as smallpox and
tuberculosis. The most lethal pandemic in all of recorded history was the Black Death
(also referred to as The Plague), which took the lives of an estimated 75–200 million
people during the fourteenth century [3, 4]. Other significant pandemics include
the 1918 influenza (Spanish flu) pandemic, and current ongoing pandemics include
HIV/AIDS and COVID-19 [5, 6].
Generally, most pandemics follow an exponential growth pattern. Exponential growth
occurs when the instantaneous rate of change of a quantity with respect to time is
proportional to the quantity itself; expressed as a function, a quantity undergoing
exponential growth is an exponential function of time, i.e., time appears in the
exponent. Coronavirus, similarly, was expected to spread exponentially at first if no
artificial immunization was available, and the real-world outcome turned out to be
exactly so. Each infected person has the potential to infect multiple new people.
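As a minimal numerical sketch of this compounding (using an arbitrary illustrative growth rate, not one fitted to real data), a constant per-day growth rate produces exponential case counts with a fixed doubling time:

```python
import math

def cases(n0, growth_rate, days):
    """Cumulative cases after `days` steps when each day multiplies cases by (1 + growth_rate)."""
    return n0 * (1 + growth_rate) ** days

def doubling_time(growth_rate):
    """Days for the case count to double under the same compounding model."""
    return math.log(2) / math.log(1 + growth_rate)

# With 100 initial cases growing 20% per day (illustrative numbers only):
print(cases(100, 0.20, 10))   # ~619 cases after 10 days
print(doubling_time(0.20))    # cases double roughly every 3.8 days
```

The fixed doubling time is the defining property that makes early exponential spread so dangerous: the same multiplicative jump recurs on a constant schedule regardless of how large the count already is.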
This exponential nature of pandemic growth, and the capability of predicting its spread
by statistical and advanced mathematical methods from the area of machine learning, are
what this paper aims to discuss primarily. Figure 1 illustrates the different growth
behaviors by color: red denotes linear growth, blue cubic growth, and green exponential
growth.
The figure is an example of exponential growth (green) in comparison with other forms
of growth: linear (red) and cubic (blue) [7]. It was a matter of concern to the
World Health Organization (WHO) back in the month of March when the epidemic

Fig. 1 Sample plot of growth

(not yet a pandemic, as of then) had started to display the characteristics of expo-
nential growth. This drew substantial discussion, since pandemics are exceptionally
hard to control and contain once the numbers reach such extents [8]. To mitigate the
spread of the virus, nearly all countries, whether already affected or anticipating a
serious outbreak, started to implement what is now colloquially known as The Great
Lockdown [9], the effectiveness of which is the next major matter of discussion for
this paper.

3 Impact on the Economy

This discussion would be incomplete without mentioning the major ongoing global
economic recession that arose as a consequence of the COVID-19 pandemic. The first
significant sign of the recession was the stock market crash of February 20, 2020,
and on the 14th of April the International Monetary Fund (IMF) informed the public
that every one of the G7 countries had already descended, or was descending, into a
"deep recession" and that growth had already stunted markedly in countries with
emerging economies [10]. International Monetary Fund projections imply that this
recession will be the severest global economic slowdown since the Great Depression
and that it will be much worse than even the Great Recession of 2009.
The 2019–2020 COVID-19 pandemic is projected to have a substantial negative
effect on the economy worldwide, likely for years to come, with profound drops in
GDP accompanied by spikes in unemployment noted around the globe [11].
214 R. Biswas and S. Dutta

4 Proposed Methodology

It is expected that the COVID-19 pandemic, like any other pandemic, will follow an
exponential growth curve; at a minimum, a cubic curve can be expected. Several
traditional curve-fitting methods could be used for this, but to help our model learn
the best representation of the time series data at hand, it is worthwhile to train a
few machine learning algorithms. Since exponential growth is locally well approximated
by higher-order polynomials, it makes sense to use a polynomial of at least degree
three. We experimented with the degree of the polynomial: degrees three and five
both gave reasonable fits, and their mean, four, was the degree that was settled on.
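The degree-selection experiment can be sketched as follows; the exponential-like series here is synthetic stand-in data (an assumption), not the actual case counts used in the paper:

```python
import numpy as np

t = np.arange(1, 61, dtype=float)   # days since the first case (synthetic)
y = np.exp(0.08 * t)                # stand-in for cumulative case counts

def rms_error(degree: int) -> float:
    """In-sample RMS error of a least-squares polynomial fit of given degree."""
    coeffs = np.polyfit(t, y, degree)
    return float(np.sqrt(np.mean((y - np.polyval(coeffs, t)) ** 2)))

errors = {d: rms_error(d) for d in (3, 4, 5)}
```

Higher degrees can only shrink the in-sample residual, so the choice among three, four, and five is a judgment about over- versus under-fitting rather than raw error alone.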
In the first 20–30 days of the pandemic, the growth was very uneven, making it
difficult to assess whether the predictions being made were at all accurate. Hence,
there was no point in dividing the data into a training set and a test set. Instead,
we trained the models on the entire dataset available at hand and projected the
next 15 or 20 days of predictions from the trained model. After the said number of
days, we compared the real-world observations with the predictions, giving us a
fair enough metric. Even as of
the writing of this paper, for India, we have only 170 data points, i.e., 170 days'
worth of data since January 30th, the date on which the first COVID-19 confirmed case
was reported in India. Even considering other countries, we can have around 230
data points at most, which either way gives us a very constrained environment to discuss
solely in the context of COVID-19. The process that we conducted and some of the
results that we obtained are as follows. In Fig. 2, the x-axis represents the number of
days passed since detection of the first COVID-19 case in India, and y-axis depicts
the number of cases in lakhs. This is actual data, and no predictions are made yet
(data source: api.covid19india.org).
We are using a univariate supervised model with the input feature (independent
variable) being the number of days passed since the first case was confirmed in the
country and the label (dependent variable) being the daily number of cumulative

Fig. 2 Sample plot representing the relation between the number of days passed and the number of detections

cases. Note that common media, including internet visual infographics, use the daily
new number of cases as the y-axis parameter, but we avoid that due to the chaotic
nature of the representation, since we want to help our model learn the best possible
fit. Moreover, the exponential nature means that each point is the cumulation of the
last observed data point plus the new information, so it only makes sense for the
model to learn the best possible representation from the smoother cumulative series.
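The smoothing effect of using cumulative rather than daily-new counts can be seen directly (illustrative numbers, not real case data):

```python
import numpy as np

daily_new = np.array([5, 40, 3, 55, 10, 60, 8])  # chaotic day-to-day reports
cumulative = np.cumsum(daily_new)                # series the model is fit on

# The cumulative series is monotone non-decreasing, hence much smoother to fit.
```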
Perhaps the one central takeaway of this paper is that we will always prefer to
overestimate the pandemic rather than underestimate it. The reason is simply to
prepare for the worst-case scenario instead of always trying to be metrically
accurate. It would be nice to represent the pandemic closely if we used enough
features, but given the univariate nature of the model, we will accept overestimation
and strictly not underestimation. Overestimation should, however, be reasonable
enough so as not to represent something unrealistic.
We aim to fit this data to a polynomial regression model of degree 4. The
implementation that we chose is from Python's sklearn library (scikit-learn.org):
linear regression from the linear_model module, with the input features preprocessed
using the polynomial features model, since we required the features to be polynomial
in nature. The loss function used is the default regression loss, the mean squared
error, also known as quadratic loss, which the linear regression implementation
minimizes in closed form by ordinary least squares. The following is what happens when you
train the model with the first 40 days of the data and try to predict 170 days. In Fig. 3,
x-axis represents the number of days passed since the detection of the first COVID-19
case in India, whereas the y-axis denotes the number of cases in lakhs. Blue curve:
number of actual cases and red curve: number of cases the model estimates after
being trained with 40 days of data.
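A hedged sketch of the pipeline just described, with synthetic placeholder counts in place of the api.covid19india.org data: PolynomialFeatures of degree 4 feeding scikit-learn's LinearRegression, trained on the first 40 days and projected out to day 170.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

days = np.arange(1, 41, dtype=float).reshape(-1, 1)  # first 40 days
cases = 0.002 * days.ravel() ** 3                    # placeholder cumulative counts

poly = PolynomialFeatures(degree=4)
model = LinearRegression().fit(poly.fit_transform(days), cases)

horizon = np.arange(1, 171, dtype=float).reshape(-1, 1)
projection = model.predict(poly.transform(horizon))  # predictions out to day 170
```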
The model initially predicts the number of cases to remain really low (the red line),
but things in the real world start spiking around the 75-day mark. Further,
training the model with the first 60 days of data produces the following

Fig. 3 Plot of the total number of confirmed cases over the number of days passed, for India

Fig. 4 Blue curve: number of actual cases; red curve: number of cases the model estimates after being trained with 60 days of data

Fig. 5 X-axis: number of days passed since the detection of the first COVID-19 case in India. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; red curve: number of cases the model estimates after being trained with 75 days of data

plot (Fig. 4), where the x-axis again represents the number of days passed since the
detection of the first COVID-19 case in India and the y-axis the number of cases in
lakhs. In Fig. 4, we can note that the prediction gets closer to the real-life data,
but even the first 60 days (two months) are not enough to tell what we are looking
at in the case of a potential pandemic. The next figure shows the model trained with
the first 75 days of data.
Here, we get an interesting plot where our prediction actually overshoots the actual
figures that were later seen in the real world around the 100th day mark. This has
some intriguing implications; more on that further down in this section. The next
plot (Fig. 6) is obtained by training the same model with the first 90 days (nearly
half of the duration so far) of data.
This yet again has interesting implications, to be discussed in the summary below.
Finally, Fig. 7 shows the plot obtained with 120 days of data, i.e., about 70% of
all the data available to us. Now, the model has started to closely estimate what
the pandemic situation is going to look like in the near future. This has given us
the interesting insight that it is nearly impossible to tell which direction a
pandemic is going to go judging only by the first few weeks or even

Fig. 6 X-axis: number of days passed since detection of the first COVID-19 case in India. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; red curve: number of cases the model estimates after being trained with 90 days of data

Fig. 7 X-axis: number of days passed since detection of the first COVID-19 case in India. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; red curve: number of cases the model estimates after being trained with 120 days of data

three months of the duration. All of this growth is the organic growth of the virus,
unhindered by any human intervention; neither have any medical advances occurred
nor have proper social distancing norms been followed, leading to growth of this
nature in India.
About the overestimation with the 75-day training: it indeed looked like India had
been making significant improvement, but that soon changed when training was
extended to 90 days of observation, where the estimate was again overshot. This
tells us that India has been through a mix of improvement and deterioration,
followed by a steady incline in the curve that marks neither improvement nor the
other way.
Sticking to the linear regression model with polynomial features of degree 4 made
nearly accurate predictions of the
1. 50,000 mark to be on the 6th of May.
2. 100,000 mark to be on the 18th of May.
3. 500,000 mark to be on the 26th of June.
4. 1,000,000 mark to be on the 16th of July.
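Milestone dates like these can be read off a fitted curve by finding the first day on which the prediction crosses a threshold; a sketch (the `predict` callable below is a stand-in for the trained regressor, and the toy model is purely illustrative):

```python
import numpy as np

def first_crossing(predict, threshold: float, horizon: int = 400):
    """First day (1-indexed) on which the predicted curve reaches `threshold`,
    or None if it never does within the horizon."""
    days = np.arange(1, horizon + 1, dtype=float)
    hits = np.nonzero(predict(days) >= threshold)[0]
    return int(days[hits[0]]) if hits.size else None

toy_model = lambda d: 5.0 * d ** 2  # toy stand-in for the fitted model
```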

Fig. 8 X-axis: number of days passed since the detection of the first COVID-19 case in India. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; red curve: number of cases the model estimates with a tree-based algorithm

All of these predictions were made 10–15 days prior to the predicted date, largely
implying that this pandemic, too, is following a very predictable near-exponential
curve.
It is worth noting that the model was retrained, and its predictions recorded, on a
daily basis over the course of a year, and this still continues to date (as of
August 2021). The following plot (Fig. 8) was made a week prior to the finalized
submission of this paper, and it records roughly 550 days of the growth of the
pandemic (in blue). It is interesting to note that despite the model being trained
with only the initial 55 days of data (i.e., 10% of the total existing data), it
succeeds at closely depicting what the pandemic will look like well into the future
(in orange).
This highlights how less than two months of data can show the nature of the
pandemic one and a half years into the future, especially when trained with only
one feature, the total number of cases, which is a very easily obtainable metric.
The model clearly underestimates the pandemic's growth in the short term, but
medical infrastructure and government are expected to view this as a long-term
problem (over the course of years). Longer-term worst-case scenarios need to be
considered even when the curve seems to flatten a few months into the pandemic,
since it can just as easily spike again, unexpectedly, due to unknown factors.
Other popular time series prediction models were also tried and tested; none of
them significantly improved on the predictions of polynomial regression. A point of
note, however, is that tree-based algorithms such as random forest regression,
decision tree regression, and XGBoost are not useful for time series prediction.
These can be very good algorithms in other use cases, but they only perform well
when the test points lie within the upper and lower bounds of the training set,
making them unusable for time series prediction, the entire premise of which is to
predict points outside of the training set (i.e., in the future). Here is an example
of how a tree-based algorithm fails when fed with data of only the first 140 days:
notice how the prediction flattens out after the 140-day mark, since that region lies
beyond the training set (Fig. 8). This is just one demonstration of how tree-based
prediction algorithms fail to perform outside of their training domain; any further
discussion on this is beyond the scope of this paper.
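A minimal reproduction of the flattening effect (synthetic data; scikit-learn's DecisionTreeRegressor stands in for the tree-based models named above): every query beyond the training range falls into the same outermost leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

days = np.arange(1, 141, dtype=float).reshape(-1, 1)  # first 140 days
cases = days.ravel() ** 2                             # growing target

tree = DecisionTreeRegressor(random_state=0).fit(days, cases)

inside = tree.predict([[100.0]])[0]    # tracked within the training range
beyond_a = tree.predict([[150.0]])[0]  # both extrapolations hit the same
beyond_b = tree.predict([[300.0]])[0]  # outermost leaf, so the curve flattens
```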

5 Determining Lockdown Effectiveness

Since the COVID-19 pandemic grew exponentially, as presented above, the world
has witnessed an unprecedented surge in the need for health care. Even in countries
with some of the best health care systems, like Italy and the United States, the
primary problems nations faced were due to overcrowding at hospitals. Although mild
cases of COVID-19 require no hospital treatment, people got admitted to hospitals,
taking up beds, and this soon led to overcrowding. In the face of panic, even
individuals with only mild symptoms got themselves admitted, which left less care
for seriously ill but treatable individuals, and these countries saw a spike in the
death rate. Individuals with other medical issues, such as accident victims and
patients with other diseases, could not get health care on time either, and that
contributed to the death rate even more. Lack of public awareness did not help
decrease the contamination rate, and more and more individuals kept getting
infected to the point of requiring medical treatment.
“Flatten the curve” is a public health strategy to alleviate the spread of the virus
during the pandemic. The curve in question is the epidemic curve: a visual
representation of the number of infected people requiring health care, plotted over
time. During a pandemic, a health care system is prone to breaking down when the
number of people infected exceeds its capacity to treat them. Flattening the curve
means slowing the spread of the virus so that the peak number of people requiring
medical attention at any one time is minimized and the health care system does not
exceed its maximum capacity. Flattening the curve depends heavily on mitigation
techniques, especially social distancing.
Warnings about the potential risk of a pandemic were made repeatedly throughout
the 2000s and the 2010s by prominent international organizations, including the
World Health Organization (WHO) and the World Bank, especially after the 2002
SARS outbreak [12]. Governments, including those of the United States and France,
strengthened their health care capacities both before the 2009 swine flu pandemic
and during the years following it, but then weakened them again [13]. At the time
of the COVID-19 pandemic, health care systems in many countries were compelled to
function near their maximum capacities. In situations like these, when a sizable
new epidemic emerges, the number of infected and symptomatic individuals causes a
spike in the demand for health care that can only be predicted statistically, since
neither the onset of the epidemic nor its infectivity and lethality are known in
advance. If demand exceeds the capacity line in the infections-per-day curve, the
existing health care facilities cannot fully handle the surge of patients,
resulting in higher mortality rates than would occur had preparations been made [14].

Fig. 9 X-axis: number of days passed since the detection of the first COVID-19 case in Italy. Y-axis: number of cases in lakhs. This is actual data, and no predictions are made yet

A significant UK study showed that an unmitigated response to COVID-19 in the
UK could have required up to 46 times the available ICU bed capacity. The central
challenge of public health management is keeping the epidemic wave of additional
patients within the material and human health care resources that can be supplied
in medically adequate amounts [15].
Though the lockdown was advertised as serving the purpose of controlling the
pandemic, the real objective was only to slow it down. India recorded its first
infection as late as the end of January, by which time Italy had already been
massively affected. Even with one of the best health care systems in the world,
Italy suffered badly because of public unawareness, which led to overflowing
hospital beds and gave rise to the "flatten the curve" movement described above.
Italy is a good example of the disease spreading uncontrollably across an entire
country and taking a near-exponential shape. With some examples, it is possible to
visualize what the scenario would have looked like had the growth not been stunted
by interventions such as public awareness and a properly implemented lockdown,
which eventually helped flatten the curve and helped the hospitals bring down the
number of fatal cases. Figure 9 plots the number of total confirmed cases in Italy
over the number of days passed since the first confirmed case. We can note the
eventual flattening out of the curve, which corresponds to about 200 new cases per
day. The next plot (Fig. 10) shows the same model (as in the examples involving
India above) trained with only the first 100 days of data, and a prediction of what
the total number of cases would have looked like (the orange curve) over the same
time span as the present; the predicted numbers are much higher than they actually
turned out to be. This was made possible by the proper implementation of a
lockdown, which definitely involved cooperation from the public. A very similar
plot (Fig. 11) can be observed for Spain, which implemented lockdowns the correct
way and whose public responded with cooperation, leading to an effective flattening
of the curve.
On the contrary, the following is the exact same representation for the United
States of America. Following the observations for the first 100 days, the expected
curve actually undershoots the actual numbers (Fig. 12), meaning that the United

Fig. 10 X-axis: number of days passed since the detection of the first COVID-19 case in Italy. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; orange curve: number of cases the model estimates after being trained with 100 days of data

Fig. 11 X-axis: number of days passed since the detection of the first COVID-19 case in Spain. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; orange curve: number of cases the model estimates after being trained with 100 days of data

Fig. 12 X-axis: number of days passed since detection of the first COVID-19 case in the United States of America. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; orange curve: number of cases the model estimates after being trained with 100 days of data

States of America is actually doing even worse with its lockdown than it started
with. Contrary to popular belief, though, this did not have anything to do with the
growing Black Lives Matter protests, which were largely a consequence of the death
of George Floyd. Looking at India next, we have some obvious conclusions.

Fig. 13 X-axis: number of days passed since the detection of the first COVID-19 case in Italy. Y-axis: number of cases in lakhs. Blue curve: number of actual cases; orange curve: number of cases the model estimates after being trained with 150 days of data

The model observed (Fig. 13) only the first 100 days of data, yet it could predict
the coming situation nearly accurately. This implies that the implemented lockdowns
barely had any effect, even less so as they were eventually lifted on and off,
which is still going on as of today. India initially implemented a 21-day lockdown
not to control the pandemic but rather to buy the authorities some time to brace
for impact before the actual widespread transmission of the virus began. In a
country like India, mass spreading was inevitable, especially without any control
interventions. For a population of 1.3 billion, looking at the current statistics,
if 50% of the population gets infected, that is 650 million people, and if 0.5% of
that count dies, we are looking at about 3 million deaths, all without taking into
account the arrival of a vaccine or treatment.
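The back-of-envelope estimate above, checked numerically:

```python
population = 1_300_000_000        # India's population, as stated in the text
infected = population * 50 // 100  # 50% of the population infected
deaths = infected * 5 // 1000      # 0.5% fatality among the infected
# infected = 650 million; deaths = 3.25 million, i.e. "about 3 million deaths"
```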

6 Conclusion and Future Implications

All of this leads to the conclusion that the lockdowns implemented in India did
slow down the spread of the virus. India had the lead time to learn from other
countries, and the steps taken from those observations were effective in making the
climb less steep than in other countries. But India has to be aware that it is
still a steep climb nonetheless and will continue in the same exact fashion: the
number of cases will keep rising in a near-exponential fashion unless controlled by
public means of social distancing, and even more so without the intervention of
medical advances in the form of treatments and vaccinations. The government should
do better at containing this pandemic, the medical industry needs to be more
considerate in handling patients given the economy of the country, and the general
public needs to be more aware of the situation. If so, we might be looking toward a
better future. The means of attaining any of the above, however, are beyond the
scope of this paper.

References

1. A.K. Sahai, et al., ARIMA modelling & forecasting of COVID-19 in top five affected countries.
Diab. Metab. Synd. Clin. Res. Rev. 14(5), 1419–1427 (2020)
2. P. De Masques, Une Responsabilité Partagée Par Les Gouvernements. Public Senat
(2020). https://www.publicsenat.fr/article/politique/penurie-de-masques-une-responsabilite-
partagee-par-les-gouvernements-successifs. Accessed on June 20, 2021
3. W. Dawn Kopecki, B. Jr. Lovelace, World Health Organization declares the coronavirus
outbreak a global pandemic. CNBC (2020). https://www.cnbc.com/2020/03/11/who-declares-
the-coronavirus-outbreak-a-global-pandemic.html. Accessed on June 25, 2021
4. Imperial College London, Report 9—Impact of Non-pharmaceutical Interventions
(Npis) to Reduce COVID-19 Mortality and Healthcare Demand. Imperial College
London. https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/report-
9-impact-of-npis-on-covid-19/. Accessed on June 27, 2021
5. Mohammed, A.-Q., et al., Optimization method for forecasting confirmed cases of COVID-19
in China. J. Clin. Med. 9(3), 674 (2020)
6. N. Sharma, India’s swiftness in dealing with Covid-19 will decide the world’s future, says who.
Quartz. https://qz.com/india/1824041/who-says-indias-action-on-coronavirus-critical-for-the-
world/
7. D. Fanelli, F. Piazza, Analysis and forecast of covid-19 spreading in China, Italy and France.
Chaos, Solitons Fractals 134, 109761 (2020)
8. O.-D. Ilie, et al., Forecasting the spreading of COVID-19 across nine countries from Europe,
Asia, and the American continents using the ARIMA models. Microorganisms 8(8), 1158
(2020)
9. S.P. Stawicki, et al., The 2019–2020 novel coronavirus (severe acute respiratory syndrome
coronavirus 2) pandemic: a joint American college of academic international medicine-
world academic council of emergency medicine multidisciplinary COVID-19 working group
consensus paper. J. Glob. Infect. Diseases 12(2), 47 (2020)
10. C. Anastassopoulou, et al. Data-based analysis, modelling and forecasting of the COVID-19
outbreak. PloS One 15(3), e0230405 (2020)
11. F. Islam, Coronavirus recession not yet a depression. BBC News, BBC, Mar. 2020. https://
www.bbc.com/news/business-51984470. Accessed on June 20, 2021
12. Centers for Disease Control and Prevention, Covid-19, in Centers for Disease Control and
Prevention (2021). https://www.cdc.gov/media/dpk/diseases-and-conditions/coronavirus/cor
onavirus-2020.html. Accessed on June 25, 2021
13. B.C. Archibald, A.B. Koehler, Normalisation of seasonal factors in winters’ methods. Int. J.
Forecast. 19(1), 143–148 (2003)
14. A. Tarsitano, I.L. Amerise, Short-term load forecasting using a two-stage sari-max model.
Energy 133, 108–114 (2017)
15. V. Stadnytskyi, et al., The airborne lifetime of small speech droplets and their potential
importance in SARS-CoV-2 transmission. Proc. Natl. Acad. Sci. 117(22), 11875–11877 (2020)
Performance Evaluation
of Electrogastrogram (EGG) Signal
Compression for Telemedicine Using
Various Wavelet Transform

M. Gokul, M. Sameera Fathimal, S. Jothiraj, and Pradeep Murugesan

Abstract This paper discusses the recording and compression analysis of an Elec-
trogastrogram (EGG), a non-invasive instrument that visually represents the electrical
activity of the stomach to diagnose stomach illnesses. The EGG signal’s compression
is important in the diagnosis, prognosis, and survival analysis of all stomach-related
disorders, especially in telemedicine applications where the patient is geographically
isolated. Over the years, several signal compression algorithms have been presented.
High cost, signal degradation, and a low compression ratio are just a few drawbacks
that result in an inefficient signal at the receiver’s end. The advantages of EGG
compression in digital domain for telemedicine applications are effective utilization
of storage data, reduced data transmission rate, and efficient transmission band-
width. Various wavelet transformations such as biorthogonal, coiflet, daubechies,
haar, reverse biorthogonal, and symlet wavelet transforms are applied to EGG signals
and examined using MATLAB software in this paper. The wavelet’s performance
was evaluated to select the best wavelet for telemedicine. This is accomplished by
a quantitative analysis of the recovery ratio, percent root mean square difference
(PRD), and compression ratio (CR) measurements. The findings of this study in
terms of determining the optimal signal compression performance can undoubtedly
become a valuable asset in the telemedicine area for the transmission of quantitative
biological signals.

M. Gokul (B)
Department of Biomedical Engineering, School of Bio and Chemical Engineering, Kalasalingam
Academy of Research and Education, Krishnankoil, TN, India
M. Sameera Fathimal
Department of Biomedical Engineering, SRM University, Chennai, Tamilnadu, India
S. Jothiraj
Department of Biomedical Engineering, Rajalakshmi Engineering College, Chennai, Tamilnadu,
India
P. Murugesan
Department of Biomedical Engineering, Faculty of Engineering, Vrije University, Brussel,
Belgium

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 225
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_17

Keywords Compression · Electrogastrogram (EGG) · Non-invasive · Telemedicine and wavelet transform

1 Introduction

Medical informatics intersects computer, information, and health science. It
involves optimizing the information, equipment, and methods needed to obtain,
store, retrieve, and utilize information in healthcare and telemedicine
applications. This study builds on preliminary research that has already found the
best order for the wavelets; here, the performance of that chosen order is
analyzed [1].
Basically, medical information is composed of physiological data that have become
an essential part of the diagnostic and therapeutic phases. The physiological data
obtained from the human body are either images or signals that provide diagnostic
information. These data can be stored, updated, retrieved, and transmitted from
one place to another. Transmission of signals within healthcare sectors or
organizations raises difficulties of data storage and archival [2]. Therefore,
compression techniques are significant whenever enormous quantities of data are
transferred. If an EGG signal is sampled at 180 samples per second (180 Hz) and
each sample is coded on 12 bits, then 24 h of information will amount to about
22.8 MB, a figure that grows rapidly with the length of the recording. So,
compression presents a low-cost alternative to the repeated upgrading of storage
capacities and lines of communication [3]. Compression maximizes the quantity of
information attainable for online healthcare access and is crucial in telemedicine
applications for transmitting images or signals to compare or analyze particular
results in detail. Digital data sharing (telemedicine) in biomedical research
revolutionizes certain methods and allows researchers to quickly and remotely
analyze preliminary examinations. The advantage of this compression technique is
efficient transmission of information by decreasing the size of the data to be
stored. This paper concentrates only on EGG signal compression, which is much the
same for all types of bio-signal compression.
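The storage figure can be checked with simple arithmetic (which megabyte convention the authors used is an assumption; decimal megabytes are shown here):

```python
sampling_rate_hz = 180
bits_per_sample = 12
seconds_per_day = 24 * 3600

bits_per_day = sampling_rate_hz * bits_per_sample * seconds_per_day
megabytes = bits_per_day / 8 / 1e6  # about 23.3 decimal MB, in the ballpark
                                    # of the ~22.8 MB figure quoted in the text
```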
Many people around the world suffer from gastric disorders such as peptic ulcer
disease, gastroparesis, indigestion, gastritis, gastric emptying, nausea, motion
sickness, and chronic mesenteric ischemia. The Electrogastrogram technique is
helpful for achieving a non-invasive stomach diagnosis, and the signal can be
acquired easily. The acquired EGG signals are therefore tested with various wavelet
transforms, namely biorthogonal, Coiflet, Daubechies, Haar, reverse biorthogonal,
and Symlet, using MATLAB software to find the most efficient compression
performance. The objective of this study is to identify the best-performing wavelet
for compression in telemedicine.

2 Materials and Methods

The EGG acquired with surface electrode was amplified, filtered, and converted to a
digital signal. The performance of the wavelet transform for data compression was
analyzed using the MATLAB software.

2.1 Electrogastrogram Acquisition

The accurate placement of the electrodes on the abdominal wall is an important
factor in obtaining a good quality EGG. The electrode positions and acquisition
process mentioned in [4, 5] are followed in this study to acquire good quality
EGGs. The terminals of the electrodes are connected to the signal conditioning
unit, which consists of an amplification part (amplifying the 100–500 microvolt
signal) and a filtering part (passing 0.02–0.20 Hz, covering bradygastria
(0.02 Hz), the typical range (0.05 Hz), tachygastria (0.11 Hz), and
duodenal/breathing activity (0.20 Hz)).
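A software sketch of the 0.02–0.20 Hz band selection (not the hardware filter used in the paper): an ideal FFT band-pass applied to a synthetic signal, with an assumed 2 Hz sampling rate, since the actual rate is not stated here.

```python
import numpy as np

fs = 2.0                              # assumed sampling rate (Hz)
t = np.arange(0, 1200) / fs           # ten minutes of samples
# Synthetic "EGG": a 0.05 Hz typical-range wave plus 0.6 Hz out-of-band noise
raw = np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.6 * t)

def bandpass(x, fs, lo=0.02, hi=0.20):
    """Ideal FFT band-pass: zero every frequency bin outside [lo, hi] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

filtered = bandpass(raw, fs)          # only the 0.05 Hz component survives
```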
Then the signal conditioning unit output is given to the analog to digital converter
(ADC) to digitize the acquired signal, which is given to PC for further analysis.
Usually, the analyzed signal is transmitted to the respective department of interest
or the doctor’s end. Here, the signal is given to MATLAB installed in PC to analyze
the compression performance in wavelet transforms. The simplified process of EGG
data acquisition is shown in pictorial representation (see Fig. 1). The EGG input
digital signal before decomposition and compression is shown (see Fig. 2).

Surface electrode (sensor) → Instrumentation amplifier (amplifies 100–500 microvolts) → Bandpass filter (filters 0.02–0.20 Hz) → Analog-to-digital converter → MATLAB analysis (at PC/laptop)

Fig. 1 Pictorial representation of EGG data acquisition



Fig. 2 Input EGG signal

2.2 Wavelet Transform Performance Analysis

Wavelets are well suited to processing biological signals. The wavelet transform outperforms Discrete Cosine Transform (DCT) compression because its non-uniform frequency spectral characteristics promote multi-scale analysis, multiresolution properties, reduced distortion, and sophisticated compression strategies. Additionally, other transforms such as the Fourier Transform have drawbacks when processing bio-signals, notably a complete loss of time information [6]. In the Fourier Transform, the frequency axis is uniformly divided, and the resolution can be made accurate only by integrating along the whole time axis. In the Short-Time Fourier Transform (STFT), information in the time domain is taken into consideration by the addition of a window; the frequency resolution then depends on the time resolution and the size of the window chosen. Due to the uniformity of this window, it is not possible to emphasize a particular frequency range in STFT. The wavelet, however, can capture useful and effective details of a bio-signal through localized basis functions [7]. Efficient time and frequency localization is obtained when the signal is represented as a function of time. Compared with the Fourier transform, the wavelet method has greater flexibility and allows the characterization to be chosen for a specific biomedical application. The Discrete Wavelet Transform (DWT) is particularly powerful because it yields adequate information to evaluate and reconstruct a bio-signal. The DWT is calculated using filter banks, in which filters with different cutoff frequencies analyze the signal at different scales; the resolution is altered by upsampling and downsampling the signal as it passes through the high-pass and low-pass filters [6, 7].
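As an illustration of the filterbank idea just described, the sketch below implements one level of the Haar DWT in pure Python. This is only an illustrative stand-in for the MATLAB wavelet toolbox the study actually used: the lowpass/highpass filter pair is applied with downsampling by two, and the inverse step upsamples and filters to reconstruct the signal.

```python
import math

def haar_dwt_level(signal):
    """One level of the Haar DWT: lowpass/highpass filtering
    followed by downsampling by two (even-length input assumed)."""
    s = 1 / math.sqrt(2)  # orthonormal Haar filter coefficient
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt_level(approx, detail):
    """Inverse step: upsample and filter to reconstruct the signal."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out

x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]  # toy signal
cA, cD = haar_dwt_level(x)
# perfect reconstruction from approximation + detail coefficients
assert all(abs(u - v) < 1e-12 for u, v in zip(haar_idwt_level(cA, cD), x))
```

Repeating the decomposition on the approximation coefficients `cA` gives the multi-level analysis described above.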

2.3 Performance Measurements

Wavelet transforms such as the Haar, biorthogonal, Daubechies, coiflet, reverse biorthogonal, and symlet wavelets are applied to the acquired EGG signal to examine which gives the best compression performance for telemedicine. Three major parameters are measured on the compressed signal to compare and analyze compression performance: the percentage root mean difference (PRD), the compression ratio (CR), and the recovery ratio. The entire process of this analysis includes signal decomposition, compression, transmission, reconstruction, and analysis of these three parameters (see Fig. 3).

Wireless Transmission

Fig. 3 Block diagram of the proposed study

Percentage root mean difference (PRD) is a quantity that measures the distortion between the reconstructed and the original input signal. The PRD establishes a point-wise comparison with the original data, providing a measure of reconstruction fidelity [8, 9]. The mathematical formula for PRD computation is shown in Eq. (1).

PRD = √( Σ_{n=1}^{N} [x(n) − X(n)]² / Σ_{n=1}^{N} x(n)² ) × 100    (1)

where x(n) is the original signal and X(n) is the reconstructed signal.

Compressed data should retain good fidelity even when the compression ratio is high [8]. The mathematical formula for CR is shown below (see Eq. 2).

CR = B(original) / B(compressed)    (2)

where B(original) is the bit rate of the original signal and B(compressed) is the bit rate of the compressed signal.
The recovery ratio parameter indicates how well the signal can be recovered for a given decomposition level: a very high decomposition level gives a low recovery ratio, and vice versa.
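The two closed-form measures, Eqs. (1) and (2), can be sketched directly in code. The Python example below is illustrative only; the signal values and bit rates are made-up toy numbers, not data from this study.

```python
import math

def prd(original, reconstructed):
    """Percentage root mean difference (Eq. 1): point-wise distortion
    between the reconstructed and original signal."""
    num = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    den = sum(x ** 2 for x in original)
    return math.sqrt(num / den) * 100

def compression_ratio(bits_original, bits_compressed):
    """Compression ratio (Eq. 2): CR = B(original) / B(compressed)."""
    return bits_original / bits_compressed

x  = [2.0, 4.0, 6.0, 8.0]
xr = [2.0, 4.1, 5.9, 8.0]            # slightly distorted reconstruction
print(round(prd(x, xr), 3))          # 1.291 -> small PRD, low distortion
print(compression_ratio(16000, 800)) # 20.0
```

A PRD of zero means perfect reconstruction; the near-zero PRD values in Table 1 indicate almost lossless recovery.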
Signal decomposition involves downsampling and filtering. The decomposed signal consists of a detail part (high frequency) and an approximation part (low frequency). In agreement with the Nyquist principle, the sub-signal generated by the lowpass filter has a maximum frequency equal to half that of the input signal [10], so the signal can be perfectly reconstructed from only half of the originally stored and transmitted samples. Downsampling, when employed, eliminates every second sample. The lowpass-filtered approximation sub-signal is then passed through the filter pair again, and the process is continued until the desired degree of decomposition is obtained [11].
The objective of signal compression is to minimize the number of bits in the data while storing the data with acceptable quality. Even though numerous compression techniques have been suggested in the literature, few aim to perfectly reconstruct the original data [12]. When an application needs only limited bit rates, methods that enable a supervised loss of information can be used. The loss is very small, and such a "lossy" method combines high compression ratios with acceptable visual quality. Accordingly, the compression ratio, percentage root mean difference, and recovery ratio are calculated [13]. In the DWT, the reconstruction process consists of upsampling and filtering, and filter selection is essential for the reconstructed signal to be faithful [12, 14]. The downsampling of the biological signal performed during the decomposition phase produces an artifact or distortion called aliasing. Aliasing therefore becomes a key factor when selecting decomposition and reconstruction filters that are closest (though not necessarily identical) so that the effects of aliasing cancel [14].
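One common way to realize such a supervised ("lossy") loss — the paper does not spell out its exact scheme, so this is an assumed but standard approach — is to discard detail coefficients whose magnitude falls below a threshold. The sketch below uses hypothetical coefficient values.

```python
def threshold_compress(coeffs, thresh):
    """Lossy step: zero out coefficients whose magnitude is below the
    threshold; only the surviving values need to be stored/transmitted."""
    kept = [c if abs(c) >= thresh else 0.0 for c in coeffs]
    nonzero = sum(1 for c in kept if c != 0.0)
    discarded_pct = 100.0 * (1 - nonzero / len(coeffs))
    return kept, discarded_pct

# hypothetical detail coefficients from a wavelet decomposition
detail = [0.9, -0.02, 0.01, 1.5, -0.03, 0.0, 2.2, 0.04]
kept, discarded_pct = threshold_compress(detail, 0.1)
print(kept)           # [0.9, 0.0, 0.0, 1.5, 0.0, 0.0, 2.2, 0.0]
print(discarded_pct)  # 62.5
```

Because EGG energy concentrates in a few large coefficients, most detail coefficients are near zero and can be discarded with very little reconstruction error, which is what drives the high CR and tiny PRD values reported later.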

2.4 Wavelet Type and Its Performance

Orthogonality characteristics and multiresolution analysis (MRA) are required to construct the base wavelet. An orthogonal wavelet describes the information contained in an image and leads to the creation of a multiresolution analysis [15, 16]. The spline method is used to construct the base wavelet [17, 18].
All the wavelets used here have orthogonal properties. The biorthogonal and Haar wavelets are symmetrical, the coiflet and symlet wavelets are near-symmetric, and the Daubechies wavelets are asymmetric. The input signal processed with the biorthogonal (left) and reverse biorthogonal (right) wavelets is shown with a clear view of the original EGG signal, compressed signal, reconstructed signal, and error signal (see Fig. 4). The Haar and Daubechies (see Fig. 5) wavelet transforms are applied to the EGG signal, the signal is compressed and reconstructed, and the error signal is computed [19]. The input signal is also processed with the coiflet and symlet wavelet transforms (see Fig. 6) [20, 21].

Fig. 4 EGG signal processing using biorthogonal and reverse biorthogonal wavelet transform

Fig. 5 EGG signal processing using Haar and Daubechies wavelet transforms

Fig. 6 EGG signal processing using coiflet and symlet wavelet transform

3 Results and Discussion

Results obtained from the different wavelet transforms on various sets of data are tabulated (see Table 1). For good transmission, the PRD should be low. Among the wavelets tested, the Haar wavelet has a very low PRD [8]. But comparing the corresponding

Table 1 Performance analysis with various wavelet transforms

Wavelets                 CR (%)    PRD          Recovery ratio
Haar                     98.5371   6.1854e−15   67.8443
Symlet                   98.5371   8.3502e−13   67.8443
Daubechies               98.5271   1.8939e−11   67.7987
Coiflet                  98.5540   2.6572e−08   67.4655
Biorthogonal             98.5492   1.4608e−12   67.6211
Reverse biorthogonal     98.5540   1.4701e−12   67.4655

CR values, the coiflet and reverse biorthogonal wavelets have a better CR than the other transforms. In terms of overall performance, the reverse biorthogonal wavelet performs well for compression, giving a better CR with a low PRD.
After finding the best wavelet for transmission, the most suitable order of that wavelet should be chosen to achieve the lowest reconstruction error. To find the lowest reconstruction error, MRA is applied [22]. The order with the lowest average error for EGG signal processing is 1.3 of the reverse biorthogonal wavelet: the average error was calculated at all levels of the reverse biorthogonal wavelet, and rbior 1.3 (reverse biorthogonal 1.3) has the lowest error rate.
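The selection described above can be reproduced from Table 1 with a simple ranking heuristic. The criterion below (maximize CR, break ties by the lower PRD) is our illustrative reading of the authors' reasoning, not a procedure they state explicitly; the numbers are transcribed from Table 1.

```python
# Values transcribed from Table 1 of this study.
results = {
    "Haar":                 {"CR": 98.5371, "PRD": 6.1854e-15, "RR": 67.8443},
    "Symlet":               {"CR": 98.5371, "PRD": 8.3502e-13, "RR": 67.8443},
    "Daubechies":           {"CR": 98.5271, "PRD": 1.8939e-11, "RR": 67.7987},
    "Coiflet":              {"CR": 98.5540, "PRD": 2.6572e-08, "RR": 67.4655},
    "Biorthogonal":         {"CR": 98.5492, "PRD": 1.4608e-12, "RR": 67.6211},
    "Reverse biorthogonal": {"CR": 98.5540, "PRD": 1.4701e-12, "RR": 67.4655},
}

# Lowest distortion on its own, and the CR-then-PRD combined ranking.
best_prd = min(results, key=lambda w: results[w]["PRD"])
best_overall = min(results, key=lambda w: (-results[w]["CR"], results[w]["PRD"]))
print(best_prd)      # Haar
print(best_overall)  # Reverse biorthogonal
```

The combined ranking agrees with the paper's conclusion: coiflet ties reverse biorthogonal on CR but loses on PRD by four orders of magnitude.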

4 Conclusion

In line with the principal objective of this paper, the best-performing wavelet for compression and the most suitable order for the lowest reconstruction error were found. The results obtained show that the reverse biorthogonal wavelet is the most suitable for the Electrogastrogram in telemedicine. By incorporating this wavelet into the compression technique, faster and less expensive transmissions become possible on a daily basis. Future work on this research is to apply these wavelets to all kinds of biological signals and to add further analysis features, such as automatic diagnosis at the receiver end. These advancements could be very useful in the fields of gastroenterology and telemedicine when the subject is affected by a severe gastric illness such as stomach cancer, ulcers, etc.

References

1. M. Gokul, P. Murugesan, Choice of wavelets for Electrogastrogram (EGG). Acta Sci.


Gastrointest. Disord. 3(12), 01–03 (2020)
2. L. Tarassenko, C. Peggram, P. Hayton, O. Gibson, A. George, J. Wheeler (e-San Ltd), Telemedicine system. U.S. Patent Application 10/528,365 (2006)
3. R. Wootton, J. Craig, V. Patterson (eds.), Introduction to Telemedicine, 2nd edn. (CRC Press,
2006). https://doi.org/10.1201/9781315272924
4. M. Gokul, N. Durgadevi, R. Bhuvaneshwari, V.R.S. Vadivu, C.R. Kumar, Rehabilitation tool for gastroparesis by the analysis of interstitial cells of Cajal (the external gastric pacemaker with a feedback of gastric potential). J. Gastrointest. Dig. Syst. 8(557), 2 (2018)
5. M. Gokul, P. Murugesan, M. Harshavardhan, Fast Fourier transform (FFT) based electrogastrogram (EGG) analysis under water load test (WLT). Eur. J. Pharm. Med. Res. 7(9), 632–638 (2020)
6. R.X. Gao, R. Yan, Wavelets: Theory and Applications for Manufacturing (Springer Science &
Business Media, 2010). ISBN 978 1 4419 1544 3
7. I. Daubechies, Ten lectures on wavelets. Soc. Indus. Appl. Math. (1992) ISBN: 978-0-89871-
274-2
8. R. Javaid, R. Besar, F.S. Abas, Performance evaluation of percent root mean square difference for ECG signals compression. Signal Process. Int. J. (SPIJ) 48, 1–9 (2018)

9. C.L. Tseng, C.C. Hsiao, I.C. Chou, C.J. Hsu, Y.J. Chang, R.G. Lee, Design and implementation
of ECG compression algorithm with controllable percent root-mean-square difference. Biomed.
Eng. Appl. Basis Commun. 19(04), 259–268 (2007)
10. J. Kevric, A. Subasi, Comparison of signal decomposition methods in classification of EEG
signals for motor-imagery BCI system. Biomed. Signal Process. Control 31, 398–406 (2017)
11. A. Cicone, J. Liu, H. Zhou, Adaptive local iterative filtering for signal decomposition and
instantaneous frequency analysis. Appl. Comput. Harmon. Anal. 41(2), 384–411 (2016)
12. M. Gokul, N. Durgadevi, B. Sukita, Medical product aspects of antenatal wellbeing belt—the
consolidated analysis on product design and specification. World J. Pharm. Res. 9 (2020)
13. S.O. Rajankar, S.N. Talbar, An electrocardiogram signal compression technique: a compre-
hensive review. Analog Integr. Circ. Sig. Process 98(1), 59–74 (2019)
14. C.A.E. Kothe, T.P. Jung, Artifact removal techniques with signal reconstruction. U.S. Patent
Application 14/895,440 (2016)
15. A. Sake, R. Tirumala, Bi-orthogonal wavelet transform based video watermarking using
optimization techniques. Mater. Today: Proc. 5(1), 1470–1477 (2018)
16. P.M.K. Prasad, D.Y.V. Prasad, G.S. Rao, Performance analysis of orthogonal and biorthogonal
wavelets for edge detection of X-ray images. Procedia Comput. Sci. 87, 116–121 (2016)
17. M. Sharma, A. Dhere, R.B. Pachori, U.R. Acharya, An automatic detection of focal EEG signals
using new class of time–frequency localized orthogonal wavelet filter banks. Knowl.-Based
Syst. 118, 217–227 (2017)
18. K. Mourad, B.R. Fethi, Efficient automatic detection of QRS complexes in ECG signal based on
reverse biorthogonal wavelet decomposition and nonlinear filtering. Measurement 94, 663–670
(2016)
19. D. Zhang, Wavelet transform, in Fundamentals of image data mining (Springer, Cham, 2019),
pp. 35–44
20. A. Zaeni, T. Kasnalestari, U. Khayam, Application of wavelet transformation symlet type and
coiflet type for partial discharge signals denoising, in 2018 5th International Conference on
Electric Vehicular Technology (ICEVT). IEEE (2018), pp. 78–82
21. P.P.S. Saputra, R. Firmansyah, D. Irawan, Various and multilevel of coiflet discrete wavelet
transform and quadratic discriminant analysis for classification misalignment on three phase
induction motor. J. Phys. Conf. Ser. 1367(1), 012049 (2019)
22. A. Gudigar, U. Raghavendra, T.R. San, E.J. Ciaccio, U.R. Acharya, Application of multireso-
lution analysis for automated detection of brain abnormality using MR images: a comparative
study. Futur. Gener. Comput. Syst. 90, 359–367 (2019)
The Impact of UV-C Treatment on Fruits
and Vegetables for Quality and Shelf Life
Improvement Using Internet of Things

N. Sneha and Bhagya M. Patil

Abstract The objective of this research is to develop a device that helps sanitize consumable goods and materials, making them free from bacteria and viruses. Materials brought from outside are exposed to UV-C rays (254 nm) for a duration of 5–15 min so that they can be used fearlessly, without altering the quality of the goods. The device consists of a UV-C light installed inside a chamber with a calibrated UV-C dosage. The type of UV used for this purpose is Type-3 UV-C, which is effective against microorganisms, including new viruses. The items to be disinfected are carried into the chamber on a specialized tray that allows complete penetration of the UV-C light, or by placing the items directly; the process begins with scanning and sanitization of the items, covering the whole surface area through 360°. In this study, short UV-C treatment of whole tomato and banana and of freshly sliced apple, guava, and grapes was chosen, and its efficiency was assessed over various time durations with multiple methods to check for degradation of nutrient levels. The sliced fruits treated with UV-C (254 nm) showed lower oxidation relative to the control, improving fruit quality by increasing shelf life. The results show the comparative changes when treated with 5–15 min of the 254 nm UV-C dose, which was most important in retaining quality. Subsequent experiments with standardized High-Performance Liquid Chromatography (HPLC), Bradford assay, and dinitrosalicylic acid assay methods were used to test changes in nutrient levels, compared across fruits treated for 5, 10, and 15 min. The test results showed that the device inactivates the microorganisms on the object's surface and proved that it does not deteriorate the nutrient levels.

Keywords UV-C light · Internet of Things · HPLC

N. Sneha (B) · B. M. Patil


School of Computer Science and Applications, REVA University, Bangalore, Karnataka, India
e-mail: sneha.n@reva.edu.in
B. M. Patil
e-mail: bhagyam.patil@reva.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 235
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_18

1 Introduction

During the current pandemic situation of the COVID-19 outbreak, all of us know the precautions we take for ourselves and our surroundings. The safest practices are applying sanitizer to our hands and wearing a mask to keep us safe from the virus spreading in and around us. Another critical concern during this period is safeguarding against materials brought from outside, such as groceries and items from marketplaces and shops, since they may carry the virus or bacteria spreading all over the world like wildfire or a sandstorm. People stopped going out, even for basic necessities, and ran their lives in fear for many days or even months, from March till May 2020. The spread of COVID-19 all over the country caused fear; a few people even went into depression for fear of touching any objects around them. According to research, the virus can last up to 48 h on surfaces such as masks, money, bills, mobiles, clothes, and packages, so an individual needs to be careful in contact with these materials. Sanitization of the materials [1] is required before they are used in daily activities at home, in the office, or outside.

1.1 Types of UV Radiation

Ultraviolet is a portion of the electromagnetic spectrum. The spectrum [2] consists of light of various wavelengths, including visible light, X-rays, infrared rays, gamma rays, etc. UV radiation falls in the shorter-wavelength, higher-frequency region, usually ranging between 400 and 10 nm; it is very harmful to human beings and is not visible to the human eye. UV radiation is generally divided into three types, UV-A, UV-B, and UV-C, classified based on wavelength and biological impact on the atmosphere. The division of the spectrum is shown in Fig. 1 [3].
• UV-A (315–399 nm)
• UV-B (280–315 nm)
• UV-C (100–279 nm).

Fig. 1 Types of UVC [3]



1.2 About IoT and Its Applications in Food Processing Industries

The need for food for humans and animals is increasing drastically, and there is a scarcity of food. Yet food wastage has not been stopped in many locations, events, and parts of the food industry [4]. Food wastage during processing should be managed well so that food can serve a few more living beings. The aim of food processing is to turn raw materials into value-added food for human and animal consumption. Food processing should ensure food safety for safe use. The preparation and processing of food also include drying, cooling, and salting, which guarantee the shelf life of perishable food products. Some microbes, such as yeast and bacteria, are used to increase the quality of food products by enhancing their taste, flavor, and texture. This accounts for an improvement in the food item's value. All these activities include automated processes that keep the system working smoothly to complete food processing tasks using technologies such as IoT, machine learning [5], and image processing [6]. These technologies include identifying fruit and vegetable conditions such as ripe, rotten, or raw, the percentage of microorganisms on the surface, etc.

2 Literature Survey

Koutchma et al. [7] give a comparative analysis of 92 studies published between 2004 and 2015 on ultraviolet (UV) light and high-pressure processing (HPP). One of the limitations to their more extensive commercial application lies in the lack of comparative data on effects on nutritional and quality-related compounds in juice products. Minimal-processing nonthermal techniques such as UV light and HPP are expected to extend shelf life while retaining physicochemical, nutritional, and sensory characteristics with reduced microbial loads. Moreno et al. [8] used short UV-C treatment for fresh-cut carambola. The experiment was carried out on fruit kept in a controlled environment and on UV-C-exposed fruit. The test was carried out on the first day and after 21 days, when the nutritional values were checked. They proved that UV-C exposure reduces the yeast and bacterial counts and controls spoilage. Prestorage UV-C exposure was highly effective in controlling fruit browning through Polyphenol Oxidase (PPO) inhibition and improved maintenance of tissue integrity.
Yang et al. [9] proposed a model used at hospitals to disinfect or sterilize a patient room before a new patient is admitted. They conducted a study to find the effectiveness of a mobile, automated device, the Hyperlight Disinfection Robot, which utilized UV-C to kill Pseudomonas aeruginosa, Acinetobacter baumannii, etc. They observed a significant reduction in the bacteria count after UV-C exposure in an uncleaned hospital room, also indicating a feasible role against the activities of cell-wall-degrading organisms that affect fruit and vegetable firmness. Weaver et al. [10] provide details on the sterilization of surgical masks and N95 respirators, giving a procedure for sterilizing N95 respirators in a bio-safety cabinet. Because of the worldwide coronavirus outbreak, the shortage of masks plays a vital role in protecting individuals, so the main intention behind their paper is to reuse masks after exposing them to UV-C rays for 15–20 min. This helps, to some extent, avoid a shortage of surgical and N95 masks. Raeiszadeh et al. [11] present a review of applications of UV-C light in various places to help sanitize frequently touched surfaces. UV-C lights were installed in places like shopping malls, airports, etc., to kill bacteria. Apart from the applications, the authors also discuss safety considerations for UV-C light usage. Jiang et al. [12] survey the utilization of UV light for the sterilization of fruits and vegetables. This study gives insight into how useful UV light is for killing bacteria or fungus and for improving the shelf life of fruits and vegetables. Although multiple devices have been developed recently for the sterilization of groceries, fruits, vegetables, etc., their limitation is that exposure of the bottom surfaces is not taken care of, which our device addresses.

3 Methodology

The device is developed using an HC-05 Bluetooth module, relay, L298N motor driver, and UV light, as shown in Fig. 2. The major part of the device is the UV-C light (254 nm), which is more effective at reducing microbial growth and cleaning the surface than the commonly used chlorine, hydrogen peroxide, or ozone, which can leave residue and ultimately reduce quality. After exposure, the quality of the goods is not changed, as they retain their nutritional level. The device can be operated using the app or manually, and it can be operated based upon the customer's requirements. The goods are kept inside the box, and the action can be controlled using the buttons ON/OFF, RESET, SET Time, and Water (ON/OFF)
Fig. 2 Architecture of UV-C based device for Home



through a mobile phone or external switches. Users have the option of selecting either water-sprinkling mode or UV-exposure mode. After cleaning is complete, the system automatically beeps the completion status.
The device works with bulbs that produce the precise dosage of UV-C light required to destroy DNA and RNA viruses. Dual lights are used: one light tube is placed at the top and another at the bottom, each covering 180°. This makes sure the vegetables and fruits are exposed through 360°, aided by aluminum foils [13] that reflect light in all directions. The device also has a safety door that prevents light rays from escaping the box. The device can be used at home or in industry, depending on the quantity of the materials. The food industry must clean huge quantities of vegetables and fruits to prepare packed foods like juices and jams, and food materials need to be thoroughly washed when received from the market or directly from the farm. Some research has focused on disinfecting surface areas through such methods [14].
Depending on the quantity, the device can be modified as shown in Fig. 3. The device consists of a conveyor belt attached to the UV-C light unit, which keeps scrolling and pushing the vegetables/fruits through the device for continuous exposure to the UV-C light for a few minutes. The device can also incorporate a water sprinkler, depending on the type of vegetable being exposed, such as potatoes, carrots, etc. UV radiation is generally used for the treatment of drinking and waste water [15], air disinfection, and fruit and vegetable juice treatment. UV has been the choice within research facilities when obtaining biological safety cabinets for years and can even be used within the laboratory. The 100–280 nm wavelengths of UV, identified as UV-C, have their peak germicidal action at 265 nm. This germicidal action involves absorption by the DNA and RNA of microorganisms. Sources of UV light include sunlight and human-designed electric germicidal lamps that generate light at a designed dosage in the wavelength range of 100–400 nm. A light sample is shown in Fig. 4.

Fig. 3 Architecture of UV-C based device for industry

Fig. 4 UVC lamp



The Arduino Uno (microcontroller) is open-source hardware that can be developed and upgraded easily, and plenty of information and solutions related to it can be found. It acts as the core of the device to which all the components are attached and is responsible for controlling them. Specifically, these boards are equipped with both analog and digital input and output pins. They are programmed using software also called Arduino, which is likewise open source and uses a C-based programming language. An ultrasonic sensor [16] is an active motion sensor. It acts as both sender and receiver; as the name states, it uses ultrasonic waves to detect changes in the motion of any object in its straight path.

3.1 Plant Material and Fruits Used

The plant materials used for testing in the device are tomato, guava, apple, grapes, and banana. Tomato, botanical name Solanum lycopersicum [17], is a vegetable, the edible berry of the tomato plant, rich in vitamin C, potassium, folate, and the simple sugars glucose and fructose. The plants are widely affected by the tobacco mosaic virus, curly top, Pyrenochaeta lycopersici, Didymella stem rot, early blight, Alternaria solani, etc. Banana, of the botanical genus Musa, is an edible berry of the banana plant, rich in magnesium, vitamins C and B6, carbohydrates, and fiber. The fruit is widely affected by Panama disease, tropical race 4, etc. Guava, botanical name Psidium guajava [18], is rich in vitamin C, dietary fiber, and folic acid, with smaller amounts of vitamin A and potassium. Grapes, of the botanical family Vitaceae [19], are rich in vitamins C, K, and E and carbohydrates. Apple, botanical name Malus domestica [20], is an edible fruit rich in carbohydrates and fiber.
Figure 5 shows the circuit diagram of the device, consisting of the Arduino, Bluetooth module, relay, step-down transformer, UV sensor, fan, fuse, UV light, and their connections. The device can be operated using an app designed for it; to communicate with the device, the app must be configured over Bluetooth. Figure 6 shows the working flow of the device after connectivity is established between the device and the app: once the connection is successful, the user can start the device. The device will not turn on unless the door is closed. Once the door is closed, the user sets the timer for 5 min and starts the device. If the door is opened in between, the timer pauses and the UV light is turned off; when the door is closed again, the timer resumes.
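The door-interlock behavior in the flow chart can be sketched as a small state machine. The Python model below is purely illustrative (the real device runs this logic on the Arduino in C), with a shortened 3-second cycle standing in for the 5-minute timer.

```python
class UVChamber:
    """Behavioral sketch of the door-interlock logic described above."""

    def __init__(self, run_seconds=300):  # default 5-minute exposure
        self.remaining = run_seconds
        self.door_closed = False
        self.lamp_on = False

    def set_door(self, closed):
        self.door_closed = closed
        if not closed:
            self.lamp_on = False  # opening the door pauses the cycle and kills the lamp

    def tick(self):
        """Advance the timer by one second, honoring the interlock."""
        if self.door_closed and self.remaining > 0:
            self.lamp_on = True
            self.remaining -= 1
        if self.remaining == 0:
            self.lamp_on = False  # cycle complete: the device would beep here

c = UVChamber(run_seconds=3)
c.tick()                  # door still open: timer does not start
assert not c.lamp_on and c.remaining == 3
c.set_door(True); c.tick(); c.tick()
c.set_door(False)         # door opened mid-cycle: lamp off, timer paused
assert not c.lamp_on and c.remaining == 1
c.set_door(True); c.tick()
assert c.remaining == 0 and not c.lamp_on
```

The invariant the interlock enforces is simple: the lamp is never on while the door is open, regardless of how much exposure time remains.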

3.2 Quality of Fruits and Vegetables

Once food products such as fruits and vegetables are harvested from the farm, they need to be thoroughly cleaned before being sent to the processing industry, or the industry should take care of cleaning before they are used in packed food.

Fig. 5 Circuit diagram for the device

Fig. 6 Flow chart of the device operation

The products need to be stored in proper storage units for preservation and future processing, and the quality of the food may decrease once it goes into cold storage. The deterioration level [21] of fruits and vegetables can be found using systematic visual image analysis, with the damage level rated from 1 to 5 or from 10 to 100%, where 1 indicates poor quality and 5 the best quality, while 10% indicates the rotten stage and 100% indicates freshness. The deterioration level of fruits and vegetables or any food product can
be calculated using Eq. (1).

Deterioration level = ( Σ_{i=1}^{n} r_i·s_i ) / Total Samples = (r1·s1 + r2·s2 + ··· + rn·sn) / n    (1)

where r_i is the rating on the scale and s_i is the number of samples given rating r_i.

Case 1: If 10 samples of fruit are considered for quality assessment, and among them 2 fruits are rated on scale 3 and 8 are rated on scale 5, then the deterioration level of the 10 samples is DI = (3 × 2 + 5 × 8)/10 = 4.6, so the weighted rating of the samples is 4.6. The conclusion is a 4.6 rating for the 10 samples, and they are fresh to use.
Case 2: If 100 samples of fruit are considered for quality assessment, and among them 16 fruits are rated 1, 56 fruits are rated 2, 23 fruits are rated 3, and 5 are rated 4, then the deterioration level of the 100 samples is DI = (1 × 16 + 2 × 56 + 3 × 23 + 4 × 5)/100, a weighted average rating of 2.17. The conclusion is that the samples, rated 2.17, are not consumable.
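Applying Eq. (1) in code makes the two worked cases easy to check. The helper below is a sketch: ratings are passed as a rating-to-count mapping, a representation chosen here for convenience rather than one the paper specifies.

```python
def deterioration_level(rating_counts):
    """Weighted-average deterioration level per Eq. (1):
    DI = (r1*s1 + ... + rn*sn) / total samples, where r is a rating on
    the scale and s the number of samples given that rating."""
    total = sum(rating_counts.values())
    return sum(r * s for r, s in rating_counts.items()) / total

# Case 1: 2 fruits rated 3 and 8 fruits rated 5, out of 10 samples
print(deterioration_level({3: 2, 5: 8}))                  # 4.6 -> fresh to use

# Case 2: ratings 1, 2, 3, 4 over 100 samples
print(deterioration_level({1: 16, 2: 56, 3: 23, 4: 5}))   # 2.17 -> not consumable
```

Note that the formula yields 4.6 for Case 1 and 2.17 for Case 2, matching the weighted averages above.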

4 Results and Discussion

The UV-C-based sterilization device for vegetables and fruits is shown in Fig. 7. The objective is to disinfect the vegetables and fruits from bacteria and viruses by exposing the materials to UV light (254 nm, UV-C). Three samples were taken: Sample 1, untreated; Sample 2, exposed to UV for 10 min; Sample 3, exposed for varying durations. After exposure, the samples were tested for the concentration of carbohydrates and proteins.

Fig. 7 Sample vegetable exposed to UV-C device



A sample consists of one vegetable and one fruit. The testing results are shown in Figs. 8 and 9. Figure 8 shows the carbohydrate content determined using the dinitrosalicylic acid assay [22], a standard spectrophotometric biochemical assay. The values indicate a higher quantity of carbohydrates with an increase in treatment time, indicating possible hydrolysis. Figure 9 shows the protein content determined using the Bradford assay [23], a standard spectrophotometric biochemical assay. Banana: the values indicate no presence of free proteins in bananas, possibly due to the low protein content of banana fruit. In tomatoes, there is an increase in the concentration of proteins with increasing treatment time, possibly due to increased protein release. High-performance liquid chromatography (HPLC) [24], also named high-pressure liquid chromatography, is used to separate, identify, and quantify each component in a mixture.
The HPLC method is used for many applications, such as during the pharmaceutical product manufacturing process and for the separation of complex biological samples. The analysis was carried out on an instrument equipped with a binary pump (LC-20 AD), a variable-wavelength UV–VIS detector (SPD-M20A), a Shimadzu C18 column (250 × 4.6 mm, 5 µ), and a manual injection valve fitted with a 20 µl sample loop. The instrument was controlled by LC Solution software. Apple, grape, and guava samples were exposed in a UV incubator chamber for 5 and 10 min and then macerated in water for malic acid, citric acid, and vitamin C analysis.
Vitamin C was analyzed using Shimadzu C18 column with a mobile phase of
acetonitrile and 10 mM potassium di-hydrogen orthophosphate buffer mixed in a

Fig. 8 Carbohydrate concentration results using dinitrosalicylic acid assay

Fig. 9 Protein concentration results using Bradford assay

ratio of 40:60 (pH = 2.1) at a wavelength of 268 nm with a UV detector. The flow
of the mobile phase was maintained at a speed of 1 ml/min. Vitamin C content has
remained unchanged in all three fruits after 5 and 10 min of exposure in the chamber.
Maximum Vitamin C was observed in Guava (4.0 mg/gm FW) followed by Grape
and Apple.
Citric acid and malic acid were isocratically separated using a Shimadzu C18 column at a 1 ml/min flow rate. The mobile phase used was 2% (w/v) ammonium di-hydrogen orthophosphate (NH4H2PO4) buffer (pH 2.18). A UV–VIS detector at a wavelength of 214 nm was used for detection. Both citric acid and malic acid remained unaffected by exposure in the UV chamber for 5 and 10 min. The maximum citric acid was observed in guava (2.31 mg/gm FW), followed by grape and apple. A similar trend was observed for malic acid as well. The variation results are shown in Fig. 10, and sample images from testing are shown in Fig. 11.
Applications: A UV-based sterilization system/device can be used in homes, the food industry, hostel messes, hotels, restaurants, etc. The device assures cleanliness and microbial-free use for individuals and for processing vegetables and fruits for packaged goods, juices, jams, pickles, snacks, instant foods, etc. The device can be augmented with deep learning and high-end IoT devices, along with the cloud, to perform more analysis and to recognize the quality of the samples used in industry.

Fig. 10 Variations of citric acid, malic acid, vitamin C in guava, grapes, and apple when exposed
to untreated, 5 and 10 min
The Impact of UV-C Treatment on Fruits and Vegetables … 245

Fig. 11 Testing samples of vegetables and fruits

5 Conclusion

During this pandemic, washing fruits and vegetables alone does not guarantee
the destruction of bacteria. So, we developed a model which consists
of UV lights at the top and bottom of the device. The two lights expose the
items from above and below, and aluminum foil reflects the light through the full 360°.
Exposure destroys the microorganisms on the surface and extends shelf life
by a further few days compared to cold storage alone. The device can be used per load,
with exposure followed by cold storage to preserve items a few more days. After
exposure, experiments were conducted using microbiological and HPLC methods
on samples of tomato, banana, guava, grapes, and apple. The results were
evaluated to show the changes in concentration of carbohydrates, protein, malic acid,
citric acid, and vitamins. These methods proved that after exposure to UV light, the
quality of the items did not change, and shelf life was extended.

Acknowledgements The authors express their sincere gratitude to our Honorable Chancellor Dr.
P. Shyama Raju Sir, Our Director Dr. S. Senthil Sir, Dr. Rajeev Ranjan Sir, School of Computer
Science and Applications, and Dr. Shilpa BR, Dr. Jayashree, School of Applied Science and REVA
family for giving constant encouragement, support to carry out research at REVA University. The
implementation work was carried out along with BCA Students Mr. Yashwanth, Mr. Vinay, Mr.
Uthej, and Mr. Darshan from the School of Computer science and Applications.

References

1. M.H. Khan, H. Yadav, Sanitization during and after COVID-19 pandemic: a short review. Trans.
Indian Natl. Acad. Eng. 5, 617–627 (2020). https://doi.org/10.1007/s41403-020-00177-9
2. S.Z. Li, A. Jain (eds.), Electromagnetic spectrum, in Encyclopedia of Biometrics (Springer,
Boston, 2009). https://doi.org/10.1007/978-0-387-73003-5_504
3. https://uvceco.com/why-uv-c-cannot-produce-ozone/
4. Y. Gu, W. Han, L. Zheng, B. Jin, Using IoT technologies to resolve the food safety problem—
an analysis based on Chinese food standards, in Web Information Systems and Mining. WISM
2012, ed. by F.L. Wang, J. Lei, Z. Gong, X. Luo, Lecture Notes in Computer Science, vol. 7529
(Springer, Berlin, 2012). https://doi.org/10.1007/978-3-642-33469-6_50
5. S.K. Behera, A.K. Rath, A. Mahapatra et al., Identification, classification & grading of fruits
using machine learning & computer intelligence: a review. J. Ambient. Intell. Human Comput.
(2020). https://doi.org/10.1007/s12652-020-01865-8

6. D. Yogesh, A.K. Dubey, R. Ratan, et al., Computer vision based analysis and detection of
defects in fruits causes due to nutrients deficiency. Cluster Comput. 23, 1817–1826 (2020).
https://doi.org/10.1007/s10586-019-03029-6
7. T. Koutchma, V. Popović, V. Ros-Polski, A. Popielarz, Effects of ultraviolet light and high-
pressure processing on quality and health-related constituents of fresh juice products. Compr.
Rev. Food Sci. Food Saf. 15, 844–867 (2016). https://doi.org/10.1111/1541-4337.12214
8. C. Moreno, M.J. Andrade-Cuvi, M.J. Zaro, M. Darre, A.R. Vicente, A. Concellón, Short UV-C
treatment prevents browning and extends the shelf-life of fresh-cut carambola. J. Food Qual.
Article ID 2548791, 9 (2017). https://doi.org/10.1155/2017/2548791
9. H.-H. Yang, U.-I. Wu, H.-M. Tai, W.-H. Sheng, Effectiveness of an ultraviolet-C disinfection
system for reduction of healthcare-associated pathogens. J. Microbiol. Immunol. Infect. 52(3),
487–493 (2019). ISSN 1684-1182. https://doi.org/10.1016/j.jmii.2017.08.017
10. D.T. Weaver, B.D. McElvany, V. Gopalakrishnan, K.J. Card, D. Crozier, A. Dhawan, J.G. Scott,
UV decontamination of personal protective equipment with idle laboratory biosafety cabinets
during the COVID-19 pandemic. Plos One 16(7), e0241734 (2021)
11. M. Raeiszadeh, B. Adeli, A critical review on ultraviolet disinfection systems against COVID-
19 outbreak: applicability, validation, and safety considerations. ACS Photonics 7(11), 2941–
2951 (2020)
12. Q. Jiang, M. Zhang, B. Xu, Application of ultrasonic technology in postharvested fruits and
vegetables storage: a review. Ultrason. Sonochem. 105261 (2020)
13. E.V. Grabovski, P.V. Sasorov, A.P. Shevelko et al., Radiative heating of thin Al foils by intense
extreme ultraviolet radiation. Jetp Lett. 103, 350–356 (2016). https://doi.org/10.1134/S00213
64016050040
14. S. Bredholt, J. Maukonen, K. Kujanpää et al., Microbial methods for assessment of cleaning
and disinfection of food-processing surfaces cleaned in a low-pressure system. Eur. Food Res.
Technol. 209, 145–152 (1999). https://doi.org/10.1007/s002170050474
15. J.P. Chen, L. Yang, L.K. Wang, B. Zhang, Ultraviolet radiation for disinfection, in Advanced
Physicochemical Treatment Processes, ed. by L.K. Wang, Y.T. Hung, N.K. Shammas. Hand-
book of Environmental Engineering, vol. 4 (Humana Press, 2006). https://doi.org/10.1007/978-
1-59745-029-4_10
16. S. Wang, Q. Liu, S. Chen, Y. Xue, Design and application of distance measure ultrasonic sensor,
in Advances in Mechanical and Electronic Engineering, ed. by D. Jin, S. Lin. Lecture Notes in
Electrical Engineering, vol. 178 (Springer, Berlin, 2013). https://doi.org/10.1007/978-3-642-
31528-2_18
17. M. Dorais, D.L. Ehret, A.P. Papadopoulos, Tomato (Solanum lycopersicum) health compo-
nents: from the seed to the consumer. Phytochem. Rev. 7, 231 (2008). https://doi.org/10.1007/
s11101-007-9085-x
18. A. Vijaya Anand, S. Velayuthaprabhu, R.L. Rengarajan, P. Sampathkumar, R. Radhakrishnan,
Bioactive compounds of Guava (Psidium guajava L.), in Bioactive Compounds in Underutilized
Fruits and Nuts, ed. by H. Murthy, V. Bapat. Reference Series in Phytochemistry (Springer,
Cham, 2020). https://doi.org/10.1007/978-3-030-30182-8_37
19. J. Wen, Vitaceae, in Flowering Plants · Eudicots. The Families and Genera of Vascular Plants,
vol. 9, ed. by K. Kubitzki (Springer, Berlin, 2007). https://doi.org/10.1007/978-3-540-32219-
1_54
20. E. Szücs, T. Kállay, Determination of fruiting capacity of apple trees (Malus domestica) by
DRIS, in Plant Nutrition—Physiology and Applications. Developments in Plant and Soil
Sciences, vol. 41, ed. by M.L. van Beusichem (Springer, Dordrecht, 1990). https://doi.org/
10.1007/978-94-009-0585-6_120
21. P. Li, X. Yu, B. Xu, Effects of UV-C light exposure and refrigeration on phenolic and antioxidant
profiles of subtropical fruits (Litchi, Longan, and Rambutan) in different fruit forms. J. Food
Qual. 2017, 12 (2017), Article ID 8785121. https://doi.org/10.1155/2017/8785121
22. M.J. Bailey, A note on the use of dinitrosalicylic acid for determining the products of enzymatic
reactions. Appl. Microbiol. Biotechnol. 29, 494–496 (1988). https://doi.org/10.1007/BF0026
9074

23. C.G. Jones, J. Daniel Hare, S.J. Compton, Measuring plant protein with the Bradford assay. J.
Chem. Ecol. 15, 979–992 (1989). https://doi.org/10.1007/BF01015193
24. V.R. Meyer, High-performance liquid chromatography (HPLC), in Practical Methods in
Cardiovascular Research, ed. by S. Dhein, F.W. Mohr, M. Delmar (Springer, Berlin, 2005).
https://doi.org/10.1007/3-540-26574-0_35
Modeling and Forecasting Stock Closing
Prices with Hybrid Functional Link
Artificial Neural Network

Subhranginee Das, Sarat Chandra Nayak, and Biswajit Sahoo

Abstract Stock closing prices fluctuate arbitrarily, exhibit high nonlinearity,
and are influenced by many macro-economic factors; they are therefore hard to
anticipate. Functional link artificial neural networks (FLANNs) are popular nonlinear
approximation methods for predicting stock prices. The training process of FLANN
vastly affects its generalization performance. In contrast to gradient-based training,
nature-inspired optimization algorithms-based training is found efficient. This article
presents a hybrid forecast combining artificial electric field algorithm (AEFA) and
FLANN called AEFA + FLANN to predict stock price movements. AEFA is
used to evolve an optimal FLANN structure through the exploration of the best
feasible model parameters. Experiments with the proposed model are conducted on
actual prices collected from different stock markets such as the Dow Jones Industrial
Average (DJIA), National Association of Securities Dealers Automated Quotations
(NASDAQ), Bombay Stock Exchange (BSE), and Hang Seng Index (HSI). Outcomes
of result analysis and comparative studies have shown the out-performance of AEFA
+ FLANN-based forecasting.

Keywords Stock price · Forecasting · Financial · Functional link artificial neural
network · Electric field algorithm · Gradient descent

S. Das (B)
Department of Computer Science and Engineering, KL University, Hyderabad, India
e-mail: subhranginee.das@klh.edu.in
S. C. Nayak
Department of Computer Science and Engineering, CMR College of Engineering and Technology,
Hyderabad, India
B. Sahoo
School of Computer Engineering, KIIT University, Bhubaneswar, India
e-mail: bsahoofcs@kiit.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 249
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_19
250 S. Das et al.

1 Introduction

The economy of a country has a direct link to the stock market. The stock price
changes arbitrarily depending upon international law, global market scenario, gold
price, petrol price, exchange rate, various socio-economic and political factors, etc.
[1, 2]. Due to such random movement, the trend it follows is a nonlinear curve.
Predicting a point on such a highly nonlinear curve is a tough job. A very nominal
variation of stock price can have an impact on the prices of the international economy.
Consequently, for the stock market, an efficient prediction mechanism is desired.
Based on the linear association of past and current data, several statistical models are
recommended in the early days to forecast financial data. However, these methods
are not found promising in forecasting stock series efficiently. Therefore, advances in
intelligence techniques such as artificial neural networks (ANNs) have been consid-
ered a better substitute for statistical methods and ANN have been successfully used
in the literature for stock price forecasting [3–6]. ANNs need many parameter selec-
tions, such as hidden layer numbers and number of nodes in each hidden layer.
But higher-order neural network having fewer parameter requirements with impres-
sive computational, storage, and learning capabilities can overcome the demerits of
ANN. This paper used one such higher-order neural network, i.e., FLANN, for better
prediction efficiency.
Finding the optimal weight and biases of FLANN structure is a crucial aspect that
needs human skill. Usually, gradient-based methods are used to achieve this, but it is
associated with few drawbacks such as slow convergence and trapped in local optima.
Later, many evolutionary optimization techniques came forward inspired by natural
phenomena and have been used as a better substitute for gradient-based methods
[7]. Evolutionary learning methods such as genetic algorithm (GA), particle swarm
optimization (PSO), and differential evolution (DE) are more proficient methods for
searching the optimal FLANN parameters. Since no single technique was found suit-
able in solving all sorts of problems, continuous improvements in existing methods
have been carried out by researchers through enhancement in an algorithm [8, 9]
or hybridization of them [10–15]. Recently, AEFA has been anticipated as an opti-
mization method inspired by the principle of electrostatic force [16]. AEFA is based
on a robust theoretical concept of charged particles, electric field, and the attrac-
tion/repulsion force between two charged particles in an electric field. The learning
capacity, convergence rate, and acceleration updates of AEFA have been established
in [16] through solving some benchmark optimization problems.
This work is an initiative toward investigating the potential of AEFA in fine-
tuning the parameters of a FLANN, thus designing a hybrid model called as AEFA
+ FLANN. The proposed AEFA + FLANN is assessed through forecasting stock
prices of DJIA, NASDAQ, BSE, and HSI. Data pre-processing, input selection, and
model design steps are also explained.
Modeling and Forecasting Stock Closing Prices with Hybrid … 251

2 FLANN

A FLANN is a class of higher-order neural network built on functional expansion.
A FLANN may use different functional expansions, viz. Chebyshev expansion,
trigonometric expansion, or polynomial expansion. In this research, we have
used a trigonometric FLANN. Figure 1 shows the FLANN architecture.
Let x(i) be the representation of the input pattern before expansion, where 1 ≤
i ≤ I (the total number of features). The functionally expanded inputs are f_n(i),
1 ≤ n ≤ k, where k is the number of expanded points for each input element; here k =
6 and I is the total number of features taken from the dataset. Each input pattern
is expanded using the trigonometric expansion of Eq. (1):
of each input pattern using trigonometric expansion is as Eq. (1):

f 1 (i) = x(i), f 2 (i) = sin(x(i)), ⎪

f 3 (i) = sin(π x(i)), f 4 (i) = sin(2π x(i)), (1)


f 5 (i) = cos(x(i)), f 6 (i) = cos(π x(i))

For easier naming, we use z_j(t) for the expanded input at the tth iteration, where
1 ≤ j ≤ J and J = k × I.
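As a concrete sketch, the expansion of Eq. (1) can be written in a few lines. This is illustrative only; the interleaved ordering of the six basis values per feature is an assumption, since the paper does not fix an ordering:

```python
import numpy as np

def trig_expand(x):
    """Trigonometric functional expansion of one input pattern (Eq. 1).

    Each of the I raw features x(i) is mapped to k = 6 basis values
    f1..f6, so a pattern of length I becomes a vector of length J = k * I.
    """
    x = np.asarray(x, dtype=float)
    basis = [x,                      # f1
             np.sin(x),              # f2
             np.sin(np.pi * x),      # f3
             np.sin(2 * np.pi * x),  # f4
             np.cos(x),              # f5
             np.cos(np.pi * x)]      # f6
    # column_stack keeps the six expansions of each feature adjacent
    return np.column_stack(basis).ravel()
```

For I = 2 input features this yields J = 12 expanded inputs, matching J = k × I with k = 6.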
These nonlinear outcomes of the input layer are multiplied with the corresponding
randomly initialized weight values chosen from the range [−1,1], and the summed
result is then passed through the activation function to produce output. Here, we have
taken one neuron in the output layer. The computed output from the output layer is
compared with the desired output to compute error. The resulting error for the given

Fig. 1 Functional link artificial neural network


252 S. Das et al.

pattern is given by Eq. (2):

e(t) = y(t) − ŷ(t)          (2)

Here, y(t) is the target output and ŷ(t) is the computed output for an input pattern
at the tth iteration. ŷ(t) is calculated as in Eq. (3):

ŷ(t) = Σ_{j=0}^{J} z_j(t) · w_j(t)          (3)

where z_j(t) is the functionally expanded input of node j at the tth iteration, z_0(t) = 1,
w_j(t) is the weight from node j to the output node at the tth iteration, and w_0(t) is
the bias, initialized with a random value from the range [−0.5, 0.5]. The error computed
in Eq. (2) is used to calculate the change in weight, as in Eq. (4):

Δw_j^d(t) = μ × z_j(t) × e(t)          (4)

where μ is the convergence coefficient and Δw_j^d(t) is the change in weight at the
tth iteration for input pattern d. If there are p patterns in total in the training set,
then the average change in each weight is given by Eq. (5):

Δw_j(t) = (1/p) Σ_{d=1}^{p} Δw_j^d(t)          (5)

Then, the updated weight is given by Eq. (6):

w_j(t + 1) = w_j(t) + Δw_j(t)          (6)
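A minimal sketch of one training epoch over Eqs. (2)–(6), assuming a linear output neuron (the paper additionally passes the weighted sum through an activation function, omitted here for brevity):

```python
import numpy as np

def flann_epoch(Z, y, w, mu=0.1):
    """One batch-averaged weight update of the FLANN output layer.

    Z  : (p, J+1) expanded inputs with z_0 = 1 as a bias column
    y  : (p,) target outputs
    w  : (J+1,) weights, w[0] being the bias w_0
    mu : convergence coefficient of Eq. (4)
    """
    y_hat = Z @ w                              # Eq. (3)
    e = y - y_hat                              # Eq. (2)
    dw_per_pattern = mu * Z * e[:, None]       # Eq. (4), one row per pattern d
    dw = dw_per_pattern.mean(axis=0)           # Eq. (5), average over p patterns
    return w + dw, e                           # Eq. (6)
```

Repeating this epoch until the average error stops decreasing recovers the gradient-based training the paper contrasts with AEFA-based learning.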

3 AEFA + FLANN-Based Forecasting

AEFA is designed on the principle of the electrostatic force of Coulomb's law [16]. It
simulates charged particles as agents and measures their strength in terms of their
amounts. The particles are moveable in the search domain through electrostatic force
of attraction/repulsion among them. The charges possessed by the particles are used
for interaction, and the charges’ positions are considered the potential solutions for
the problem. According to AEFA, the particle having the highest charge is regarded
as the best individual. It has the ability to attract other particles having inferior
charge and it moves in the search domain. The mathematical justification of AEFA
is illustrated in [16]. Here, we simulate a potential solution of FLANN as a charged

particle and its fitness function as the quantity of charge associated with that element.
The velocity and position of a particle at time instant t are updated as per Eqs. (7)
and (8), respectively:

V_i^d(t + 1) = rand_i · V_i^d(t) + acceleration_i^d(t)          (7)

X_i^d(t + 1) = X_i^d(t) + V_i^d(t + 1)          (8)
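The two updates can be sketched as follows. This is a fragment only; computing the acceleration from the Coulomb-force model of [16] is outside this sketch:

```python
import numpy as np

def aefa_move(X, V, acceleration, rng):
    """AEFA velocity and position updates, Eqs. (7) and (8).

    X, V, acceleration : (n_agents, dim) arrays holding the d components
    of each particle i.  rand_i is a fresh uniform draw in [0, 1].
    """
    V_new = rng.random(V.shape) * V + acceleration   # Eq. (7)
    X_new = X + V_new                                # Eq. (8)
    return X_new, V_new
```

Note that when the current velocity is zero, the particle moves exactly by its acceleration, so the force model alone drives the first step.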

The following steps present the AEFA + FLANN-based forecasting method and
the overall AEFA + FLANN steps are shown in Fig. 2.

Fig. 2 The process of AEFA + FLANN



For AEFA + FLANN, the collected data are divided into two datasets: training
data and test data. The training dataset is used to train the FLANN network. Once
the network is trained and the network parameters are computed, the testing dataset
is used to test the model's accuracy. For both datasets, the rolling window method is
used to select inputs. The input data are normalized using sigmoid normalization.
Then, from the normalized data, I data points are selected.
These I inputs pass through the expansion function of the given FLANN
and get expanded: each input feature is expanded into k features, so a
total of I × k inputs are created in each step. These expanded inputs are
assigned random weights and biases. The weighted sum then goes through
the activation function to produce the output. The computed output is denormalized and
compared with the actual output to calculate the error. The generated error is used to
determine the fitness value. From that, the best and worst fits are selected, and the
best-fit weights are calculated using the AEFA steps. Once the best possible weights
are computed, the FLANN model built from these weights is used to test new patterns.
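The input pipeline above can be sketched as follows. The exact sigmoid normalization expression is taken from [17] and not reproduced in this paper, so the min-max-then-squash form below is an assumption:

```python
import numpy as np

def rolling_windows(series, window):
    """Rolling-window input selection: each row holds `window` past
    closing prices, and the target is the next closing price."""
    s = np.asarray(series, dtype=float)
    X = np.array([s[i:i + window] for i in range(len(s) - window)])
    y = s[window:]
    return X, y

def sigmoid_norm(x, lo, hi):
    """Assumed sigmoid normalization: min-max scale, then squash into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(x - lo) / (hi - lo)))
```

Each row of X, once normalized and functionally expanded, forms one I-element input pattern for the FLANN; the model's output is passed through the inverse mapping to recover a price.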

4 Experimental Results and Analysis

We conducted different experiments on four real stock price series such as NASDAQ,
DJIA, BSE, and HSI to measure the predictability of the proposed approach and
comparative methods. The actual stock prices on daily basis for one financial year
are collected from https://finance.yahoo.com/quote/history. There are approximately
252 data points on each series. From a series, inputs are carefully chosen through
the rolling window method. Raw data are gone through the normalization process
using the sigmoid method and then fed to the models separately [17]. The FLANN
is trained through AEFA-based learning and approximates an output. The esti-
mated output is denormalized and compared with the observed value. The variation
from actual output is measured as the error caused by the model. Six comparative
models such as gradient descent FLANN (GD-FLANN), genetic algorithm-based
FLANN (GA-FLANN), differential evolution-based FLANN (DE-FLANN), and
PSO-FLANN, multilayer perceptron (MLP), and autoregressive integrated moving
average (ARIMA) are developed similarly and used for fair comparisons.
To avoid the stochastic nature of the neural network-based models, we simulated
each model twenty times and the mean error from twenty experiments is summarized
in Table 1. The best average errors are shown in boldface. For all datasets, AEFA
+ FLANN produced the best average errors. Though a few ties occur with
GA-FLANN and PSO-FLANN, AEFA + FLANN achieved the lowest average
error for all datasets. The AEFA + FLANN estimated prices against actual prices
are plotted in Figs. 3, 4, 5 and 6.
Table 1 Error statistics from all forecasts

Series  Error    MLP      ARIMA    GD-FLANN  GA-FLANN  DE-FLANN  PSO-FLANN  AEFA + FLANN
DJIA    Average  0.84293  1.00372  0.01568   0.00773   0.00977   0.00862    0.00773
        Std      0.02018  0.04728  0.00875   0.00888   0.00835   0.00875    0.00299
NASDAQ  Average  0.97930  0.99529  0.08580   0.04522   0.04407   0.04377    0.02165
        Std      0.04283  0.05227  0.03269   0.05364   0.03465   0.00560    0.00935
BSE     Average  0.86115  0.99274  0.03805   0.00835   0.02377   0.00835    0.00835
        Std      0.03228  0.07274  0.04283   0.02623   0.02113   0.01757    0.02113
HSI     Average  0.93227  0.98306  0.06892   0.01495   0.00469   0.00930    0.00930
        Std      0.02215  0.04261  0.01903   0.01302   0.01148   0.01038    0.00535

Fig. 3 Forecast plots from NASDAQ

Fig. 4 Forecast plots from DJIA



Fig. 5 Forecast plots from BSE

Fig. 6 Forecast plots from HSI



5 Conclusions and Future Scope

This article presented a hybrid prediction model called AEFA + FLANN for
modeling stock market price movements. AEFA is used to determine the optimal
parameters of the FLANN, thus constructing the forecast evolutionarily. The resulting
hybrid forecast is applied to predict the future closing prices of four real stock data
series. The model inputs are extracted from the original data series using the rolling
window method. The model output is finally denormalized to obtain the predicted price.
Six comparative forecasts are developed. The model performances are measured in terms
of average error and standard deviation. From exhaustive simulation studies, it is
observed that the AEFA + FLANN model is more efficient than the others in capturing
the hidden patterns in the stock prices and generated the lowest error signals. This
model can be used for other time series data prediction. The present study can
therefore be extended with improvised versions of AEFA and by adopting other neural
models.

References

1. C.J. Huang, D.X. Yang, Y.T. Chuang, Application of wrapper approach and composite classifier
to the stock trend prediction. Expert Syst. Appl. 34(4), 2870–2878 (2008)
2. H.-C. Liu, Y.-H. Lee, M.-C. Lee, Forecasting china stock markets volatility via GARCH models
under skewed-GED distribution. J. Money Invest. Bank. 5–14 (2009)
3. S. Soni, Applications of ANNs in stock market prediction: a survey. Int. J. Comput. Sci. Eng.
Technol. 2(3), 71–83 (2011)
4. A. Rao, S. Hule, H. Shaikh, E. Nirwan, P.M. Daflapurkar, Survey: stock market prediction
using statistical computational methodologies and artificial neural networks. Int. Res. J. Eng.
Technol. 08, 2395–2456 (2015)
5. V. Rajput, S. Bobde, Stock market forecasting techniques: literature survey. Int. J. Comput.
Sci. Mob. Comput. 5(6), 500–506 (2016)
6. A. Sharma, D. Bhuriya, U. Singh, Survey of stock market prediction using machine learning
approach, in 2017 International conference of electronics, communication and aerospace
technology (ICECA), Vol. 2, pp. 506–509. IEEE (2017, April)
7. P.B. Rana, J.L. Patel, D.I. Lalwani, Parametric optimization of turning process using evolu-
tionary optimization techniques—a review (2000–2016). Soft Comput. Probl. Solv. 165–180
(2019)
8. N. Shadbolt, From the Editor in Chief: Nature-inspired computing. IEEE Intell. Syst. 19(01),
2–3 (2004)
9. K. Opara, J. Arabas, Comparison of mutation strategies in differential evolution—a proba-
bilistic perspective. Swarm Evol. Comput. 39, 53–69 (2018)
10. S. Jiang, Y. Wang, Z. Ji, Convergence analysis and performance of an improved gravitational
search algorithm. Appl. Soft Comput. 24, 363–384 (2014)
11. S.C. Nayak, B.B. Misra, A Chemical Reaction Optimization based Neuro-Fuzzy hybrid Network
for Stock Closing Prices Prediction, Financial Innovation (Springer, Berlin, 2019)
12. S.C. Nayak, M.D. Ansari, COA-HONN: cooperative optimization algorithm based higher
order neural networks for stock forecasting, in Recent Advances in Computer Science and
Communications (Bentham Science, 2019)

13. S. C. Nayak, S. Das, Md. Ansari, TLBO-FLN: teaching learning-based optimization of func-
tional link neural networks for stock closing price prediction. Int. J. Sens. Wirel. Commun.
Control Bentham Sci. (2019)
14. S. Das, S.C. Nayak, B. Sahoo, Towards crafting optimal functional link artificial neural
networks with RAO algorithms for stock closing prices prediction. Comput. Econ. 1–23 (2021)
15. S.C. Nayak, A fireworks algorithm based Pi-Sigma neural network (FWA-PSNN) for modelling
and forecasting chaotic crude oil price time series. EAI Trans. Energy Web (2020)
16. A. Yadav, AEFA: artificial electric field algorithm for global optimization. Swarm Evol.
Comput. 48, 93–108 (2019)
17. S.C. Nayak, B.B. Misra, H.S. Behera, ACFLN: artificial chemical functional link network for
prediction of stock market index. Evol. Syst. 10(4), 567–592 (2019)
Whale Optimization Algorithm Based
Optimal Power Flow to Reduce
Generation Cost

T. Papi Naidu, B. Venkateswararao, and G. Balasubramanian

Abstract The whale optimization algorithm (WOA), motivated by the hunting
behavior of whales, is applied to solve the optimal power flow (OPF) problem in this paper.
The OPF is a highly nonlinear and composite optimization problem in which the
steady-state parameters of an electrical system need to be determined for its cost-
effective and efficient operation. Solving the OPF remains a popular and demanding job
among power system researchers. In this paper, WOA is executed on the IEEE 30
bus power system; this approach is useful to optimize the regulated variables like
active power generations, setting of tap changing transformers, generator voltages,
and shunt capacitance values. By considering these regulated variables with several
power system constraints, obtain the minimization of active power generation cost
using WOA. The advantages of the WOA have been demonstrated in comparison
with other algorithms like gray wolf optimizer (GWO) and moth swarm algorithm
(MSA).

Keywords OPF · WOA · Fuel cost · Optimization

1 Introduction

Electrical power utilization is increasing day by day and also looking for economical
operation by reducing the generation cost. In recent years, one of the predominant
issues applied to realize the optimal planning process of a practical system is optimal
power flow (OPF). The function of optimal power flow (OPF) is more significant in
modern power system operation and control. OPF problem optimizes the regulated
variables by considering the minimization of fuel cost to reduce the generation cost.

T. Papi Naidu
Lendi Institute of Engineering and Technology, Vijayanagaram, AP 535005, India
Annamalai University, Chidambaram, Tamil Nadu, India
B. Venkateswararao (B)
V. R. Siddhartha Engineering College, Vijayawada, AP 520007, India
G. Balasubramanian
Government College of Engineering, Tirunelveli, TN 627001, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 261
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_20
262 T. Papi Naidu et al.

Generation of power from numerous sources in the system is to be optimally
organized for the cost-effective and competent functioning of the system.
OPF problem is articulated with all relevant information with generator outputs
to get the best possible settings. To handle different problems, numerous researchers
applied different techniques. Some of them are given as follows. Asija et al. [1]
explained the current scenario of the deregulated power system, which needs to be
analyzed for better cost-effectiveness and system efficiency. The true power generation
cost reduction for a power system depends on many parameters, one of which is
generator allocation. They used a GA-based optimization technique to acquire the optimal
generator allocation. They tested on IEEE 14 bus system. Bouchekara et al. [2]
proposed a novel nature motivated algorithm called black-hole-based optimization to
answer the OPF problem, and they applied it to the IEEE 30-bus system. Bouchekara
et al. [3] proposed a teaching–learning-based approach to solve the OPF problem. In
order to demonstrate the efficacy of the projected technique, it has been pragmatic
to IEEE 30 bus system for altered objectives like fuel cost, power loss, and deviation
in voltage that imitate the enactment of power system.
Duman et al. [4] implemented a gravitational search algorithm (GSA) to find the
best elucidation for the OPF problem. This approach is functional to govern the best
sets scheme with various objective functions. In [5], El-Fergany et al. applied the
DE and GWO for solving IEEE 30 and 118 bus systems with various objectives.
Houssem Rafik et al. [6] use an improved version of electromagnetism to solve
the optimal solution for OPF difficulty in a power system. To show the success of
the developed technique, it is verified on IEEE 30 and 57 bus systems for altered
objectives. Ghasemi et al. [7] applied the evolutionary method called an imperialist
competitive algorithm (ICA) to solve OPF, which is a significant apparatus in together
scheduling and operating stages; the enactment of this method is calculated on IEEE
30, 57 bus systems with several objective functions. Ramesh Kumar et al. [8] used
ARCBBO for the optimization of several objective tasks of OPF. Roy et al. [9]
focused mainly on the employment of OPF problem to get the benefit in terms of
economic aspect. It is solved by GABC. Surender Reddy et al. in [10] proposed a
capable approach for evolutionary algorithm-based OPF. The major disadvantage is
the extreme finishing time due to the huge number of power flows necessary in the
result progression. Chaib [11] used the latest technique called BSA for solving the
OPF. This technique is tested for 16 dissimilar cases in IEEE 30, 57, 118 bus systems.
In addition to the conventional fuel cost, multi-fuel options, valve point cause, and
other complications have been measured. Biswas et al. [12] studied and applied
numerous evolutionary algorithms (EAs) to find optimal OPF problem solutions.
Mohamed et al. [13] present a novel MSA, motivated by the direction of moths
near moonlight to answer the controlled OPF. The four heuristic algorithms and MSA
are applied on IEEE 30, 57, and 118 bus power systems. These methodologies are
used to adjust the regulated variables. Surender Reddy et al. [14] proposed swarm
intelligence (SI) methods to solve OPF problem, and they applied on IEEE 30 bus
and practical Indian 75 bus system for cost reduction as an objective function. Based
on the above review, most of the authors applied the OPF to the IEEE 30 bus system
and considered the true power generation cost as the main objective. Therefore in
Whale Optimization Algorithm Based Optimal … 263

this paper, authors present the well-known algorithm, WOA, and applied on IEEE 30
bus system. By this technique, improved convergence characteristics are obtained,
and the cost of the generation getting reduced.

2 OPF Problem Structure

Mathematically, the OPF is denoted in Eq. (1) as:

Min: O(x, u)          (1)

subject to: I(x, u) ≤ 0,
            E(x, u) = 0,

where O(x, u) is the OPF objective function, I(x, u) is the set of inequality
constraints, and E(x, u) is the set of equality constraints.
The independent (control) variables are represented in Eq. (2):

u = [P_TG2, P_TG5, P_TG8, P_TG11, P_TG13, V_TG1, V_TG2, V_TG5,
     V_TG8, V_TG11, V_TG13, Q_C10, Q_C12, Q_C15, Q_C17, Q_C20,
     Q_C21, Q_C23, Q_C24, Q_C29, T_11, T_12, T_15, T_36]          (2)

where P_Gi is the active power of the generator at bus i, V_Gi is the voltage magnitude
at the ith PV bus, T_j is the tap setting of the transformer in branch j, and Q_Ck is the
shunt capacitor at bus k.
The dependent variables are presented in Eq. (3):

x = [P_G1, V_L1 … V_LNL, Q_G1 … Q_GNG, S_l1 … S_lnl]          (3)

2.1 Cost Minimization

The goal here is to minimize the generation cost. The total fuel cost function F1 for
NTG thermal generating units can be written as in Eq. 4:

F1 = Σ_{i=1}^{NTG} (αi + βi · PTGi + γi · PTGi²)  $/hr    (4)
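Eq. 4 evaluates directly in code. A minimal sketch follows; the cost coefficients in the example are illustrative placeholders, not the actual IEEE 30-bus generator data.

```python
def total_fuel_cost(p_gen, alpha, beta, gamma):
    """Total fuel cost in $/hr per Eq. 4 for thermal units with outputs p_gen (MW)."""
    return sum(a + b * p + g * p ** 2
               for p, a, b, g in zip(p_gen, alpha, beta, gamma))

# Two hypothetical units: 237.5 + 131.25 = 368.75 $/hr
cost = total_fuel_cost([100.0, 50.0], alpha=[0.0, 0.0],
                       beta=[2.0, 1.75], gamma=[0.00375, 0.0175])
```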
264 T. Papi Naidu et al.

2.2 Constraints

The equality OPF constraints are given in Eqs. 5 and 6:

PGi − PDi − Pl = 0    (5)

QGi − QDi − Ql = 0    (6)

Inequality constraints.
The OPF inequality restrictions are:
(a) Generator limits, Eqs. 7–9:

VGi^min ≤ VGi ≤ VGi^max ; i ∈ Ng    (7)

PGi^min ≤ PGi ≤ PGi^max ; i ∈ Ng    (8)

QGi^min ≤ QGi ≤ QGi^max ; i ∈ Ng    (9)

(b) Transformer tap limits, Eq. 10:

Ti^min ≤ Ti ≤ Ti^max ; i ∈ NT    (10)

(c) Shunt compensator limits, Eq. 11:

Qci^min ≤ Qci ≤ Qci^max ; i ∈ Nc    (11)
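The chapter does not state how the inequality restrictions of Eqs. 7–11 are enforced inside the optimizer; a common choice, sketched here purely as an assumption, is to add a quadratic penalty for each box-constraint violation to the fitness value.

```python
def limit_violation(value, vmin, vmax):
    """Squared violation of a box constraint of the form vmin <= value <= vmax."""
    if value < vmin:
        return (vmin - value) ** 2
    if value > vmax:
        return (value - vmax) ** 2
    return 0.0

def penalty(values, vmin, vmax, weight=1e3):
    """Quadratic penalty term for a group of like-bounded variables (e.g. Eq. 7)."""
    return weight * sum(limit_violation(v, vmin, vmax) for v in values)
```

For instance, a generator-bus voltage of 1.15 p.u. against the 0.90–1.10 p.u. limits of Table 1 would contribute weight · 0.05² to the objective.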

3 Whale Optimization Algorithm

Nature-inspired metaheuristic algorithms can be classified into four main categories:
swarm-, evolutionary-, physics-, and human-based algorithms. The swarm-based group
of nature-inspired techniques mimics the social behavior of groups of animals.
Some popular algorithms are ant colony optimization, particle swarm optimization,
the bat algorithm [15], the firefly algorithm, the flower pollination algorithm [16], and WOA.
Among these, WOA was proposed by Mirjalili [17]; it mimics the bubble-net foraging
technique of humpback whales to solve optimization problems. A remarkable trait of
whales is that they are considered extremely intelligent animals with emotions [18].

(a) Searching for prey

The whale position is updated using Eqs. 12–15:

H = |D · Z_rand − Z|    (12)

Z(iter + 1) = Z_rand − B · H    (13)

B = 2 · a · r1 − a    (14)

D = 2 · r2    (15)

where a is linearly decreased from 2 to 0, and r1, r2 ∈ [0, 1].

(b) Encircling prey

In this stage, the whale closes in on its prey using Eqs. 16 and 17:

H = |D · Z_p(iter) − Z(iter)|    (16)

Z(iter + 1) = Z_p(iter) − B · H    (17)

(c) The bubble-net attacking method is given in Eqs. 18 and 19:

Z(iter + 1) = H · e^(bl) · cos(2πl) + Z_p(iter)    (18)

H = |Z_p(iter) − Z(iter)|    (19)

The combined model of the two behaviors is given in Eq. 20:

Z(iter + 1) = Z_p(iter) − B · H                    if p < 0.5    (20)
Z(iter + 1) = H · e^(bl) · cos(2πl) + Z_p(iter)    if p ≥ 0.5

where p ∈ [0, 1].
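The update rules of Eqs. 12–20 can be sketched in a few lines. This follows Mirjalili's standard formulation; the spiral constant b = 1 and the |B| < 1 test used to switch between encircling the best agent and searching around a random agent are details assumed from the original WOA paper rather than spelled out in this excerpt.

```python
import math
import random

def woa_update(position, best, agents, a, b=1.0):
    """One WOA position update (Eqs. 12-20); b is the spiral constant (assumed 1)."""
    r1, r2 = random.random(), random.random()
    B = 2 * a * r1 - a                      # Eq. 14
    D = 2 * r2                              # Eq. 15
    if random.random() < 0.5:               # shrinking-encircling branch of Eq. 20
        # |B| >= 1: explore around a random whale (Eqs. 12-13);
        # |B| < 1: exploit around the best whale (Eqs. 16-17)
        ref = random.choice(agents) if abs(B) >= 1 else best
        return [z_ref - B * abs(D * z_ref - z)
                for z_ref, z in zip(ref, position)]
    l = random.uniform(-1, 1)               # spiral branch of Eq. 20 (Eqs. 18-19)
    return [abs(z_best - z) * math.exp(b * l) * math.cos(2 * math.pi * l) + z_best
            for z_best, z in zip(best, position)]
```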

4 Simulation Results

Table 1 lists the minimum and maximum limits of the control variables of the IEEE 30
bus system.

4.1 IEEE 30 Bus System

The system comprises 30 buses, with 24 load buses and 6 generators. Tap-changing
transformers are connected in lines 6–10, 4–12, 6–9, and 27–28. Shunt capacitors are
placed at nine buses, so there are 19 regulated variables in total. The optimized values
of the regulated variables are given in Table 2, and the convergence characteristic is
displayed in Fig. 1. The true power generation cost obtained with WOA is compared
with MSA and GWO in terms of the control variables in Table 2. From Table 2, it can
be seen that the 19 regulated variables are optimized effectively and that WOA produces
better results than GWO and MSA. With GWO, the cost is 801.41 $/h, the losses
are 9.30 MW, and the total generation is 292.7 MW; with MSA these reduce to a total
generation of 292.4345 MW, losses of 9.0345 MW, and a cost of 800.5099 $/h. With
WOA, these values are further reduced: the losses are 8.8140 MW, the cost is
800.3196 $/h, and the total generation is 292.214 MW.
Table 3 presents the lowest, highest, mean, and standard deviation values of the true
power generation cost over 20 runs. All 20 run values are shown in Fig. 2. It can be
seen that the objective function value is nearly the same across all 20 runs, which
indicates that WOA produces the best values for all trials. The best value,
800.3196 $/hr, is achieved in trial 9, and the worst value, 801.3277 $/hr, in the third
evaluation. The mean over the 20 runs is 800.658 $/hr, and the standard deviation
is 0.2838. A comparison with various algorithms from the literature, such as MPSO,
MFO, MGBICA, GWO, HSFLA-SA, and TLO, is shown in Table 4. The true power
generation cost obtained with WOA is lower than with the other techniques.

Table 1 Boundaries of regulated variables

Variables                      Min in p.u   Max in p.u
Voltages of generator buses    0.90         1.10
Transformer tap positions      0.95         1.05
Size of shunt capacitors       0.0          0.2

Table 2 Results for finest WOA for the reduction of fuel cost in IEEE 30 bus system
Control variables and parameters MSA [13] GWO [5] WOA
PTG1 177.2131 177.06 176.0386
PTG2 48.7326 49.02 48.5459
PTG5 21.4572 21.25 21.2817
PTG8 21.0638 21.71 21.6116
PTG11 11.9657 11.64 12.5939
PTG13 12.0021 12.02 12.1423
V TG1 1.0848 0.9910 1.1
V TG2 1.0653 1.0518 1.1
V TG5 1.03386 1.0665 1.1
V TG8 1.03823 0.9621 1.08869
V TG11 1.0927 0.9600 1.1
V TG13 1.04533 1.0262 1.1
QC10 2.37123 1.9793 4.32262
QC12 2.57918 4.7467 0
QC15 4.20734 2.9839 0
QC17 5 1.2097 2.57489
QC20 3.68771 4.2109 4.11584
QC21 4.95747 2.1081 2.5457
QC23 3.08148 3.6728 1.75619
QC24 4.98767 4.1593 3.97527
QC29 2.48706 3.2265 1.86436
T 11 1.04907 0.9875 0.983227
T 12 0.938762 0.9125 1.00358
T 15 0.970177 0.9875 0.992703
T 36 0.97498 0.9500 1.00521
Total power generation PG (MW) 292.4345 292.7 292.214
Total cost ($/h) 800.5099 801.41 800.3196
Ploss (MW) 9.0345 9.30 8.8140
QC in MVAR, V TG in p.u and PTG in MW

5 Conclusion

In this work, WOA-based OPF is applied with the reduction of the true power cost as the
objective function. WOA is used for the optimization and is found effective compared
with other procedures such as GWO, ABC, MSA, and DE, owing to its stochastic nature
and fast convergence. It has been executed on the IEEE 30 bus system, and the cost has
been reduced. The authors implemented WOA for OPF as a preliminary study; in future,

[Convergence plot: true power cost in $/hr versus number of iterations]

Fig. 1 Convergence for the reduction of generation cost for IEEE 30 bus system using WOA

Table 3 Standard deviation and mean of true power generation cost in IEEE 30 bus system

Measure                      Min in $/hr   Max in $/hr   Mean value in $/hr   Standard deviation
True power generation cost   800.3196      801.3277      800.658              0.2838

[Radar plot of the true power generation cost in $/hr obtained in each of the 20 trials]

Fig. 2 Graph of generation cost for different evaluations



Table 4 Comparison of true power generation cost with different optimization approaches

Method          True power generation cost in $/hr
MSA [13]        800.5099
MPSO [13]       800.5164
MDE [13]        800.8399
MFO [13]        800.6863
FPA [13]        802.7983
RCBBO [13]      800.8703
GWO [5]         801.41
DE [5]          801.23
MGBICA [13]     801.1409
GBICA [13]      801.1513
ABC [13]        800.660
HSFLA-SA [13]   801.7
SFLA [5]        802.51
TLO [5]         801.99
WOA             800.3196

hybrid algorithms combined with FACTS devices are planned for OPF problems, which
may yield improved outcomes.

References

1. D. Asija, P.V. Astick, P. Choudekar, Minimizing fuel cost of generators using GA-OPF, in
Proceedings of First International Conference on Smart System, Innovations and Computing.
Smart Innovation, Systems and Technologies, vol. 79, ed. by A. Somani, S. Srivastava, A.
Mundra, S. Rawat (Springer, Singapore, 2018). https://doi.org/10.1007/978-981-10-5828-8_32
2. H.R.E.H. Bouchekara, Optimal power flow using black-hole-based optimization approach.
Appl. Soft. Comput. 24, 879–888 (2014). https://doi.org/10.1016/j.asoc.2014.08.056
3. H.R.E.H. Bouchekara, M.A. Abido, M. Boucherma, Optimal power flow using teaching-
learning-based optimization technique. Electr. Power Syst. Res. 114, 49–59 (2014). https://
doi.org/10.1016/j.epsr.2014.03.032
4. S. Duman, U. Güvenç, Y. Sönmez, N. Yörükeren, Optimal power flow using gravitational
search algorithm. Energy Convers. Manage. 59, 86–95 (2012). https://doi.org/10.1016/j.enc
onman.2012.02.024
5. A.A. El-Fergany, H.M. Hasanien, Single and multi-objective optimal power flow using grey
wolf optimizer and differential evolution algorithms. Electr. Power Components Syst. 43(13),
1548–1559 (2015). https://doi.org/10.1080/15325008.2015.1041625
6. H.-H. Bouchekara, M.A. Abido, A.E. Chaib, Optimal power flow using an improved
electromagnetism-like mechanism method. Electr. Power Compon. Syst. 44(4), 434–449
(2016). https://doi.org/10.1080/15325008.2015.1115919
7. M. Ghasemi, S. Ghavidel, S. Rahmani, A. Roosta, H. Falah, A novel hybrid algorithm of
imperialist competitive algorithm and teaching learning algorithm for optimal power flow
problem with non-smooth cost functions. Eng. Appl. Artif. Intell. 29, 54–69 (2014). https://
doi.org/10.1016/j.engappai.2013.11.003

8. A. Ramesh Kumar, L. Premalatha, Optimal power flow for a deregulated power system using
adaptive real coded biogeography-based optimization. Int. J. Electr. Power Energy Syst. 73,
393–399 (2015). https://doi.org/10.1016/j.ijepes.2015.05.011
9. R. Roy, H.T. Jadhav, Optimal power flow solution of power system incorporating stochastic
wind power using Gbest guided artificial bee colony algorithm. Int. J. Electr. Power Energy
Syst. 64, 562–578 (2015). https://doi.org/10.1016/j.ijepes.2014.07.010
10. S. Surender Reddy, P.R. Bijwe, A.R. Abhyankar, Faster evolutionary algorithm based optimal
power flow using incremental variables. Int. J. Electr. Power Energy Syst. 54, 198–210 (2014).
https://doi.org/10.1016/j.ijepes.2013.07.019
11. A.E. Chaib, H.R.E.H. Bouchekara, R. Mehasni, M.A. Abido, Optimal power flow with emission
and non-smooth cost functions using backtracking search optimization algorithm. Int. J. Electr.
Power Energy Syst. 81, 64–77 (2016). ISSN 0142-0615. https://doi.org/10.1016/j.ijepes.2016.
02.004
12. P.P. Biswas, P.N. Suganthan, R. Mallipeddi, G.A.J. Amaratunga, Optimal power flow solutions
using differential evolution algorithm integrated with effective constraint handling techniques.
Eng. Appl. Artif. Intell. 68, 81–100 (2018). https://doi.org/10.1016/j.engappai.2017.10.019
13. A.-A.A. Mohamed, Y.S. Mohamed, A.A.M. El-Gaafary, Optimal power flow using moth swarm
algorithm. Electr. Power Syst. Res. 142, 190–206 (2017). https://doi.org/10.1016/j.epsr.2016.
09.025
14. S. Surender Reddy, R.C. Srinivasa, Optimal power flow using glowworm swarm optimization.
Int. J. Electr. Power Energy Syst. 80, 128–139 (2016). https://doi.org/10.1016/j.ijepes.2016.
01.036
15. B. Venkateswara Rao, R. Devarapalli, H. Malik, S.K. Bali, F.P.G. Márquez, T. Chiranjeevi,
Wind integrated power system to reduce emission: an application of Bat algorithm. J. Intel.
Fuzzy Syst. Preprint (Preprint) 1–9 (2021). https://doi.org/10.3233/JIFS-189770
16. D.L. Pravallika, B.V. Rao, Flower pollination algorithm based optimal setting of TCSC to
minimize the transmission line losses in the power system. Procedia Comp. Sci. 92, 30–35
(2016). https://doi.org/10.1016/j.procs.2016.07.319
17. S. Mirjalili, A. Lewis, The whale optimization algorithm. Adv. Eng. Softw. 95, 51–67 (2016).
https://doi.org/10.1016/j.advengsoft.2016.01.008
18. R. Devarapalli, B. Venkateswara Rao, B. Dey, K. Vinod Kumar, H. Malik, F.P.G. Márquez,
An approach to solve OPF problems using a novel hybrid whale and sine cosine optimization
algorithm. J. Intel. Fuzzy Syst. Preprint (Preprint) 1–11 (2021). https://doi.org/10.3233/JIFS-
189763
An Artificial Electric Field Algorithm
and Artificial Neural Network-Based
Hybrid Model for Software Reliability
Prediction

Ajit Kumar Behera, Mrutyunjaya Panda, Sarat Chandra Nayak,


and Ch. Sanjeev Kumar Dash

Abstract Artificial neural networks (ANNs) are popular nonlinear approximation
techniques for solving complex multimodal functions. The performance of such methods
depends largely on the learning method. In contrast to gradient-based ANN learning,
ANN training based on nature-inspired optimization algorithms has been found
competitive. The artificial electric field algorithm (AEFA) is a new optimization
technique that needs few control parameters and possesses robust learning ability; its
application to data mining problems has not yet been explored. This article uses AEFA
to discover the best feasible weight and bias set as well as the number of hidden
neurons of an ANN, thus crafting an optimal ANN structure on the fly. The resulting
hybrid model, AEFA + ANN, is evaluated on modelling and predicting software
reliability datasets. Experiments are conducted on real software failure datasets
considering normalized root mean squared error statistics. The outcomes of the result
analysis, comparative studies, and statistical tests suggest that the AEFA-ANN-based
model is suitable for forecasting.

Keywords Software reliability prediction · Artificial neural network · Artificial
electric field algorithm · Gradient-descent

A. K. Behera (B) · M. Panda


Department of Computer Science and Application, Utkal University, Vani Vihar, Bhubaneswar,
Odisha 751004, India
M. Panda
e-mail: mrutyunjaya.cs@utkaluniversity.ac.in
A. K. Behera · C. K. Dash
Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills,
Patia, Bhubaneswar 751024, India
S. C. Nayak
Department of Computer Science and Engineering, CMR College of Engineering and Technology,
Hyderabad 501401, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 271
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_21

1 Introduction

With increasing size and complexity, developing good-quality software systems quickly
is a major challenge for software developers. As the size increases, the number of
possible failures in the software also increases, which impacts the quality of the
software product. Software reliability is defined as "the probability of a computer
program operating without failure for a particular period of time in a specified
environment" [1]. It is a crucial feature of software quality and an important
consideration when determining the length of time required for software testing.
Consequently, an efficient instrument for predicting software reliability is desired.
Assuming a linear association between past and current data, several parametric models
were recommended in the early days for forecasting software failures. Parametric models
are statistical methods based on certain assumptions, so they have not proved promising
in forecasting the relationship between successive failure times of software. On the
other hand, non-parametric models such as ANNs have been considered a better
substitute for parametric models, and successful ANN applications for forecasting
software reliability exist in the literature [2–6].
Parameter fine-tuning (i.e., finding the optimal weights and biases) of the ANN
structure is a crucial aspect of ANN application that needs human expertise. Usually,
gradient-based methods are used to accomplish this, but they suffer from drawbacks
such as slow convergence and the possibility of landing in local optima. Later, many
evolutionary optimization techniques, inspired by natural phenomena, came forward and
have been used as better substitutes for gradient-based methods [7, 8]. Evolutionary
learning methods such as GA, PSO, and DE are more proficient at searching for the
optimal parameters of an ANN. Since no single technique is suitable for solving all
sorts of problems, researchers have continuously improved existing methods by
enhancing an algorithm [9, 10] or hybridizing several of them [11–13]. Recently, AEFA
has been proposed as an optimization method inspired by the principle of electrostatic
force [14]. AEFA is based on the strong theoretical concepts of charged particles, the
electric field, and the force of attraction between two particles in an electric field.
The learning capacity, convergence rate, and acceleration updates of AEFA have been
established in [14] through solving several benchmark optimization problems.
This work is an initiative toward investigating the potential of AEFA in fine-tuning
the parameters of an ANN, thus designing a hybrid model called AEFA-ANN. It is worth
mentioning that, along with the parameters (weights and biases), another crucial
factor, the optimal number of hidden neurons of the ANN, is also decided by AEFA. The
proposed AEFA-ANN is assessed by forecasting successive failure times of software. The
data pre-processing, input selection, and model design steps are also explained.
The article is structured into four parts: Sect. 2 presents a short description of the
methods and materials, Sect. 3 discusses the experimental outcomes, and Sect. 4 gives
the concluding remarks along with the future scope.

Fig. 1 Single hidden layer neural network

2 ANN

A typical ANN architecture is represented in Fig. 1. The ANN consists of three
layers. The first layer links the input variables of the given problem. The second
layer captures nonlinear relationships between the variables. The weighted output yi
of each neuron i in the hidden layer is calculated using Eq. 1:

yi = f(bj + Σ_{i=1}^{n} wij · xi)    (1)

where xi is the ith input component, wij is the weight between the ith input neuron
and the jth hidden neuron, and bj is the bias; f is a nonlinear activation function.
Suppose there are m nodes in this hidden layer; then these m outputs become the inputs
of the next hidden layer. For each neuron j of the next hidden layer, the input is as
in Eq. 2:

yj = f(bj + Σ_{i=1}^{m} wij · yi)    (2)

This signal flows in the forward direction through each hidden layer until it reaches
the output layer. The output yest is calculated using Eq. 3:

yest = f(bo + Σ_{j=1}^{m} vj · yj)    (3)

where vj is the weight from the jth hidden neuron to the output neuron, yj is the
weighted sum obtained in Eq. 2, and bo is the output bias. Given a set of training
samples S = {xi, yi}, i = 1 … N, to train the ANN, let yi be the desired output of the
ith input sample and yest the computed output for the same input; then the error is
calculated using Eq. 4:

Errori = yi − yest    (4)
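Equations 1 and 3 amount to a standard forward pass. In the sketch below, f is taken to be the logistic sigmoid, which is an assumption; the text only states that f is a nonlinear activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ann_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a single-hidden-layer ANN (Eqs. 1 and 3)."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(weights, x)))   # Eq. 1
              for weights, b in zip(hidden_weights, hidden_biases)]
    # Eq. 3: weighted combination of hidden outputs plus output bias
    return sigmoid(output_bias + sum(v * h for v, h in zip(output_weights, hidden)))
```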

3 AEFA-ANN-Based Forecasting

AEFA is designed on the principle of Coulomb's law of electrostatic force [14].
It simulates charged particles as agents and measures their strength in terms of
their charges. The particles move through the search domain via the electrostatic
force of attraction/repulsion among them. The charges possessed by the particles
are used for interaction, and the positions of the charges are considered the potential
solutions to the problem. According to AEFA, the particle with the highest charge
is the best individual; it attracts particles with lower charges and moves through the
search domain. The mathematical justification of AEFA is illustrated in [14]. Here,
we simulate a potential solution of the ANN as a charged particle and its fitness value
as the quantity of charge associated with that particle. The velocity and position of
a particle at time instant t are updated as per Eqs. 5 and 6, respectively:

Vid (t + 1) = randi ∗ Vid (t) + accelerationid (t) (5)

where randi is a uniformly distributed random number in the range [0, 1].

X id (t + 1) = X id (t) + Vid (t + 1) (6)

The overall AEFA steps are shown in Fig. 2, and Algorithm 1 presents the high-level
AEFA-ANN-based forecasting procedure.

Fig. 2 The process of AEFA

Algorithm 1: AEFA-ANN-based forecasting


Begin
Step 1: Initialize AEFA and ANN parameters
Step 2: Choose input data from the set of software failure data
Step 3: Normalization of input data using sigmoid function
Step 4: With normalized input data and AEFA, train ANN
Step 5: Test the model and keep track of the error signal for performance analysis
End
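Step 4 of Algorithm 1 repeatedly moves each charged particle according to Eqs. 5 and 6. A minimal per-particle move is sketched below; in the hybrid model, each particle's position vector would encode the candidate ANN weights and biases, which is an interpretation of the text rather than code from the paper.

```python
import random

def aefa_move(position, velocity, acceleration):
    """AEFA velocity and position updates for one particle (Eqs. 5 and 6)."""
    # Eq. 5: scale the old velocity by a fresh uniform random number, add acceleration
    new_velocity = [random.random() * v + a
                    for v, a in zip(velocity, acceleration)]
    # Eq. 6: move the particle by its new velocity
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity
```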

4 Experimental Results and Analysis

The proposed approach is applied to three publicly available software failure
datasets, Musa-01, Musa-02, and Lee, containing 101, 163, and 191 failures
respectively, as pairs (t, Yt), where Yt represents the time between successive
failures after the tth modification has been made [4].
The raw data are normalized before being fed to the AEFA + ANN model. The model is
trained through AEFA-based learning and estimates an output. The deviation from the
actual output is considered the error generated by the model. To benchmark AEFA + ANN
performance, five comparative models, gradient-descent ANN (GD + ANN),
genetic-algorithm-based ANN (GA + ANN), differential-evolution-based ANN (DE + ANN),
MLR, and SVM, are developed in a similar manner. The model

Table 1 MAPE values from different forecasts


Data set MLR SVM GD + ANN GA + ANN DE + ANN AEFA + ANN
Musa-01 1.43272 0.27804 0.15327 0.09554 0.09832 0.06386
Musa-02 2.00375 1.04273 0.94388 0.87605 0.85036 0.25255
Lee 1.00736 0.74322 1.00074 0.79745 0.86605 0.17742

Table 2 ARV values from different forecasts


Data set MLR SVM GD + ANN GA + ANN DE + ANN AEFA + ANN
Musa-01 1.02768 0.90468 0.18335 0.08462 0.09635 0.05869
Musa-02 1.52045 0.73844 0.98037 0.88527 0.75211 0.09005
Lee 1.06247 0.77231 0.86299 0.90452 0.69755 0.08562

performance is assessed in terms of the mean absolute percentage error (MAPE) and the
average relative variance (ARV), as in Eqs. 7 and 8. The mean errors from twenty
experiments are summarized in Tables 1 and 2 for MAPE and ARV, respectively. The best
average errors are shown in boldface. For all datasets, AEFA + ANN produced the best
average errors. The failure times estimated by the AEFA + ANN model are plotted
against the actual failure times in Fig. 3, and the corresponding error histograms are
plotted in Fig. 4. It can be observed from both tables that the proposed forecast
generated the lowest error values compared to the others. The forecast plots also show
the closeness of the model estimations to the true observations. The error histograms
for the Lee and Musa-02 datasets show that most observations produced minimal errors,
whereas the error signals are a bit higher for the Musa-01 dataset.

MAPE = (1/N) · Σ_{i=1}^{N} |Observedi − Estimatedi| / Observedi × 100%    (7)

ARV = Σ_{i=1}^{N} (Observedi − Estimatedi)² / Σ_{i=1}^{N} (Observedi − X̄)²    (8)

where X̄ is the mean of the observed values.
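Eqs. 7 and 8 translate directly into code; here the mean of the observed series is used for X̄ in the ARV denominator.

```python
def mape(observed, estimated):
    """Mean absolute percentage error (Eq. 7), assuming positive observed values."""
    n = len(observed)
    return 100.0 / n * sum(abs(o - e) / o for o, e in zip(observed, estimated))

def arv(observed, estimated):
    """Average relative variance (Eq. 8): squared errors over the series variance."""
    mean = sum(observed) / len(observed)
    num = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    den = sum((o - mean) ** 2 for o in observed)
    return num / den
```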

5 Conclusions

In this article, an AEFA-ANN-based hybrid model is proposed to predict the successive
failure times of software. AEFA is used to discover the most feasible ANN parameters
along with the number of hidden neurons of a single-hidden-layer ANN, thus crafting an
optimal ANN structure on the fly. Five comparative forecasts are developed in a
similar manner. To evaluate the proposed and comparative models, experiments

Musa-01

Musa-02

Lee

Fig. 3 Forecasting plots by AEFA + ANN

are conducted on real software failure datasets considering different forecasting
horizons. From exhaustive simulation studies, it is observed that the AEFA-ANN model
is quite efficient at capturing the hidden patterns in the software failure series
data compared with the others. The present study can be extended with improved
versions of AEFA and by adopting other neural models.

Musa-01

Musa-02

Lee
Fig. 4 Error Histogram plots by AEFA + ANN

References

1. M.R. Lyu, Handbook of Software Reliability Engineering, vol. 222 (IEEE Computer Society
Press, McGraw-Hill, 1996)
2. A.K. Behera, S.C. Nayak, C.S.K. Dash, S. Dehuri, M. Panda, Improving software relia-
bility prediction accuracy using CRO-based FLANN, in Innovations in Computer Science
and Engineering. (Springer, Singapore, 2019), pp. 213–220
3. A.K. Behera, M. Panda, Software reliability prediction with ensemble method and virtual
data point incorporation, in International Conference on Biologically Inspired Techniques in
Many-Criteria Decision Making. (Springer, Cham, 2019), pp. 69–77
4. M.K. Bhuyan, D.P. Mohapatra, S. Sethi, Software reliability assessment using neural networks
of computational intelligence based on software failure data. Baltic J. Modern Comput. 4(4),
1016–1037 (2016)
5. K. Juneja, A fuzzy-filtered neuro-fuzzy framework for software fault prediction for inter-
version and inter-project evaluation. Appl. Soft Comput. 77, 696–713 (2019)
6. W.D. van Driel, J.W. Bikker, M. Tijink, Prediction of software reliability. Microelectron. Reliab.
119, 114074 (2021)
7. N. Shadbolt, Nature-inspired computing. IEEE Intell. Syst. 19(1), 2–3 (2004)
8. S.C. Nayak, B.B. Misra, Extreme learning with chemical reaction optimization for stock
volatility prediction. Financ. Innov. 6(1), 1–23 (2020)
9. K. Opara, J. Arabas, Comparison of mutation strategies in differential evolution–a probabilistic
perspective. Swarm Evol. Comput. 39, 53–69 (2018)
10. S. Jiang, Y. Wang, Z. Ji, Convergence analysis and performance of an improved gravitational
search algorithm. Appl. Soft. Comput. 24, 363–384 (2014)
11. S. Nayak, M. Ansari, Coa-honn: cooperative optimization algorithm based higher order neural
networks for stock forecasting. Recent Adv. Comput. Sci. Commun. 13(1), (2020)
12. S.C. Nayak, A fireworks algorithm based Pi-Sigma neural network (FWA-PSNN) for modelling
and forecasting chaotic crude oil price time series. EAI Endorsed Trans. Energy Web 7(28)
(2020).
13. A.K. Behera, M. Panda, S. Dehuri, Software reliability prediction by recurrent artificial
chemical link network. Int. J. Syst. Assur. Eng. Manage. 12, 1–14 (2021)
14. A. Yadav, AEFA: artificial electric field algorithm for global optimization. Swarm Evol.
Comput. 48, 93–108 (2019)
Disaster Event Detection from Text:
A Survey

Anchal Gupta, Monika Rani, and Sakshi Kaushal

Abstract With the increasing amount of online information, detecting and monitoring
disaster events from textual data is challenging. The moment a disaster event happens,
social media and the online web are flooded with information about it. Afterward, the
quantity of articles about the event decreases exponentially. In order to monitor the
successive development and after-effects of disaster events, detecting these events in
online documents and tracking the documents reporting similar events becomes crucial.
The information mined can be used to gain insight into the causes of the events and to
prepare for their aftermath. In this paper, a survey of the utility of text published
on social media and in online news articles has been carried out for disaster event
detection. This survey presents the applicable machine learning approaches and an
analysis of research studies focused on disaster event detection from social media and
online news articles.

Keywords Disaster event detection · Event detection · Disaster event classification

1 Introduction

Event detection is similar to topic detection; it helps in text differentiation and
text categorization. An event is something that happens at a particular place and
time, and detecting events from an input source is called event detection. It has many
practical applications. Automatic event detection faces many challenges due to
different perceptions of the same event by different people and the lack of a proper
definition of an event [1]. This paper deals with the detection of disaster-related
events. Disasters are of two types: natural and man-made. In natural disasters, nature

A. Gupta (B) · M. Rani · S. Kaushal


University Institute of Engineering and Technology, Panjab University, Chandigarh 160014, India
M. Rani
e-mail: monikaubs@pu.ac.in
S. Kaushal
e-mail: sakshi@pu.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 281
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_22

is responsible for causing destruction; for instance, floods, hurricanes, earthquakes,
thunderstorms, tsunamis, and fires cause large-scale destruction of valuable assets.
In man-made disasters, humans are responsible for causing destruction; these include
attacks, civil unrest, wars, etc. Disasters can be detected from various input
sources, such as textual sources, physical sensors, or images. Due to the emergence of
technology and many other factors, network accessibility has increased, and common
people as well as news organizations keep the population updated about these events
through text. In this paper, our main focus is on detecting disaster events from
textual inputs.
In order to process and classify text, some basic steps are implemented: data
collection, data preprocessing, feature extraction and feature vector representation,
disaster event detection, and performance evaluation [2]. In data collection, data can
be gathered from different sources; since the focus of this paper is on text, the
major input sources are social media and news platforms. After the data is collected,
it is preprocessed to clean it and remove unnecessary symbols, words, and numbers. For
example, tweets contain abnormal spaces, special symbols, words from a mixture of
languages, etc. The general preprocessing steps are tokenization, conversion to lower
case, stop-word removal, stemming, and lemmatization; researchers modify the data
according to their requirements. After preprocessing, the text must be converted to a
machine-understandable format. For this purpose, features are extracted, which can be
statistical, textual, or both. Statistical features [3–5] come in various types, for
instance the length of a tweet or the position of query words. With textual features,
words themselves are used as features: word vectors are created using n-grams, tf-idf,
bag-of-words, etc., or word embeddings are used. Feature extraction is sometimes
followed by feature selection to retain only those features that contribute most to
predicting the results; such methods include correlation measures, evaluation using
mutual information (MI), cosine similarity, and Jaccard similarity, followed by
selecting the top k best features [6]. In the next step, disaster events are detected
by applying event detection techniques. Finally, performance is measured
using different measures like accuracy, precision, recall, F1-score, etc.
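The preprocessing steps described above (cleaning, lower-casing, tokenization, stop-word removal) can be sketched in a few lines; the regular expression and the tiny stop-word list are illustrative choices, not those of any particular surveyed study.

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "in", "at", "of", "and", "to"}

def preprocess(text):
    """Minimal tweet cleaning: lowercase, strip URLs and symbols, tokenize, drop stop words."""
    text = re.sub(r"https?://\S+|[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("Flood alert!! The river is rising at 5pm http://t.co/x")
# -> ['flood', 'alert', 'river', 'rising', 'pm']
```

The resulting tokens would then feed a bag-of-words or tf-idf vectorizer before classification.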
This paper is organized as follows: Sect. 2 discusses the analysis of disaster event
detection from the perspective of the input source, Sect. 3 presents the approaches
used in disaster event detection, Sect. 4 presents the discussion, and Sect. 5
concludes the paper.

2 Analysis of Disaster Event Detection from the Perspective of Input Source

The input source plays a great role in event detection. The first and foremost step in
any event detection task is to collect data. Among textual input sources, there are two
major ones: social media and news articles. Each of these has its own importance and
use cases.

2.1 Social Media

Social media is a great source for data collection; it enables speedy information
exchange and event reporting. Several social media platforms, such as Twitter and
Facebook, and other microblogging services, such as Sina Weibo, are available. These
social networks are used to exchange information, which becomes an input source for
researchers. Twitter is widely used by researchers because of its worldwide
availability, popularity, and easy accessibility. Messages sent on Twitter are called
tweets; their character limit was previously 140 characters, but since November 2017
it has doubled to 280 characters [7].
Yun [8] used a bag of disaster event words to detect disaster events from tweets
on trees. Li et al. [9] focused on the detection of crime and disaster-related events
(CDE) and created a system named TEDAS; they developed a CDE-focused crawler to
classify and rank tweets and to estimate tweet locations. Guan and Chen [10] confirmed
the use of social media in damage assessment. According to Xiao et al. [11], social
media helps in the real-time collection of information, the establishment of
situational awareness, and the support of informal public communications. They
examined the relationship between the dependent variable, the number of tweets
generated during a disaster situation, and the independent variables mass, material,
access, and motivation (the MMAM model) during Hurricane Sandy.

2.1.1 Different Types of Work Done on Social Media Datasets for Disaster Event Detection

Work done on social media datasets for event detection can be broadly classified into
three categories; researchers have worked on one of these or a combination of them.

To Classify Disaster-Related Events from Others

The main task of disaster event detection is to separate disaster-related events from
the rest of the dataset. In [4, 12, 13], disaster-related messages are separated
from negative (non-disaster) ones. Olteanu et al. [14] presented a method to generate and expand
queries effectively. They collected keyword-based and location-based samples of
tweets related to six disasters, with the task of distinguishing relevant from
non-relevant tweets. This dataset is publicly available.1

1 https://crisislex.org/data-collections.html.
284 A. Gupta et al.

To Gain Information from Disaster Events

In this category, researchers try to determine what kinds of tweets users post in a disas-
trous situation; different researchers predict different information from the datasets.
Olteanu et al. [15] created a dataset of 26 disasters that happened during 2012 and
2013 and made it publicly available [Ref. Footnote 1]. They viewed content along
three dimensions, informativeness, information type, and sources, to gain insight into
the situation during different disasters. Informativeness is a binary classification task;
information type and sources have six categories each. Tweets were labeled using
crowdsourcing, and the dataset has since been used by many researchers. Pekar et al. [16]
used this dataset and explored four classification problems, namely
relatedness, informativeness, eyewitness, and topics. The first three are binary clas-
sification problems, while the last has six labels. Zahra et al. [17] gained further
insight into eyewitness-related tweets for three natural disasters.
Imran et al. [18] created a dataset from tweets of 19 disaster events that happened
from 2013 to 2015 and performed multiclass classification for topic categorization.
Data annotation was performed through manual labeling as well as crowdsourcing.
Word embeddings of crisis-related tweets were created, and out-of-vocabulary
words were identified from tweets and annotated using crowdsourcing. This dataset2
is available for future research. Alam et al. [19] created a human-labeled
multimodal (text and image) dataset for seven different disasters and three classi-
fication tasks, namely informativeness, humanitarian categories, and damage severity
assessment. They collected tweets using keywords and annotated them using
crowdsourcing.
In [20, 21], the authors used different approaches to classify tweets into 25 classes,
a task provided by the 2018 Text REtrieval Conference Incident Streams track
(TREC-IS).3 Yu et al. [22] classified tweets into five information classes (caution
and advice; casualties and damage; information sources; infrastructure and
resources; donation and aid) for three disaster events: Hurricanes Sandy, Harvey,
and Irma. They performed two types of experiments: training and testing on the
same disaster, and training and testing on different disasters.

To Extract Location

Location detection is another important aspect of disaster event detection.


Researchers try to extract locations from social media datasets so that they
can locate people requesting help in disaster situations. Location identification is
a critical task, and different approaches have been adopted.
According to Kumar and Singh [23], Twitter provides three ways for a user to share
location information: user location, place names, and geo-coordinates. In their paper,
the authors extracted locations from place names mentioned in tweets; to this end, they created
2 https://crisisnlp.qcri.org/
3 http://dcs.gla.ac.uk/~richardm/TREC_IS/

a CNN model to find location-indicative words. Sakaki et al. [24] first separated
earthquake-related tweets from others and then obtained the longitude and latitude
of tweets from GPS data and the registered location of the user. They built
a real-time earthquake reporting system, using particle filtering for location
estimation of events from tweets.
In [3], the authors' main problem is finding the location of disaster victims.
If the user mentioned an address in a tweet, that location is used; otherwise, if the
tweet is geo-tagged, that location is used; otherwise, a Markov model is created to
infer the user's location from the user's previous tweets.
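The fallback chain described for [3] can be sketched as below; the field names, the history format, and the exact first-order Markov estimate are illustrative assumptions, not details from the paper.

```python
from collections import Counter, defaultdict

def estimate_location(tweet, location_history):
    """Fallback chain: explicit address, then geo-tag, then a first-order
    Markov estimate from the user's previous tweet locations.
    (Field names and data shapes here are illustrative assumptions.)"""
    if tweet.get("address"):
        return tweet["address"]
    if tweet.get("geo"):
        return tweet["geo"]
    if not location_history:
        return None
    # Count transitions between consecutive past locations,
    # then pick the most likely location following the last known one.
    transitions = defaultdict(Counter)
    for prev, nxt in zip(location_history, location_history[1:]):
        transitions[prev][nxt] += 1
    last = location_history[-1]
    if transitions[last]:
        return transitions[last].most_common(1)[0][0]
    return last

# Invented example: the user alternates between two places
history = ["home", "office", "home", "office", "home"]
loc = estimate_location({"text": "need rescue"}, history)
```

Since the user's history shows "home" is always followed by "office", the Markov branch predicts "office" for a tweet with neither address nor geo-tag.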
Unankard et al. [25] created a system named location-based emerging event detec-
tion (LEED), which computes a score to detect strong correlations between user
location and event location for emerging events.

2.1.2 Limitations in Social Media Dataset

Social media datasets have many limitations that researchers must tackle:
non-standard English words, grammatical mistakes, spelling mistakes, a mixture of
dialects, non-standard abbreviations, improper sentence structure, a mixture of
languages, abnormal spaces and characters, etc. [12, 23, 26].

2.2 News Articles

News is a good source of information. According to Nugent et al. [27], news channels
help people understand a situation by reporting in two phases: the initial breaking-news
phase and the aftermath phase. People also tend to place more trust in news channels.
In their paper, the authors used a news dataset to classify news articles into seven
natural disaster and critical event types. Although news channels are an authentic
source of information, very little research uses news datasets as input.
Ahmad et al. [28] created a news dataset in three languages, proposed a
separate multi-layer perceptron (MLP) layer for each language together with an ensemble
of CNN and Bi-LSTM, and observed that the proposed method improved performance
in multi-lingual disaster event identification.
Lee et al. [29] constructed a news dataset to detect bursty terms in text and iden-
tify disaster events. They developed a term weighting scheme to score the burstiness
of a term, which helped increase situational awareness during ongoing events.
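A common ratio-style burst score, shown below, compares a term's frequency in the current time window with its smoothed average frequency in earlier windows; this is an illustrative formula, not the exact weighting scheme of [29].

```python
from collections import Counter

def burst_scores(current_window, background_windows, smoothing=1.0):
    """Score each term's burstiness: count in the current window divided by
    its smoothed average count in background windows. (Illustrative formula.)"""
    background = Counter()
    for window in background_windows:
        background.update(window)
    n_bg = max(len(background_windows), 1)
    current = Counter(current_window)
    return {
        term: count / ((background[term] / n_bg) + smoothing)
        for term, count in current.items()
    }

# Invented toy windows: "earthquake" is new, "weather" is routine
past = [["weather", "fine", "today"], ["weather", "sunny"]]
now = ["earthquake", "earthquake", "weather", "collapse"]
scores = burst_scores(now, past)
```

Terms absent from the background ("earthquake", "collapse") score highest, which is exactly the behavior a bursty-term detector wants.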
Tanev et al. [30] presented an automatic grammar learning algorithm to detect
micro-events in a news corpus; they manually created a dataset and then annotated
it with the micro-events present. In the same paper, they also classified eyewitness-
related tweets by extracting various features from them. Min et al. [31] extracted

fire-related sentences from Chosun news articles and estimated location from these
sentences using named entity recognition.

3 Disaster Event Detection Approaches

The following are the different disaster event detection approaches.

3.1 Supervised Approach

In a supervised approach, the machine learns from a labeled dataset; the objective
is to learn from previous data and apply that knowledge to new data. This approach
has two main categories, regression and classification. In regression, an equation
between dependent and independent variables is fitted; in classification, class labels
are learned from the labeled input dataset and then predicted for new data. In
disaster event detection, classification is mainly used. Machine learning techniques
such as support vector machines (SVM), naïve Bayes (NB), k-nearest neighbors (KNN),
decision trees (DT), and random forests (RF), as well as deep learning techniques
such as convolutional neural networks (CNN), bidirectional long short-term memory
(Bi-LSTM), and hierarchical attention networks (HAN), come under this category.
Different researchers use different techniques and compare them.
Table 1 lists the papers that used a supervised approach.
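As a toy illustration of the supervised route, the sketch below implements a minimal multinomial naive Bayes classifier with Laplace smoothing, one of the traditional classifiers named above; the training tweets and class names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial naive Bayes with Laplace smoothing (toy sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_lp = None, -math.inf
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + smoothed log likelihood of each token
            lp = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in doc:
                lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Invented toy training data
train_docs = [
    "flood water rescue help".split(),
    "earthquake damage buildings help".split(),
    "great movie last night".split(),
    "new album out today".split(),
]
train_labels = ["disaster", "disaster", "other", "other"]
clf = NaiveBayesText().fit(train_docs, train_labels)
pred = clf.predict("flood damage rescue".split())
```

The surveyed papers use far richer features and models (SVM, CNN, Bi-LSTM), but the train-then-predict shape of the supervised approach is the same.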

3.2 Unsupervised Approach

In an unsupervised approach, the machine is trained on an unlabeled dataset; it
finds similarities between data items and clusters them accordingly. Clustering and
association are the two types of unsupervised learning, but in disaster event
detection, clustering is mainly used. There are many clustering techniques, for
instance hierarchical, density-based, and k-means clustering. In disaster event
detection, clusters can be formed based on keyword or topic similarity, or by
bounding events in space and time, which is called spatio-temporal clustering.
Topic modeling techniques such as latent Dirichlet allocation (LDA) and latent
semantic analysis (LSA) are also used in clustering because they help find the
hidden topics that best represent the information in a textual document. Table 2
lists the papers that used an unsupervised approach. Since clustering approaches
do not use labeled datasets, performance cannot be quantitatively measured.
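As a sketch of the clustering route, the snippet below groups tweets greedily by Jaccard keyword similarity; it is an illustrative stand-in for the hierarchical, density-based, or k-means methods named above, with invented example tweets and an arbitrary threshold.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster_by_keywords(docs, threshold=0.3):
    """Greedy keyword-similarity clustering: each document joins the first
    cluster whose representative token set is similar enough, otherwise it
    starts a new cluster. (A simple illustrative scheme, not from the survey.)"""
    clusters = []  # list of (representative token set, member indices)
    for i, doc in enumerate(docs):
        tokens = set(doc)
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(i)
                rep |= tokens          # grow the representative set in place
                break
        else:
            clusters.append((tokens, [i]))
    return [members for _, members in clusters]

# Invented toy tweets: two about a flood, one about an earthquake
docs = [
    "flood water rising river".split(),
    "river flood water rescue".split(),
    "earthquake buildings collapse".split(),
]
groups = cluster_by_keywords(docs)
```

Without any labels, the flood tweets end up together and the earthquake tweet forms its own cluster, which mirrors how keyword-similarity clustering surfaces candidate events.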
Table 1 Supervised techniques used by different researchers

Madichetty and Sridevi [32]
  Application/Dataset: Detection of tweets related to need and availability of resources (NAR) by organizations and victims during a disaster, from tweets related to the Nepal and Italy earthquakes of 2015 and 2016
  Technique: Stacked CNN with traditional classifiers (SVM, KNN, NB, DT)
  Performance: The combination of CNN and KNN at the first (base) level and SVM at the second (meta) level gained the highest average accuracy of 77.05%
  Pros: Proposed approach with domain-specific features improved performance
  Cons: Used domain-specific features

Azlan et al. [33]
  Application/Dataset: Predict disaster events from a social media (Twitter) dataset
  Technique: KNN, SVM, NB
  Performance: KNN achieved an accuracy of 0.79
  Pros: Introduced a fuzzy-logic-based system to categorize the severity of a disaster event
  Cons: The severity of a disaster event is based on the number of disaster-related keywords present in the text

Sreenivasulu and Sridevi [4]
  Application/Dataset: Detect a target disaster event from the dataset of tweets related to the 2015 Nepal earthquake and landslide
  Technique: KNN, decision tree, ANN, SVM, and random forest
  Performance: SVM with statistical features achieved an F1-score greater than 70%
  Pros: Statistical features reduced event detection time
  Cons: Selecting appropriate keywords is difficult

Spiliopoulou et al. [34]
  Application/Dataset: Classification of tweets into critical and non-critical classes
  Technique: Bi-LSTM (basic, multitask, adversarial)
  Performance: Adversarial neural nets achieved a macro F1-score greater than 0.60
  Pros: Adversarial neural nets improved classification performance and reduced the network's bias toward the specificity of an event
  Cons: If events are mixed, adversarial neural nets do not perform better

Ameen et al. [13]
  Application/Dataset: 1. Classified flood-related tweets from others; 2. Separated damage-related tweets from other flood-related tweets
  Technique: Supervised (SVM, J48, NB, RF)
  Performance: 1. Random forest achieved the highest accuracy, greater than 80%; 2. SVM achieved the highest accuracy, greater than 90%, in damage detection
  Pros: Damage-related tweets are separated from others for Arabic flood tweets
  Cons: Performance needs improvement

Kumar and Singh [23]
  Application/Dataset: Extraction of location information from earthquake-related tweets
  Technique: CNN
  Performance: Achieved an F1-score of 0.96 with 3 CNN layers and 2 dense layers with dropout and a combination of filter sizes 2, 3, and 4
  Pros: The proposed method detected location information at several granularities
  Cons: Filter sizes of 5 or more had no effect on performance

Singh et al. [3]
  Application/Dataset: 1. Separating tweets asking for help from others during floods; 2. Estimating the location of a user
  Technique: 1. Supervised classification (for tweet classification); 2. Markov model for location estimation
  Performance: 1. Random forest achieves an F1-score greater than 0.75; 2. Location prediction success ratio is 87%
  Pros: Location can be predicted from user history
  Cons: The Markov model method does not work for first-time users or for users who have disabled their location information

Pekar et al. [16]
  Application/Dataset: Four classification tasks on the Twitter dataset
  Technique: Classification methods
  Performance: The proposed disaster-based ensemble method increased recall
  Pros: The proposed method uses information about disasters available in the training dataset
  Cons: No improvement in precision, or lower precision

Yu et al. [22]
  Application/Dataset: Classify tweets during Hurricanes Sandy, Harvey, and Irma into five classes
  Technique: CNN, SVM, LR
  Performance: CNN achieved the highest accuracy, near 0.80
  Pros: CNN outperforms the others in cross-event detection
  Cons: CNN is more time-consuming than the others in cross-event detection

Nugent et al. [27]
  Application/Dataset: Classify disaster events into one of eight categories in a news dataset related to seven different disaster events
  Technique: SVM, CNN, HAN, RF
  Performance: SVM achieved the highest performance, up to 77%
  Pros: HAN performance increased with document length
  Cons: CNN performance decreased with document length

Lin et al. [12]
  Application/Dataset: Filtering earthquake-related messages from others, collected from the Chinese microblogging service Sina Weibo
  Technique: SVM, CNN
  Performance: CNN achieved accuracy up to 91%; SVM with a radial kernel function achieved an average accuracy of 87%
  Pros: Performance increased with CNN
  Cons: Performance is lower when the dataset is unbalanced

Kuila et al. [35]
  Application/Dataset: Sentence-level event extraction from a multi-lingual news dataset
  Technique: Combination of Bi-LSTM and CNN
  Performance: Proposed method achieved an F1-score near 40%
  Pros: Events as well as event arguments are detected and links between them are found
  Cons: Performance is very poor, and sentence-level event extraction is not suited to news documents

Nguyen et al. [36]
  Application/Dataset: Temporal event detection from a disaster-related social media dataset
  Technique: Multi-embedding and deep learning
  Performance: Proposed method achieves accuracy up to 86%
  Pros: Decreases delay in temporal event extraction
  Cons: Tweets are collected using relevant keywords and are focused on a single topic

Table 2 Unsupervised techniques used by different researchers

Cheng and Wicks [37]
  Application/Dataset: Identify disaster events from tweets related to the 2013 London helicopter crash
  Technique: Space-time scan statistics (STSS) and LDA
  Pros: STSS creates clusters across both the space and time dimensions irrespective of tweet content
  Cons: The technique needs improvement, and using geo-tagged tweets bounds the dataset size

Huang et al. [7]
  Application/Dataset: Identify small-scale events at a specified location and time from a Twitter dataset
  Technique: STDBSCAN and LDA
  Pros: The proposed method detected known, unknown, and recurring events from the dataset
  Cons: Parameter values for STDBSCAN are hard to decide

Wang et al. [38]
  Application/Dataset: The adequacy of social media for disaster event detection, from wildfire-related tweets
  Technique: Kernel density estimation (KDE), K-means clustering, and social network analysis
  Pros: Clusters created by the K-means approach help increase situational awareness
  Cons: 1. Used geo-tagged tweets; 2. Social network analysis is based only on retweets

4 Discussion

This paper presents a review of techniques used for disaster event detection. The
main challenges are correctly classifying disaster events from others, gaining
insights into them, and finding location information. Experimental results from
different researchers show a strong relation between disaster events and Twitter
activity, and many labeled Twitter datasets are available. News datasets differ from
social media datasets in content, length, etc. Social media data contain many
mistakes in spelling, sentence formation, abbreviations, grammar, and so on, which
researchers using such data must handle. The performance of a technique depends
largely on the dataset used, the parameters chosen, etc. Before the popularity of
deep learning, traditional machine learning classifiers such as SVM, KNN, and RF
were widely used; over time, researchers' interest in state-of-the-art deep learning
techniques has grown. Techniques such as CNN show better results than SVM, but
they require more time because the network has to learn more parameters. Many
works combined different techniques, which increased performance. Using word
embeddings instead of bag-of-words, tf-idf, etc. also helped increase performance.
Techniques yield good performance when the dataset is balanced, but real-world
applications involve unbalanced data, so better techniques are needed to tackle
this problem.

5 Conclusion

The rapid spread of information during catastrophes has increased researchers'
interest in disaster event detection. This information is used to classify disaster-related
events from others, to gain situational awareness during disasters, and to predict the
location of disasters or victims. In this study, a survey of disaster event detec-
tion from text data is presented, analyzing previous work by input source and
techniques used. The study shows that work on Twitter datasets is more prominent
than on other social media platforms or on news datasets. Social media datasets
are used to obtain situational information such as need- and help-related messages
during disasters. Extracting location information from text is also important for
helping people and locating disasters. Several classification and clustering techniques
are used to detect disaster-related events, with classification the more widely used.
The main limitation of classification techniques is that they perform well only for
events present in the training dataset. Thus, more emphasis should be placed on
clustering techniques, or on developing techniques that can find events not known
in advance. Clustering events within the space and time dimensions is a good idea,
but using only geo-tagged data limits the dataset size and may distort the results.
There is therefore a need for better techniques to detect locations and temporal
expressions in text data.

References

1. C.-C. Pan, P. Mitra, Event detection with spatial latent Dirichlet allocation, in Proceedings of
the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (Association
for Computing Machinery, 2011). https://doi.org/10.1145/1998076.1998141
2. W.Z. Aldyani, F.K. Ahmad, S.S. Kamaruddin, A survey on event detection models for text data
streams. J. Comput. Sci. 16(07), 916–935 (2020)
3. J.P. Singh, Y.K. Dwivedi, N.P. Rana, A. Kumar, K.K. Kapoor, Event classification and location
prediction from tweets during disasters. Ann. Oper. Res. 283(12), 21 (2019)
4. M. Sreenivasulu, M. Sridevi, Comparative study of statistical features to detect the target event
during disaster. Big Data Min. Analytics 3, 121–130 (2020)
5. T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes Twitter users: Real-time event detection
by social sensors, in Proceedings of the 19th International Conference on World Wide Web,
WWW ’10, vol. 01 (2010), pp. 851–860
6. A.H. Hossny, L. Mitchell, N. Lothian, G. Osborne, Feature selection methods for event detection
in Twitter: A text mining approach. Soc. Netw. Anal. Min. 10, 12 (2020)
7. Y. Huang, Y. Li, J. Shan, Spatial-temporal event detection from geo-tagged Tweets. ISPRS Int.
J. Geo-Inf. 7(04), 150 (2018)
8. H. Yun, Disaster events detection using Twitter data. J. Inf. Commun. Convergence Eng. 9, 02
(2011)
9. R. Li, K.H. Lei, R. Khadiwala, K.C.-C. Chang, TEDAS: A Twitter-based event detection and
analysis system, in 2012 IEEE 28th International Conference on Data Engineering (2012),
pp. 1273–1276
10. X. Guan, C. Chen, Using social media data to understand and assess disasters. Nat. Hazards
74, 11 (2014)

11. Y. Xiao, Q. Huang, K. Wu, Understanding social media data for disaster management. Nat.
Hazards, 79(09), 17 (2015)
12. Z. Lin, H. Jin, B.F. Robinson, X.G. Lin, Towards an accurate social media disaster event
detection system based on deep learning and semantic representation 12, 6–8 (2016)
13. Y.A. Ameen, K. Bahnasy, A. Elmahdy, Classification of Arabic tweets for damage event
detection (2020)
14. A. Olteanu, C. Castillo, F. Diaz, S. Vieweg, CrisisLex: A lexicon for collecting and filtering
Microblogged communications in crises, in Proceedings of the 8th International Conference
on Weblogs and Social Media, ICWSM 2014 (2014), pp. 376–385
15. A. Olteanu, S. Vieweg, C. Castillo, What to expect when the unexpected happens: Social
media communications across crises, in Proceedings of the 18th ACM Conference on Computer
Supported Cooperative Work & Social Computing (Association for Computing Machinery,
2015), pp. 994–1009
16. V. Pekar, J. Binner, H. Najafi, C. Hale, V. Schmidt, Early detection of heterogeneous disaster
events using social media. J. Am. Soc. Inf. Sci. 71, 03 (2019)
17. K. Zahra, M. Imran, F. Ostermann, Understanding eyewitness reports on twitter during disasters
05 (2018). https://doi.org/10.5167/uzh-161922
18. M. Imran, P. Mitra, C. Castillo, Twitter as a lifeline: Human-annotated Twitter Corpora for
NLP of crisis-related messages. CoRR abs/1605.05894 (2016)
19. F. Alam, F. Ofli, M. Imran, Crisismmd: Multimodal twitter datasets from natural disasters, in
Proceedings of the International AAAI Conference on Web and Social Media, vol. 12(1) (2018)
20. A. Kruspe et al., Classification of incident-related tweets: Tackling imbalanced training data
using hybrid CNNs and translation-based data augmentation, in Proceedings of the 27th Text
Retrieval Conference (TREC 2018) vol. 16, (Gaithersburg, Maryland, 2018), Nov 14
21. W.G. Choi, S.-H. Jo, K.-S. Lee, CBNU at TREC 2018 incident streams track, in TREC (2018)
22. M. Yu, Q. Huang, H. Qin, C. Scheele, C. Yang, Deep learning for real-time social media
text classification for situation awareness—using Hurricanes Sandy, Harvey, and Irma as case
studies. Int. J. Digit. Earth, 12(02), 1–18 (2019)
23. A. Kumar, J.P. Singh, Location reference identification from tweets during emergencies: A
deep learning approach. Int. J. Disaster Risk Reduction 33, 01 (2019)
24. T. Sakaki, M. Okazaki, Y. Matsuo, Tweet analysis for real-time event detection and earthquake
reporting system development. IEEE Trans. Knowl. Data Eng. 99, 11 (2013)
25. S. Unankard, X. Li, M.A. Sharaf, Location-based emerging event detection in social networks,
in Web Technologies and Applications (Springer, Berlin, 2013). https://doi.org/10.1007/978-3-
642-37401-2_29
26. J. Kersten, A. Kruspe, M. Wiegmann, F. Klan, Robust filtering of crisis-related Tweets 05
(2019). https://elib.dlr.de/127586/
27. T. Nugent, F. Petroni, N. Raman, L. Carstens, J.L. Leidner, A comparison of classification
models for natural disaster and critical event detection from news, in 2017 IEEE International
Conference on Big Data (Big Data) (2017), pp. 3750–3759
28. Z. Ahmad, D. Varshney, A. Ekbal, P. Bhattacharyya, Multi-lingual event identification in
disaster domain 4 (2019)
29. S. Lee, S. Lee, K. Kim, J. Park, Bursty event detection from text streams for disaster manage-
ment, in Proceedings of the 21st International Conference on World Wide Web (Association
for Computing Machinery, 2012). https://doi.org/10.1145/2187980.2188179
30. H. Tanev, V. Zavarella, J. Steinberger, Monitoring disaster impact: detecting micro-events and
eyewitness reports in mainstream and social media, in ISCRAM (2017)
31. K. Min, J. Lee, K. Yu, J. Kim, Geotagging location information extracted from unstructured
data, in 10th International Conference on Geographic Information Science (GIScience 2018)
(2018). https://doi.org/10.4230/LIPIcs.GISCIENCE.2018.49
32. S. Madichetty, M. Sridevi, A stacked convolutional neural network for detecting the resource
tweets during a disaster. Multimedia Tools Appl 80 (2021)
33. F.A. Azlan, A. Ahmad, S. Yussof, A.A. Ghapar, Analyzing algorithms to detect disaster events
using social media, in 2020 8th International Conference on Information Technology and
Multimedia (ICIMU) (2020), pp. 384–389

34. E. Spiliopoulou et al., Event-related bias removal for real-time disaster events. arXiv preprint
arXiv:2011.00681 (2020)
35. A. Kuila, S.C. Bussa, S. Sudeshna, A neural network based event extraction system for
Indian languages, in FIRE (2018), pp. 291–301
36. V. Nguyen, T.N. Anh, H.-J. Yang, Real-time event detection using recurrent neural network in
social sensors. Int. J. Distrib. Sens. Netw. (2019). https://doi.org/10.1177/1550147719856492
37. T. Cheng, T. Wicks, Event detection using Twitter: A spatio-temporal approach. PLoS ONE 9,
06 (2014)
38. Z. Wang, X. Ye, M.-H. Tsou, Spatial, temporal, and content analysis of Twitter for wildfire
hazards. Nat. Hazards 83 (2016)
Context-Adaptive Content-Based
Filtering Recommender System Based
on Weighted Implicit Rating Approach

K. Navin and M. B. Mukesh Krishnan

Abstract The job of a recommender system is to churn through and filter out relevant
information from the unorganized pile of available data. Rating score data are the
key input to recommender engine computations. Rating scores can be explicit, where
users directly give a preference score, or implicit, where user behavior is captured
to compute the score. In many applications where explicit user ratings are not
possible, implicit ratings are computed; implicit rating computation must be carefully
modeled to achieve the most relevant recommendations. Some applications also need
a context-aware feature to tune the filtering model so that it extracts details relevant
to the context. This paper proposes a recommendation application that maps calls for
research proposals published by various funding agencies to aspiring researchers
looking to apply to calls relevant to their domain of interest. It proposes a context-
adaptive content-based filtering recommendation system supported by a service-
oriented, on-demand, cloud-based architecture. The proposed algorithm adapts to
the different calls for proposals published, tuning appropriate weight parameters to
achieve the required level of mapping to relevant research proposal abstracts.

Keywords Recommendation system · Content-based filtering ·
Context-awareness · Cloud-based recommender engine model

K. Navin (B) · M. B. M. Krishnan


Computer Science Department, SRM Institute of Science and Technology, Chennai 603203, India
e-mail: nk3857@srmist.edu.in
M. B. M. Krishnan
e-mail: mukeshkm@srmist.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 295
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_23
296 K. Navin and M. B. M. Krishnan

1 Introduction

Recommender systems have wide potential and have become irreplaceable in recent
years, being utilized in a variety of areas including search queries, social tags, news,
music and movies, books, products, and research articles [1]. Many people consider
information filtering a time-consuming and sometimes exhausting process [2, 3].
Engineers, scientists, and laypeople alike increasingly expect a system with an
effective information filtering mechanism that provides optimal recommendations.
To carry out an effective search over related work, a context-aware recommendation
system needs to be built for the domain area it serves [4, 5]. Recommendation
engines are one concept from artificial intelligence; their job is to search, map, and
provide users with relevant pieces of information according to their preferences and
priorities [6, 7]. These systems may use simple or complicated algorithms depending
on the dimensions of the data involved [2, 8]. Collaborative filtering (CF) and
content-based filtering (CBF) are the two most distinctive fundamental approaches
to designing recommender systems [3, 9]. Research-area recommendation systems
have been around for 20 years [10]. Although this paper falls under the category of
research-area recommendation systems, it focuses on an untapped problem area:
mapping research project calls for proposals from different funding agencies to
researchers with relevant project proposal ideas who aspire to find suitable funding
agencies. Research project funding agencies are fragmented across the domains of
various fields; usually, calls for proposals are published on funding agency Web
sites, which aspiring applicants visit to choose topics of interest manually. This
paper presents an architectural design for a recommendation system that brings
effective collaboration between funding agencies publishing calls for proposals and
researchers willing to contribute work in their domains of interest. The proposed
model is a context-aware, content-based filtering recommendation system [11, 12].
Content-based recommenders deliver recommendations matching the interests of the
user (a user profile featuring those interests) by comparing them with representations
of the content describing an item [13–15]. The proposed work extracts key phrases
from documents and analyzes rich descriptions of both resources and user interests
to derive recommendations [8, 16, 17]. A tuning function in the algorithm can be set
for the desired context: in strict mode, the recommendation engine maps only the
most relevant paper abstracts to the relevant domain of a call for proposals, whereas
another setting adopts a normal approach and avoids false rejections.
Context-Adaptive Content-Based Filtering Recommender System … 297

2 CA-CBF Recommendation Engine

This algorithm uses document classification techniques to recommend the submitted
proposal abstracts (user) d_p[1..n] to the relevant domain areas of the calls for
proposals published d_k[1..m], where d_k[1..m] are the call-for-proposal documents,
each containing klen keyword phrases. These keyword phrases are the profile
features of the domain (item) and are assigned weight values d_w[1..m] set appro-
priately for recommendation. The algorithm assumes that n proposal documents are
to be classified against m calls for proposals at a time. Stop-word removal and
key-phrase validation are done as preprocessing steps. The relevance-ranking logic
takes each processed proposal document d_ps against each call for proposals d_kv
and computes a rating weight W_tc[m][n]. To classify a proposal document against
a call for proposals and decide whether it matches the domain requirement, each
keyword phrase in the call for proposals is taken one at a time and parsed in the
proposal document to find occurrences as well as relevancy. A dictionary corpus
d_c is used to check the relevancy of words equivalent to an identified keyword
phrase. The relevance metric function f(x) assigns a weight value in 1..N when it
identifies the keyword phrase in the set, and 0 when the keyword phrase is not
found in the document. f(x) is then multiplied by the weight value assigned to that
keyword phrase in d_w[1..m]. The recommendation-relevance procedure sorts
W_tc[m][n] by the summed weighted values assigned to the proposal abstract
documents for each call for proposals. Irrelevant proposal documents are filtered
out for a particular call for proposals by comparing the computed weight rating
against a threshold value T, which is set per call for proposals based on the number
of keyword phrases and the corresponding weights assigned. The weight distribu-
tion function distributes the weight values over the key phrases; together with the
threshold, these are the tuning parameters of the recommender's filtering model.

2.1 Algorithm

The following is the algorithm of the context-adaptive content-based filtering
recommendation engine:
298 K. Navin and M. B. M. Krishnan

Context-adaptive content-based filtering (CA-CBF) recommendation algorithm

Input:  dp[1..n]; dk[1..m]; dW[1..m]; dc[]; T → α
        dp[1..n]  'n' proposal documents to be classified and recommended
        dk[1..m]  'm' call for proposal documents containing keyword phrases
        dW[1..m]  'm' call for proposal documents containing weight values for
                  keyword phrases, set through the weight distribution function
        dc        dictionary corpus containing domain-specific keywords and
                  their relational mapping words
        T         threshold which can be varied for filtering relevant proposals

Output: Relevantly ranked proposal documents against call for proposals:
        Reld[m][nsel], where nsel is the number of filtered proposal documents
        recommended for the particular domain m

STEP 1: procedure PREPROCESS(dp[1..n], dk[1..m])
            for i ← 1 to n do
                dps[i] ← STOPWORDREMOVAL(dp[i])
            end for
            for i ← 1 to m do
                dkv[i] ← VALIDATEKEYWORD(dk[i])
            end for
        end procedure

STEP 2: procedure RANKINGDOCUMENTS(dps[1..n], dkv[1..m])
            initialize i = 0, j = 0
            foreach k ∈ dkv[1..m] do
                Klen ← count of key phrases in k
                foreach p ∈ dps[1..n] do
                    Wtc[i][j] ← Σ (x = 1 to Klen) dW[i][x] · f(x)
                end foreach
            end foreach
        end procedure
        where f(x) ← RELEVANCEMETRIC(dc, k, p), with
            f(x) = 0                       if the keyword phrase does not occur in p
            f(x) = computed weight (1..N)  if the keyword phrase (or an equivalent
                                           from dc) occurs in p

STEP 3: procedure FILTER_RECOMMEND(T)
            foreach d[n] ∈ Wtc[m][n] do
                Wtc[m] ← SORTDESCENDING(d[n])
            end foreach
            Reld[m][nsel] ← FILTERRELEVANT(Wtc[m][n], T)
        end procedure
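The three procedures above can be sketched in Python. This is a minimal illustration of the weighted key-phrase matching, ranking, and threshold filtering; the function names and flat data structures are our assumptions, not the authors' implementation, and the relevance metric is simplified to a 0/1 match where the paper allows a computed weight of 1..N:

```python
# Minimal sketch of the CA-CBF ranking and filtering steps.
# The dictionary corpus maps a key phrase to its equivalent/related words.

def relevance_metric(corpus, phrase, document):
    """Return 0 if the key phrase (or a corpus equivalent) is absent, else 1.
    (The paper's f(x) may instead compute a weight of 1..N on a match.)"""
    candidates = [phrase] + corpus.get(phrase, [])
    return 1 if any(c in document for c in candidates) else 0

def rank_documents(proposals, calls, weights, corpus):
    """Compute the rating weight Wtc[m][n] of each proposal against each call."""
    w_tc = []
    for k, call in enumerate(calls):
        row = []
        for proposal in proposals:
            # Sum of (phrase weight * relevance) over the call's key phrases
            score = sum(weights[k][x] * relevance_metric(corpus, phrase, proposal)
                        for x, phrase in enumerate(call))
            row.append(score)
        w_tc.append(row)
    return w_tc

def filter_recommend(w_tc, threshold):
    """Rank proposals per call and keep those above the per-call threshold."""
    results = []
    for k, row in enumerate(w_tc):
        ranked = sorted(enumerate(row), key=lambda t: t[1], reverse=True)
        results.append([(i, s) for i, s in ranked if s >= threshold[k]])
    return results
```

Here `calls` is a list of key-phrase lists and `weights` a parallel list of weight lists, mirroring dk and dW in the pseudocode.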

3 Experimental Setup and Evaluation

This section describes the data set, experimental setup, and evaluation metrics
used in measuring the performance of the recommendation system.

3.1 Data Set

A data set of 500 mock proposal documents was collected from research abstracts
submitted for internal research grants in five disciplines, namely ‘smart farming,’
‘innovative data science projects,’ ‘health care for TB program,’ ‘cyber security
projects,’ and ‘machine learning applications.’ Keyword phrases for each call for
proposals were framed, and weight values for the key phrases were set according to the
domain context, i.e., specific to the call for proposal area, for evaluation purposes. For
example, a sample call for proposals titled ‘smart farming’ had keyword phrases
such as ‘agriculture,’ ‘IoT,’ ‘sensors,’ ‘temperature sensors,’ and ‘humidity sensors,’
which carry higher weight values, while key phrases like ‘data collection’ and
‘analytics’ carry lower weight values. Key phrases with similar contextual meaning,
like ‘sensors,’ ‘temperature sensors,’ and ‘humidity sensors,’ were given the same
weight values. The dictionary corpus was also enriched with adequate reserve
words for popular key phrases.

3.2 Methodology of Evaluation

To assess the performance of the proposed recommendation system, a heuristic
approach was followed to test the algorithm [18, 19]. The module, implemented as a
Python script, was run on a fixed set of call for proposal documents and mock
proposal documents while varying the number of keyword phrases set in the call for
proposal documents. Expert manual classification of the proposal documents against
each call for proposals was done. The ranking satisfaction factor is the measure
of agreement between the automated recommendation classification and the expert manual
classification. The keyword phrases of the call for proposal documents are
varied in steps of 3, and the accuracy of the recommendation system is verified.

3.3 Evaluation Metrics

For the sample of five calls for proposals, 100 relevant proposal documents are
allotted to each call, totaling 500. The recommendation system
was tested for each call for proposals with a data set of 100 proposal documents
containing abstracts portraying the proposal ideas. Identifying false selections and
false rejections of proposal documents served as the performance measure used
to evaluate the proposed recommendation system. The recommendation effectiveness
factor (REF) is based on the F1 score, since the class distribution is imbalanced [20,
21]; it is calculated to provide a metric for the effectiveness of the recommendation
system. The following Eqs. (1) to (3) are used to measure the effectiveness of the
recommendation system, where
TRP → total relevant selected papers of the recommendation system
FSP → falsely selected papers of the recommendation system
FRP → falsely rejected papers of the recommendation system

Precision = TRP / (TRP + FSP)    (1)

Recall = TRP / (TRP + FRP)    (2)

F1 score = 2 * (Precision * Recall) / (Precision + Recall)    (3)
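Equations (1) to (3) translate directly into code; a minimal sketch (the function name is ours, not from the paper):

```python
def recommendation_effectiveness(trp, fsp, frp):
    """Precision, recall, and F1 score per Eqs. (1)-(3).

    trp: total relevant selected papers
    fsp: falsely selected papers
    frp: falsely rejected papers
    """
    precision = trp / (trp + fsp)                       # Eq. (1)
    recall = trp / (trp + frp)                          # Eq. (2)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (3)
    return precision, recall, f1
```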

3.4 Testing and Evaluation

The testing of the recommendation engine was carried out with the given set of sample
data. The results imply that a minimum of five to ten key phrases per call for
proposals is required for minimally effective functioning of the recommendation
system. Similarly, 15 to 20 key phrases per call for proposals is observed to be
the upper threshold limit; beyond that, little impact on performance is observed.
The results show that too few keyword phrases led to false rejections (FRP) of
papers by the recommendation engine, while too many keyword phrases led to false
selections (FSP) of irrelevant papers. Too few keyword phrases meant missing key
information for processing, and hence more false rejection cases; more keyword
phrases brought in overlapping and less impactful information, causing a slight
increase in false selection cases. It is also observed from the results that the
performance of the recommendation engine suffers when keyword phrases are set
irrelevantly or their weights are set improperly. From the prescribed evaluation
process, it is observed that the distribution of weights over key phrases plays a
part in designing the required evaluation metrics for each call for proposals. If the
weights decrease from primary key phrases to secondary key phrases in a roughly
monotonically decrementing pattern, then both the false rejection count and the false
selection count can be kept optimally minimal, and the F1 score can be maintained high
along with high values for precision and recall. For the sample call for proposals ‘smart
farming,’ which is a narrow domain, the intention was to retrieve more accurate
recommendations, so a monotonically decreasing key phrase weight distribution
was followed.
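The two weight-distribution shapes discussed, monotonically decreasing and right-skewed exponentially decreasing, can be sketched as tuning functions. The specific linear and exponential forms below are our assumptions for illustration, not the authors' exact parameterization:

```python
import numpy as np

def linear_decreasing_weights(n, w_max=10, w_min=1):
    """Monotonically decreasing weights: primary phrases weigh only
    gradually more than secondary ones (suits a narrow domain / strict mode)."""
    return np.linspace(w_max, w_min, n)

def exponential_decreasing_weights(n, w_max=10, decay=0.5):
    """Right-skewed, exponentially decreasing weights: a few primary
    phrases dominate (suits a broad domain; prioritizes avoiding
    false rejections at some cost in precision)."""
    return w_max * np.exp(-decay * np.arange(n))
```

Either array would populate one row of dW for a call for proposals with n key phrases.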
Table 1 and Fig. 1a, b show the performance metrics (precision, recall, and F1
score), which are relatively high and reflect accurate retrieval of papers. Table 2 and
Fig. 1c, d show the performance metrics for the sample call for proposals
‘innovative data science projects,’ which is a broad domain. Here the distribution
of weights from primary to secondary key phrases follows a right-skewed,
exponentially decrementing model. The false rejection count is kept almost null,
and the false selection count

Table 1 ‘Smart farming’ call for proposals

Key phrases   Precision   Recall   F1 score
3             0.74        0.81     0.85
6             0.84        0.89     0.91
9             0.99        0.96     0.99
12            0.77        0.99     0.87
15            0.68        1        0.81
18            0.66        1        0.80
20            0.65        1        0.79

Fig. 1 Performance of the recommender engine through precision, recall and F1 score on a given
data set with a varying number of key phrases for various call for proposals

Fig. 1 (continued)

Table 2 ‘Innovative data science projects’ call for proposals

Key phrases   Precision   Recall   F1 score
3             0.74        0.81     0.85
6             0.84        0.89     0.91
9             0.99        0.96     0.99
12            0.77        0.99     0.87
15            0.68        1        0.81
18            0.66        1        0.80
20            0.65        1        0.79

Table 3 ‘Health care for TB program’ call for proposals

Key phrases   Precision   Recall   F1 score
3             0.73        0.88     0.80
6             0.82        0.87     0.84
9             0.96        0.94     0.85
12            0.84        0.97     0.90
15            0.72        1        0.83
18            0.71        1        0.83
20            0.70        1        0.82

Table 4 ‘Cyber security projects’ call for proposals

Key phrases   Precision   Recall   F1 score
3             0.82        0.93     0.87
6             0.85        0.94     0.89
9             0.90        0.98     0.94
12            0.91        1        0.95
15            0.85        1        0.92
18            0.86        1        0.92
20            0.89        1        0.94

is tolerated to be somewhat higher. The F1 score is slightly lower in this model. A
very good recall value can be observed from Table 2, obtained at the cost of decreased
precision. This ensures that avoiding false rejection of proposals remains the primary
objective. The ‘health care for TB program’ and ‘cyber security projects’ calls for
proposals also follow a monotonically decreasing key phrase weight distribution;
Tables 3 and 4 and Fig. 1e, f and g, h show their performance metrics (precision,
recall, and F1 score). Table 5 and Fig. 1i, j show the performance metrics for
the sample call for proposals ‘machine learning applications,’ which follows a
right-skewed, exponentially decrementing key phrase weight distribution.

Table 5 ‘Machine learning applications’ call for proposals

Key phrases   Precision   Recall   F1 score
3             0.65        0.98     0.78
6             0.66        0.99     0.89
9             0.75        1        0.86
12            0.78        1        0.88
15            0.78        1        0.88
18            0.78        1        0.88
21            0.78        1        0.88

4 Implementation Model

Figure 2 shows the architecture model for implementing the research proposal
recommendation system. It follows a service-oriented architecture (SOA) design
approach implemented using the resources of cloud service providers [22, 23]. Web
modules are hosted on Amazon Web Services (AWS) [24, 25] using Elastic Beanstalk
over Elastic Compute Cloud (EC2) with the LAMP stack [20]. The database for the
web portal is maintained in the scalable Amazon Relational Database Service (RDS).
Elastic Beanstalk handles auto-scaling and load balancing of the web applications.
Amazon RDS provides a scalable database to store entries taken from the form, which
are stored in the DB and listed in the queue module. The files submitted through the fund
seekers’ and fund providers’ web portals are stored in Amazon S3 scalable cloud storage
to cater to the growing need to accommodate submitted proposals [24]. Amazon Elastic
Container Service (ECS), suitable for long-running tasks and batch jobs, is used
to host the REST API [26]-based NLP recommendation system service module, which
is deployed in a Docker container. Each submitted fund seeker’s proposal and each newly
fed call for proposals are fetched from cloud storage and compared by the recom-
mendation system, which computes and sends recommendation results to the cloud DB.
Amazon Simple Queue Service (SQS) serves as the bridge between the
web application module and the service-oriented recommendation engine module,
implementing batch processing of recommendations for newly submitted
calls for proposals and newly submitted proposals. Google Firebase [27] is used for
push notifications to notify mobile clients as well as desktop clients.
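The queue bridge between the web tier and the batch recommendation service can be sketched as follows. The JSON field names and the helper function are our assumptions for illustration, not the deployed code; only the message-body construction is executable here, with the AWS calls shown as comments:

```python
import json

def build_recommendation_job(proposal_key, call_keys):
    """Message body the web portal would place on the queue for the batch
    recommendation service (field names are illustrative assumptions)."""
    return json.dumps({
        "proposal_s3_key": proposal_key,      # submitted proposal stored in S3
        "call_for_proposal_keys": call_keys,  # calls to match it against
    })

# The containerized service polling the queue might use boto3, e.g.:
#   sqs = boto3.client("sqs")
#   sqs.send_message(QueueUrl=queue_url,
#                    MessageBody=build_recommendation_job(key, calls))
#   resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
```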

Fig. 2 Architecture of research proposals recommendation system with CA-CBF recommender


engine

5 Conclusion and Future Work

The proposed recommendation system, a tool for researchers to map their research
interests to relevant calls for proposals, employs a context-aware content-based tech-
nique that helps cater to the need for collaboration between researchers and funding
institutions. The proposed implementation model, featuring a service-oriented
architecture-based design approach, can easily be adapted to develop models for similar
recommendation system applications like bibliography reference mapping, e-portal
systems, etc. Future work will involve designing a more sophisticated recommen-
dation system in which training of the system and deep learning play a role in
fine-tuning the algorithm to achieve the best possible results.

Acknowledgements Thanks to the Department of Computer Science, SRM IST, for providing the
data sets of research abstracts to test our recommender engine.

References

1. F. Ricci et al., Introduction to recommender systems handbook, in Recommender Systems


Handbook, ed. by F. Ricci et al. (Springer, US, 2011), pp. 1–35. https://doi.org/10.1007/978-
0-387-85820-3_1
2. F.O. Isinkaye et al., Recommendation systems: principles, methods and evaluation. Egypt. Inf.
J. 16(3), 261–73 (2015). DOI.org (Crossref), https://doi.org/10.1016/j.eij.2015.06.005
3. J. Bobadilla et al., Recommender systems survey. Knowl.-Based Syst. 46, 109–132 (2013).
DOI.org (Crossref), https://doi.org/10.1016/j.knosys.2013.03.012
4. S. Raza, C. Ding, Progress in context-aware recommender systems—An overview. Comput.
Sci. Rev. 31, 84–97 (2019). DOI.org (Crossref), https://doi.org/10.1016/j.cosrev.2019.01.001
5. Q. He et al., Context-aware citation recommendation, in Proceedings of the 19th International
Conference on World Wide Web, (ACM, 2010), pp. 421–430. ACM Digital Library, https://doi.
org/10.1145/1772690.1772734
6. L. Chen, M. Xia, A context-aware recommendation approach based on feature selection. Appl.
Intell. 51(2), 865–875 (2021). DOI.org (Crossref), https://doi.org/10.1007/s10489-020-018
35-9
7. R.S. Kanmani, B. Surendiran, Context-based social media recommendation system, in Recom-
mender System with Machine Learning and Artificial Intelligence, ed. by S.N. Mohanty et al.,
1st ed. (Wiley, 2020), pp. 237–249. DOI.org (Crossref), https://doi.org/10.1002/978111971
1582.ch12
8. A. Naak, H. Hage, E. Aïmeur, A Multi-criteria collaborative filtering approach for research
paper recommendation in papyrus, in E-Technologies: Innovation in an Open World.
MCETECH 2009. Lecture Notes in Business Information Processing, vol. 26, eds. G. Babin,
P. Kropf, M. Weiss (Springer, Berlin, 2009)
9. P.B. Thorat, R.M. Goudar, S. Barve, Survey on collaborative filtering, content-based filtering
and hybrid recommendation system. Int. J. Comput. Appl. 110(4), 0975–8887 (2015)
10. J. Beel et al., Research-paper recommender systems: a literature survey. Int. J. Digit. Libr.
17(4), pp. 305–338 (2016). DOI.org (Crossref), https://doi.org/10.1007/s00799-015-0156-0
11. T.T. Chen, M. Lee, Research paper recommender systems on big scholarly data, in Knowledge
Management and Acquisition for Intelligent Systems. PKAW 2018. Lecture Notes in Computer
Science, vol. 11016, eds. by K. Yoshida, M. Lee (Springer, Cham, 2018)

12. K. Haruna, M.A. Ismail, D. Damiasih, J. Sutopo, T. Herawan, K. Haruna et al., A collaborative
approach for research paper recommender system. PLOS ONE 12(10), e0184516 (2017), by
F. Xia. DOI.org (Crossref), https://doi.org/10.1371/journal.pone.0184516
13. U. Javed, et al., A review of content-based and context-based recommendation systems. Int.
J. Emerg. Technol. Learn. (IJET) 16(03), 274 (2021). DOI.org (Crossref), https://doi.org/10.
3991/ijet.v16i03.18851
14. C. Nascimento et al., A source independent framework for research paper recommendation, in
Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries
(ACM, 2011), pp. 297–306. ACM Digital Library, https://doi.org/10.1145/1998076.1998132
15. S. Philip et al., Application of content-based approach in research paper recommendation
system for a digital library. Int. J. Adv. Comput. Sci. Appl. 5(10), 2014. DOI.org (Crossref),
https://doi.org/10.14569/IJACSA.2014.051006
16. F. Ferrara, N. Pudota, C. Tasso, A keyphrase-based paper recommender system, in Digital
Libraries and Archives. IRCDL 2011. Communications in Computer and Information Science,
vol. 249, eds. by M. Agosti, F. Esposito, C. Meghini, N. Orio (Springer, Berlin, 2011)
17. M.K. Najafabadi et al., A survey on data mining techniques in recommender systems. Soft
Comput. 23(2), 627–654 (2019). Springer Link, https://doi.org/10.1007/s00500-017-2918-7
18. J. Shu, X. Shen, H. Liu, B. Yi, Z. Zhang, A content-based recommendation algorithm for
learning resources, in Multimedia Systems (Springer, 2017) https://doi.org/10.1007/s00530-
017-0539-8
19. Y. Gu et al., Learning global term weights for content-based recommender systems, in Proceed-
ings of the 25th International Conference on World Wide Web, International World Wide Web
Conferences Steering Committee (ACM Digital Library, 2016), pp. 391–400, https://doi.org/
10.1145/2872427.2883069
20. S.A. Gunawardana, G. Shani, A survey of accuracy evaluation metrics of recommendation
tasks. J. Mach. Learn. Res. 10, 2935–2962 (2009)
21. M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems
by coverage and serendipity, in Proceedings of the Fourth ACM Conference on Recommender
Systems (2010), pp. 257–260
22. T. Lorido-Botran et al., A review of auto-scaling techniques for elastic applications in cloud
environments. J. Grid Comput. 12(4), 559–592 (2014). Springer Link, https://doi.org/10.1007/
s10723-014-9314-7
23. A. Biswas, S. Majumdar, B. Nandy, A. El-Haraki, An auto-scaling framework for controlling
Enterprise resources on clouds, in Proceeding of 15th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing (CCGrid), (C4BIE workshop, Shenzhen, 2015), pp. 971–
980
24. S. Afzal, G. Kavitha, Load balancing in cloud computing—A hierarchical taxonomical clas-
sification. J. Cloud Comput. 8(1), 22 (2019). DOI.org (Crossref), https://doi.org/10.1186/s13
677-019-0146-7
25. Deploying a High-Availability PHP Application with an External Amazon RDS Database to
Elastic Beanstalk—AWS Elastic Beanstalk. https://docs.aws.amazon.com/elasticbeanstalk/lat
est/dg/php-ha-tutorial.html. Accessed 31 Jan 2021
26. V. Padghan, Amazon S3 tutorial: Everything about S3 bucket storage, in Great Learning Blog:
Free Resources What Matters to Shape Your Career!, 10 Sept 2020, https://www.mygreatlearn
ing.com/blog/amazon-s3/
27. Documentation, Firebase, https://firebase.google.com/docs, Dec 2020. Accessed 31 Jan 2021
A Deep Learning-Based Classifier
for Remote Sensing Images

Soumya Ranjan Sahu and Sucheta Panda

Abstract Artificial intelligence, the discipline concerned with making machines
intelligent, is a blessing to modern computer science. Many technologies are rising
rapidly with the evolution of AI. Machine learning and deep learning have become two
powerful AI technologies, widely used in research. Deep learning is a revolution in the
field of image processing and has become a popular research method in the current
decade. Deep learning can be readily applied to medical image analysis, underwater
image analysis, and remote sensing images for detection and classification. This paper
aims to design a modified convolutional neural network model for classification using
the RSSCN7 image dataset. The result of the classification is shown with a confusion
matrix, which indicates the classification performance.

Keywords Artificial intelligence · Remote sensing · Deep learning · Neural
network · CNN

1 Introduction

Classification is the act or process of grouping things according to their char-
acteristics or features. Classification in image processing plays an important role in
characterizing and grouping pixels based on their specific features. With the emer-
gence of computer science technology, classification methods can be applied
in various fields such as remote sensing images and underwater images. Remote
sensing images, which may be analog or digital, are images of parts of the Earth's
surface taken from space. Classification of remote sensing images means labeling the
images according to semantic classes like grass, farm, industry, and resident
[1]. Feature extraction is an important part of the image classification problem. An image
classification dataset contains two types of data, training data and test data; the training
dataset is used for learning. In the learning process, feature extraction is done

S. R. Sahu (B) · S. Panda


Department of Computer Applications, VSSUT, Burla, Odisha 768018, India
S. Panda
e-mail: suchetapanda_mca@vssut.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 309
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_24

and the extracted features are fed to the machine. With traditional machine learning
technology, feature extraction is always carried out prior to classification, but it is a
quite lengthy process that also consumes much memory, and it is difficult to extract
features from high-dimensional datasets.
To overcome the high-dimensionality problem, deep learning has been introduced as an
effective method for multi-layer feature extraction [2]. Classification with deep learning
methods performs well compared with traditional methods like minimum distance
supervised classification [3], iterative self-organization (ISO) cluster unsupervised
classification [4], support vector machines [5], and random forest classification [6].
Deep learning works on the basis of neural networks, which mimic the
human brain. Deep learning-based applications can also relate to human
behavior, predict things, and are capable of taking decisions. Deep
learning offers various neural network models in the field of high spatial
resolution remote sensing image applications [2]. With the development of deep
learning, an important demand for research in the remote sensing field is the observation
of Earth through remote sensing, which can identify and classify land use and land
cover (LULC) scenes from space [7, 8]. Using the developed techniques, many Earth
surfaces have been taken into focus for research purposes. The technology can be utilized
in many remote sensing applications like urban green space detection [9], hard target
detection [10], urban flood [11], and so on. The convolutional neural network (CNN) [12]
is a very well-known and popular model used in classification problems. Besides
this, the deep belief network (DBN) [13] and the recurrent neural network (RNN) [14] are
also popular deep learning mechanisms. These are the basic deep learning
algorithms, which underlie neural network architectures like VGG16, ResNet-
50, MobileNetV2, etc. These architectures are capable of learning features from input
data automatically and are heavily used in detection and classification problems.
In this paper, we demonstrate the classification of the remote sensing image
dataset RSSCN7 [15] using deep learning concepts. Deep learning is an advanced
technology working on the principle of neural networks, whose concepts
are derived from the neurons of the human brain. In the human brain, a neuron
consists of multiple dendrites that provide input signals to the neuron. Inside the
neuron there is a cell body containing a nucleus, which performs as the functional element,
and through the axon the signal reaches the nerve ending; thus, the desired output is
generated. An artificial neuron works in a similar manner. Generally, it consists of three
layers, called the input layer, the hidden layer, and the output layer. The input layer
provides the input data to the neuron, the hidden layer performs the calculation, and
the output layer generates the expected output.
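The three-layer flow described above can be sketched as a single forward pass. The sigmoid activation and the weight shapes below are illustrative assumptions, not part of the paper's model:

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """One forward pass: input layer -> hidden layer -> output layer."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(x @ w_hidden)   # hidden layer performs the calculation
    return sigmoid(h @ w_out)   # output layer generates the result
```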

2 Related Works

An important and interesting research area in remote sensing is scene classification
and detection. The main motive of classification is to correctly determine the labels
of images with respect to their belonging classes, according to their features. Many
datasets, such as the UC Merced land use dataset [16], the RSSCN7 dataset [2], the
Aerial Image Dataset (AID) [5], the SpaceNet dataset [17], etc., are considered in
classification research. Since deep learning is an advanced technology, it can fulfill the
requirements of feature extraction and accuracy performance beyond the traditional
methods. In the current decade, a large amount of research has been done with deep
learning concepts in the areas of detection and classification.
Zou et al. [15] proposed a method based on feature selection with deep
learning for remote sensing scene classification using the RSSCN7 dataset. They
trained the data for 100 epochs, and the training accuracy was found
to be 77%. In classification testing, it is clearly seen from the confusion matrix that
a maximum accuracy of 93.5%, the best result, was obtained for the forest class and a
minimum of 65% for the river class. Tayara et al. [18] used the NWPU VHR-10 dataset
and presented a uniform one-stage model based on a CNN architecture to detect
objects and compared it with VGG-16, ResNet-50, and ResNet-101, obtaining a
better mean average precision than these models. Cheng et al. [7]
demonstrated different transfer learning methods, like AlexNet, VGGNet-16,
and GoogLeNet, to determine classification performance using the RESISC45 remote
sensing dataset, which consists of 31,500 images in 45 different classes, each
class containing 700 images of size 256 * 256 pixels. Cheng et al. [19]
designed a new rotation-invariant layer learned with a CNN on the NWPU VHR-10
dataset to improve the performance of multi-class object detection.
In another work, Scott et al. [20] employed CaffeNet-derived and
GoogLeNet-derived DCNNs on the UCM dataset to assess the performance of land
cover classification. They reported a classification accuracy of 95% for both
models at a 90% confidence level when using the augmented dataset with fine-tuned
feature extraction. Zhai et al. [21] developed two object detection methods: a
position-sensitive balancing (PSB) framework and a residual network
working with a fully connected network that can detect 10 multi-
class objects. Li et al. [2] aimed at classification of urban built-up
areas. They developed a hybrid model named Same Model with a Different Training
Rounding (SMDTR) using CNN and CapsNet. They compared the accuracy perfor-
mance of training, testing, and classification of SMDTR-CNN and SMDTR-CapsNet
with CNN and CapsNet on a High Spatial Resolution Remote Sensing (HSRRS)
image dataset. They concluded that the accuracy of SMDTR-CNN is 0.2% higher than
CNN, while the accuracy of SMDTR-CapsNet is 0.6% lower than CapsNet. Pal [6] used
a random forest classifier to classify agricultural areas using a Landsat-7 Enhanced
Thematic Mapper (ETM+) dataset, consisting of 7 classes with 2700 training and
2037 testing pixels. The model achieved a maximum accuracy of 97.31% for
the wheat class and a minimum of 81.9% for the lettuce class; the overall testing accu-
racy was found to be 88.37%. Yang et al. [22] proposed a hierarchical
deep learning framework to classify land use objects in a geospatial database.
They used an encoder-decoder CNN to classify land use at multiple levels,
hierarchically and simultaneously, and introduced a joint optimizer which
predicts by selecting the hierarchical tuple over all levels in which the joint class score

is maximum. It can provide consistent results across different levels. As a result, they
achieved an overall accuracy of up to 92%.
Many researchers have thus worked in the field of remote sensing using
traditional and new technologies with different kinds of remote sensing datasets,
and some have also developed new hybrid methods that perform more accurately
than existing ones. Our research work is based on deep learning methods, which have
been a very advanced and popular technology for some years.

3 Methods

In this section, we briefly explain the principles, procedures, and method-
ology used to achieve the expected results and outcomes of the research, including
a suitable flowchart diagram and the architecture of the entire research work. The
procedure of our research work covers the description of the dataset, the procedure for
data augmentation, and the design of the model used to train the data.

3.1 Dataset Used

In our study, we have used the RSSCN7 remote sensing dataset, released by
Wuhan University in 2015 [23] and collected from Google Earth (Google Inc.) [7]. A
total of 2800 images are present in the dataset, categorized into 7 different
remote sensing image classes: grass, farm, forest, industry, river,
parking, and residential. Each class has 400 images of size 400 * 400 pixels.

Fig. 1 Sample images from the RSSCN7 dataset, from left to right: a farm, b forest, c grass,
d industry, e parking, f resident, g river

Figure 1 shows the different types of images in the RSSCN7 dataset. From the whole
dataset, we use an 80:20 split for training and testing.

3.2 Data Augmentation

Deep learning is a breakthrough technology with the capacity to train on copious
amounts of data. When training on very large datasets, we must use techniques to avoid
over-fitting. Over-fitting is a modeling error: when a model is too complex, with too many
parameters relative to the number of observations, it reacts heavily to small fluctuations
in the training data and therefore yields unsatisfactory predictive performance on unseen
data. To handle this type of error, deep learning offers several strategies such as dropout,
batch normalization, early stopping, and data augmentation. Among these strategies, data
augmentation is a general remedy for over-fitting during training because it enhances the
performance of a model by creating new and different samples for the training dataset.
Data augmentation is a way of creating modified versions of the training data from the
existing dataset: it generates variations of an image by modifying properties such as
rotation, resizing, scaling, and shifting. These variations can strengthen the learning
ability of the model.
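As an illustration of the idea (not taken from our implementation), a few label-preserving variants of an image stored as a list of rows can be sketched in plain Python as:

```python
def augmentations(img):
    """Yield simple label-preserving variants of an image (given as a
    list of rows): the original, a horizontal flip, a vertical flip,
    and a 90-degree clockwise rotation."""
    yield img
    yield [row[::-1] for row in img]           # horizontal flip
    yield img[::-1]                            # vertical flip
    yield [list(r) for r in zip(*img[::-1])]   # rotate 90 degrees clockwise

img = [[1, 2],
       [3, 4]]
variants = list(augmentations(img))
assert variants[1] == [[2, 1], [4, 3]]   # horizontal flip
assert variants[3] == [[3, 1], [4, 2]]   # clockwise rotation
```

In practice a framework's augmentation utilities would also apply random shifts and rescaling; the sketch only shows the principle of deriving several training samples from one image.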

3.3 Convolutional Neural Network

The convolutional neural network (CNN) is the fundamental supervised learning model in
deep learning, and many deep learning architectures are designed using it.
Broadly, a CNN is layered into 3 types of layers known as the convolutional layer, the
pooling layer, and the fully connected layer. Each layer performs a different working
principle and aims to learn features from the input data, making the model more accurate
in its predictions. A simple two-dimensional convolutional neural network architecture is
shown in Fig. 2.

3.3.1 Convolutional Layer

The convolutional layer is the first layer in a CNN. This layer is responsible for extracting
the features of the input data with the help of a weight matrix. The weight matrix
is also known as a kernel and has a given size. The kernel can be regarded as a slider
because it slides over the entire image, visiting every position at least once, to extract the
features of that image. The convolutional layer convolves the input data, calculates the
result, and passes the result to the next layer. Consider an input image I (M, N)
314 S. R. Sahu and S. Panda

Fig. 2 A simple two-dimensional CNN architecture: input data feeds convolution and pooling
layers (feature extraction), followed by flatten, hidden, and classification layers (the fully
connected part) producing the output

where M represents the number of rows and N represents the number of columns.
For convolution, we need a weight matrix W (m, n), where m and n are the numbers of
rows and columns, respectively. The size of the extracted feature matrix is given by
(M − m + 1, N − n + 1). Let R(i, j) be the extracted feature matrix. The extracted
component for each (i, j) can be calculated by the formula shown in Eq. (1),


R(i, j) = Σ_{m=−a}^{a} Σ_{n=−b}^{b} W(m, n) · I(i − m, j − n)        (1)

where a and b are constant integers. For example, if we have an input image of size
6 * 6 and a weight matrix of size 3 * 3, the kernel fits on the image from the starting
coordinate and calculates the feature for that masked area. It then moves over the entire
image, and the process stops after all the features have been calculated. As a result, we
get a 4 * 4 matrix containing the extracted features of the image. After the convolution
operation is done, the extracted features are passed to the next layer for classification.
Figure 3 shows the function of the convolution operation.
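The sliding-window computation described above can be sketched as follows, using the 6 * 6 input and 3 * 3 kernel from the text. Note that, like most CNN libraries, the sketch slides the kernel without flipping it (cross-correlation), whereas Eq. (1) writes the flipped-kernel convolution form:

```python
def conv2d_valid(image, kernel):
    # 'valid' sliding-window correlation as used in CNN layers:
    # for an M x N image and an m x n kernel, the feature map
    # has size (M - m + 1) x (N - n + 1)
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(m) for v in range(n))
             for j in range(N - n + 1)]
            for i in range(M - m + 1)]

image = [[c + 6 * r for c in range(6)] for r in range(6)]  # a 6 x 6 input
kernel = [[1] * 3 for _ in range(3)]                       # a 3 x 3 kernel
feature_map = conv2d_valid(image, kernel)
assert len(feature_map) == 4 and len(feature_map[0]) == 4  # 4 x 4 output
```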

Fig. 3 Feature extraction using the convolution operation: input data passes through
alternating convolution and sub-sampling layers into a fully connected layer and the output



3.3.2 Pooling Layer

The pooling layer is placed between the convolutional layer and the fully connected layer.
Pooling is an operation that reduces the image dimensions, such as height and
width, while storing their important characteristics. Sometimes the input images are so
large that we need to reduce the number of trainable parameters; the pooling layer is
responsible for reducing the number of parameters and calculations used in the network
model. Basically, the pooling layer is of two types: max pooling and average pooling.
Max pooling operates faster and achieves better accuracy than average pooling.
It is a superior operation for selecting invariant features and improving
generalization, and it is the most popularly used pooling operation because of its
better performance and its ability to minimize calculations compared to average pooling.
In our proposed work, we have used the max pooling operation for smooth calculation
and good performance.
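A minimal sketch of the max pooling operation described above, assuming non-overlapping 2 * 2 windows (the common default), is:

```python
def max_pool(x, size=2):
    # non-overlapping max pooling over size x size windows;
    # for size=2 this halves the height and width of the feature map
    M, N = len(x), len(x[0])
    return [[max(x[i + u][j + v] for u in range(size) for v in range(size))
             for j in range(0, N - size + 1, size)]
            for i in range(0, M - size + 1, size)]

x = [[1, 3, 2, 0],
     [4, 6, 5, 1],
     [7, 2, 9, 8],
     [0, 1, 3, 4]]
assert max_pool(x) == [[6, 5], [7, 9]]   # each entry is the max of a 2x2 block
```

Average pooling would replace `max` with the mean of the same window; only the aggregation differs.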

3.3.3 Fully Connected Layer

The fully connected layer is the last phase in a CNN architecture, following the
convolutional layers. As the name suggests, here the neurons are fully connected
with the neurons of the previous layer. This layer uses an activation function, which helps
the network learn even when the model is designed with high complexity. The activation
function in a neural network model is similar to the human brain's capability of deciding
and predicting things: it helps decide which data are to be accepted and which data are
to be fired to the next neuron. Every neuron is connected by a link that is assigned a
weight. A neuron accepts the output signal from the previous layer and converts the
signal into a form that is again fed as input to the next neuron. If the network's output
is in error, the error signal is propagated backwards to update the weights; this method
is known as back propagation.
The activation function adds non-linearity to a neural network model. The most
commonly used activation functions in neural models are sigmoid, ReLU, and
softmax. Sigmoid is used for binary classification. Softmax is a multi-class classifier
that helps classify problems with more than two classes. ReLU stands for rectified linear
unit; it is a piecewise linear function that outputs the input directly when it is
positive and generates zero otherwise.
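The three activation functions can be sketched as follows; the max-subtraction inside softmax is a standard numerical-stability detail, not something prescribed by this paper:

```python
import math

def relu(z):
    # rectified linear unit: pass positive inputs through, zero otherwise
    return [max(0.0, v) for v in z]

def sigmoid(v):
    # squashes a single value into (0, 1), used for binary classification
    return 1.0 / (1.0 + math.exp(-v))

def softmax(z):
    # turns a score vector into class probabilities that sum to 1
    m = max(z)                            # subtract the max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [-1.0, 0.0, 2.0]
assert relu(z) == [0.0, 0.0, 2.0]
assert abs(sum(softmax(z)) - 1.0) < 1e-9
assert 0.0 < sigmoid(0.5) < 1.0
```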

4 LeNet-5

LeNet-5 [24] is a classic CNN model introduced by LeCun in 1998. The model
consists of three convolution layers, designed so that an average
pooling layer is placed after each convolution layer except the last one, which has
no pooling layer. The first, second, and third convolution layers have 6, 12, and 120
filters, respectively. Then, fully connected layers with 84 and 7 nodes are

Fig. 4 LeNet-5 architecture

added after the flatten layer. The last fully connected layer uses the softmax function
for classification of the dataset. The architecture of LeNet-5 is shown in Fig. 4.

5 Proposed CNN Model

In our work, we have proposed a modified CNN architecture by adding
some extra layers to the neural model. We resized the input images to 224 * 224 * 3,
which represents the image height, width, and number of color channels. The
remote sensing images are in RGB color format, so the number of color channels
is 3. We have added six convolutional layers with different filter counts, each using
the ReLU activation function: the first two convolutional layers have 128 filters,
then two convolutional layers have 64 filters, and then two more are added
with 32 filters. A max pooling layer is added after every two convolutional layers. This
arrangement constitutes the feature extraction part of our model. The extracted
features are then fed to a flatten layer, a function that converts the feature
maps into a one-dimensional array: it converts the pooled feature maps to a single
column that is transferred to the hidden layers. In the hidden part, the neurons are
inter-connected across several layers. We used four hidden layers with 128, 64, 32,
and 7 neurons. The last layer gives the desired output of the problem, so the number
of neurons in the last layer must equal the number of classes. The first three hidden
layers use ReLU as the activation function, and the last layer uses softmax for
multi-class classification. Figure 5 shows the proposed CNN architecture for
classifying the RSSCN7 remote sensing image dataset.
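As a sanity check on the layer arrangement described above, the following sketch traces the input shape through the stack. It assumes 'same' padding for the convolutional layers (the paper does not state the padding), so only the pooling layers change the spatial size:

```python
def trace_shape(h, w, c, stack):
    """Trace a (height, width, channels) shape through a conv/pool stack.
    Assumes 'same' padding for convolutions (spatial size preserved,
    channel count set by the number of filters) and non-overlapping
    max pooling (spatial size divided by the pool size)."""
    for kind, arg in stack:
        if kind == "conv":
            c = arg                      # filters set the channel count
        elif kind == "pool":
            h, w = h // arg, w // arg    # pooling shrinks height and width
    return h, w, c

stack = [("conv", 128), ("conv", 128), ("pool", 2),
         ("conv", 64), ("conv", 64), ("pool", 2),
         ("conv", 32), ("conv", 32), ("pool", 2)]
h, w, c = trace_shape(224, 224, 3, stack)
assert (h, w, c) == (28, 28, 32)
# under these assumptions the flatten layer would feed 28 * 28 * 32 = 25088
# values into the hidden layers of sizes 128, 64, 32, and 7
```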

6 Result and Analysis

In this section, the results are divided into three parts. The first part of our
work is to design a robust model that can train on the data and give accurate results.

Fig. 5 Proposed CNN architecture for the RSSCN7 dataset

In the second part, we show the classification result with the help of a
confusion matrix; the confusion matrix shows the percentage of matching classes between
the training and test data. In the third part, we discuss the accuracy of our model.
All the implementations were done using Python 3.8.3 and the Jupyter Notebook
6.0.3 environment on a Dell Intel(R) Core(TM) i3-6100U processor with 4 GB
RAM.

6.1 Model Validation

We briefly discussed our proposed CNN model in the previous section.
Figure 6a shows the training and validation accuracy of the LeNet-5 model, and

Fig. 6 Accuracy and loss curves of the LeNet-5 model: a training versus validation accuracy,
b training versus validation loss

Fig. 7 Accuracy and loss curves of the proposed model: a training versus validation accuracy,
b training versus validation loss

Fig. 6b shows its training and validation loss. The highest training accuracy
of the LeNet-5 model is 86%, and its highest validation accuracy is 87%. Figure 7
shows the accuracy and loss curves of our proposed model. The graphs clearly indicate
that we obtained a highest training accuracy of 90% and a highest validation accuracy
of 94% over 100 epochs. However, the validation accuracy fluctuated suddenly and at
times dropped to 84%. Such fluctuations occur because of over-fitting of the model or
because of the small size of the validation dataset.

6.2 Confusion Matrix

The confusion matrix is a visual representation of the classification results of a
multi-class problem. It shows how many data points are correctly classified and how
many are misclassified; the diagonal elements show the correctly classified rates.
Figure 8a shows the confusion matrix of the LeNet-5 model: LeNet-5 classified the
parking class best but wrongly classified many other classes. Figure 8b shows the
confusion matrix of our proposed CNN model. From the figure, we can clearly see
that a maximum of 98% of images were matched accurately for the industry class, and
a minimum of 62% for the grass class. The grass class was wrongly matched with farm
and forest at 15% and 14%, respectively.

6.3 Accuracy of the Model

The accuracy of the LeNet-5 model was found to be 84%, whereas our
proposed method gave an accuracy of 89%. The accuracy of a model can be deter-
mined from four parameters, i.e., true positive (TP), true negative (TN), false positive

Fig. 8 Confusion matrix of the RSSCN7 dataset: a using LeNet-5, b using our proposed CNN model

(FP), and false negative (FN). We can also determine precision, recall, and F1-score
from these parameters.
True Positive (TP): the correctly predicted positive values, where the model
predicts the actual 'yes' class as yes.
True Negative (TN): the correctly predicted negative values, where the model
predicts the actual 'no' class as no.
False Positive (FP): the falsely predicted values, where the model predicts
a class that does not belong to the actual class (actual no, predicted yes).
False Negative (FN): the values where the actual class is yes but the predicted
class is no.
The accuracy of a model determines its testing performance; how effectively the
model works can be known from this metric. The accuracy of a model is calculated
using the formula shown in Eq. (2).

Accuracy = (TP + TN)/(TP + FP + FN + TN) (2)
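From Eq. (2) and the definitions above, the four metrics can be computed as follows; the counts in the example are purely illustrative, not taken from our experiments:

```python
def metrics(tp, tn, fp, fn):
    # standard definitions in terms of the four confusion-matrix counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (2)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# hypothetical counts for illustration only
acc, prec, rec, f1 = metrics(tp=90, tn=40, fp=6, fn=7)
assert abs(acc - 130 / 143) < 1e-9
assert abs(prec - 90 / 96) < 1e-9
assert abs(rec - 90 / 97) < 1e-9
```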

7 Conclusion and Future Work

The remarkable development of remote sensing technology has provided a huge
amount of remote sensing data. In this paper, we first gave a brief review
of some traditional methods and some newly developed methods used
in remote sensing applications. We then focused on the CNN architecture and
designed our own model based on neural network techniques. The model
performed well and gave a suitable classification result compared to the LeNet-5
model.

However, there are still some problems, such as misclassification of data. In the
future, we will design a model that can give more accurate results and can also
detect objects. In addition, we can use different types of datasets to check the
testing performance.

References

1. Y. Gao, J. Shi, J. Li, R. Wang, Remote sensing scene classification based on high-order graph
convolutional network. Eur. J. Remote. Sens. 54(S1), 141–155 (2021)
2. W. Li, H. Liu, Y. Wang, Z. Li, Y. Jia, G. Gui, Deep learning-based classification methods for
remote sensing images in urban built-up areas. IEEE Access 7, 36274–36284 (2019)
3. M.E. Hodgson, Reducing the computational requirements of the minimum-distance classifier.
Remote Sens. Environ. 25(1), 117–128 (1988)
4. K.-Y. Huang, The use of a newly developed algorithm of divisive hierarchical clustering for
remote sensing image analysis. Int. J. Remote Sens. 23(16), 149–168 (2006)
5. A. Chambolle, An algorithm for total variation minimization and applications. J. Math. Imag.
Vis. 20(1), 89–97 (2004)
6. M. Pal, Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26(1),
217–222 (2007)
7. G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of
the art. Proc. IEEE 105(10), 1865–1883 (2017)
8. L. Gómez-Chova, D. Tuia, G. Moser, G. Camps-Valls, Multimodal classification of remote
sensing images: a review and future directions. Proc. IEEE 103(9), 1560–1584 (2015)
9. A. Canetti, M.C. Gárrastazu, P.P. de Mattos, E.M. Braz, S.P. Netto, Understanding multi-
temporal urban forest cover using high resolution images. Urban For Urban Greening 29,
106–112 (2018)
10. A. Milan, An integrated framework for road detection in dense urban area from high-resolution
satellite imagery and Lidar data. J. Geograph. Inf. Syst. 10(2), 175–192 (2018)
11. Y. Wang, A.S. Chen, G. Fu, S. Djordjevi, C. Zhang, D.A. Savić, An integrated framework
for high-resolution urban flood modelling considering multiple information sources and urban
features. Environ. Model. Softw, 107, 85–95 (2018)
12. S.-C.B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M.T. Freedman, S.K. Mun, Artificial convolution neural
network for medical image pattern recognition. Neural Netw. 8(7–8), 1201–1214 (1995)
13. R. Zand, K.Y. Çamsarı, I. Ahmed, S.D. Pyle, C. Kim, S. Datta, R. Demara, R-DBN: A resistive
deep belief network architecture leveraging the intrinsic behavior of probabilistic devices, in
Proceeding of the ACM Great Lakes Symposium VLSI (GLSVLSI) (2018), abs/1710.00249,
pp. 1–8
14. M.Y. Miao, M. Gowayyed, EESEN: End-to-end speech recognition using deep RNN models
and WFST-based decoding, in Proceeding of the Automatic Speech Recognition Understand
(2016), pp. 167–174, Dec 2016
15. Q. Zou, L. Ni, T. Zhang, Q. Wang, Deep learning based feature selection for remote sensing
scene classification. IEEE Geosci. Remote Sens. Lett 12(11), 2321–2325 (2015)
16. Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification.
in Proceeding of the 18th SIGSPATIAL International Conference on Advances in Geographic
Information Systems (2010), pp 270–279
17. D. Lindenbaum, T. Bacastow, SpaceNet: A remote sensing dataset and challenge series, in
Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2018), pp. 1–10, Jul 2018
18. H. Tayara, K.T. Chong, Object detection in very high-resolution aerial images using one-stage
densely connected feature pyramid network. Sensors 18(10), 3341, (2018)

19. G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object
detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54(12),
7405–7415 (2016)
20. G.J. Scott, M.R. England, W.A. Starms, R.A. Marcum, C.H. Davis, Training deep convolutional
neural networks for land-cover classification of high-resolution imagery. IEEE Geosci. Remote
Sens. Lett. 14(4), 549–553 (2017)
21. H. Zhai, H. Zhang, L. Zhang, P. Li, Cloud/shadow detection based on spectral indices for
multi/hyperspectral optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens.
144, 235–253 (2018)
22. C. Yang, F. Rottensteiner, C. Heipke, A hierarchical deep learning framework for the consistent
classification of land use objects in geospatial databases. ISPRS J. Photogramm. Remote. Sens.
177, 38–56 (2021)
23. S. -C. Hung, H. -C. Wu, M.H. Tseng, Remote sensing scene classification and explanation
using RSSCNet and LIME. Appl. Sci. 10, 6151 (2020)
24. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient based learning applied to documentation
recognition, in Proceedings of the IEEE (1998), pp. 2278–2324 Dec 1998
Performance Evaluation of Machine
Learning Algorithms to Predict Breast
Cancer

S. Siva Sunayna, S. N. Thirumala Rao, and M. Sireesha

Abstract Breast cancer is increasing all over the world year by year [2, 13],
and it is a dominant cancer worldwide. Due to the lack of medical facilities,
many cases are not diagnosed early, although early detection helps to lower the
death rate. This paper applies machine learning (ML) methods to predict whether
a person is suffering from the disease or not. It compares the results of various ML
algorithms: decision tree (DT), logistic regression (LR), random forest classifier (RF),
LGBM classifier, support vector machine classifier (SVC), and K-nearest neighbor (KNN).
The above-mentioned ML algorithms are applied to the data after filling the missing
values, after removing outliers, after applying correlation, and after applying the SMOTE
technique, in order to find the best model.

Keywords Machine learning · Performance evaluation · Prediction · Breast cancer

1 Introduction

Cancer is a generalized term for a group of diseases that can affect any part of the
body. The rapid growth of abnormal cells, which can form a tumor, is called cancer;
it slowly affects adjacent parts of the body and spreads to other organs [1]. Breast
cancer (BC) is one of the major cancers among women worldwide,
representing the majority of new cases and deaths according to the World Health
Organization (WHO) report, which makes it a severe health issue in present society
[2]. WHO reported nearly 10 million cancer deaths in 2020; the most common cancers
by new cases were breast (22.6 lakh cases), lung (22.1 lakh cases), colon and rectum
(19.3 lakh cases), prostate (14.1 lakh cases), skin (non-melanoma) (12 lakh cases),
and stomach (10.9 lakh cases) [3]. Among the other types of cancer, breast cancer is
considered the main cause of cancer death in women in most countries [4]. It is
difficult to overestimate the importance of appropriate breast cancer diagnosis, as the
disease ranks second among all cancers that lead to death in women [5].

S. S. Sunayna · S. N. T. Rao (B) · M. Sireesha


Department of CSE, Narasaraopeta Engineering College, Narasaraopeta, A.P, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 323
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_25

Fig. 1 Worldwide cancer report

Family history, reproductive factors, radiation, obesity, and lifestyle are the factors
that most influence the risk of breast cancer in women. Early diagnosis saves the life
of the patient through advanced treatment. As the most dangerous cancer, it has always
had a high mortality rate. As per recent statistics, 25% of all new cancer cases among
women throughout the world are BC alone, and 15% of total cancer deaths among women
are due to BC. Figure 1 shows the worldwide distribution of different types of cancers,
in different colors, as reported by WHO [1]; the PINK color indicates breast cancer.
The early detection of the disease can improve patients' chance of survival, and it can
also enable timely treatment.
As the population increases continuously, different types of health issues are also
rising exponentially. Nowadays, a huge number of patients visit hospitals for different
types of treatment. In today's world, the majority of hospitals maintain a hospital
management system for patient healthcare data. But this data is rarely used in the
decision-making process, and decisions based only on a doctor's experience may not
be reliable [6, 7]. So, there is a need for an automated disease prediction system for
saving the lives of patients. For the past few years, artificial intelligence and machine
learning (ML) techniques have been widely used in the medical field to build intelligent
healthcare systems [8–12]. Researchers are continuously finding new ways to fight
these cancers.

2 Literature Survey

The formation of irregular tissue growth, which is affected by the production of
estrogen, is the major reason for the development of cancer. A tumor may be dangerous
(malignant) or non-dangerous (benign) [13]. Malignant tumors
expand to other adjacent organs of the body, whereas benign tumors cannot
expand to other organs; also, the cells in malignant tumors divide
more briskly and expand faster than the cells in benign tumors [14]. Early detection of
the disease may be difficult in the initial stage due to the absence of symptoms, but
after some clinical tests it is possible to differentiate exactly between malignant and
benign tumors.
Different preprocessing methods, namely data cleaning, the Synthetic Minority
Oversampling Technique (SMOTE), correlation coefficients, and tenfold cross-validation,
were applied to breast cancer data to obtain the accuracy in [8]. Sireesha Moturi et al.
[9] used information gain as a feature selection technique in order to reduce the
search space, and the proposed method was evaluated on NRI Hospital medical
data. Amrane et al. [14] used the K-nearest neighbor and Naive Bayes algorithms
to construct models and obtained accuracies of 97.51% and 96.19%, respectively.
Shahidi et al. [15] noted that accuracy scores differ between machine
learning models, indicating that other factors such as filling missing values,
outlier removal, and feature selection methods can also influence the ability of models
to achieve the highest accuracy [15]. Researchers are continuously finding new ways to
find the best models.

3 Experimental Setup

To conduct the experiment, the following hardware and software were used: Intel(R)
Core(TM) i3-4005U CPU @ 1.70 GHz processor, 1 MB cache memory, 64-bit
Windows 10 operating system, and Jupyter Notebook 6.1.4 for Python 3.
Data Collection: The Wisconsin Diagnostic Breast Cancer (WDBC) dataset [16]
is used for breast cancer prediction. It was obtained from the UCI machine
learning repository [16]. The dataset contains 32 attributes and 569 records, and the
class distribution is 357 benign and 212 malignant.
The proposed model for the WDBC dataset is depicted in Fig. 2. After preprocessing
the data, the dataset is divided into two parts, a training set and a test set: seventy-five
percent of the records form the training set, and the remaining twenty-five percent form
the test set. The training data is used to construct the model using ML algorithms, and
the model is evaluated on the test data.
Machine learning algorithms are applied to the dataset to create models or to draw
vital conclusions from it. Some popular algorithms are: 1) decision tree (DT),
2) logistic regression (LR), 3) random forest classifier (RFC), 4) light gradient
boosting machine classifier (LGBM), 5) support vector machine classifier (SVC), and
6) K-nearest neighbors (KNN). These algorithms are applied to the WDBC dataset, in
each of the following settings, to evaluate the best method for the identification of
breast cancer:
• After removing missing values
• After removing outliers
• After applying correlation

Fig. 2 Architecture of proposed model

• After applying SMOTE.


After processing, the results of all methods are compared to find the best method.
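The 75:25 split described above can be sketched as below; the seed and the round-up of the test size are our assumptions, but rounding up does reproduce the 143 test records visible in the confusion matrix of Fig. 4 (76 + 50 + 14 + 3):

```python
import math
import random

def split_train_test(records, test_frac=0.25, seed=7):
    # shuffle reproducibly, then hold out test_frac of the records
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_frac)  # round up (assumed)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(569))           # WDBC has 569 records
train, test = split_train_test(records)
assert (len(train), len(test)) == (426, 143)
```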

3.1 Analysis on Dataset to Check Missing Values

Missing Values: Data cleaning should be done on missing or erroneous data. It
can be done by filling the missing values manually, or with the attribute mean,
median, or most probable value.
A heat map was constructed to find the missing values, and no white stripes were
observed in it, so the dataset does not contain any missing values. All the above-said
algorithms were applied to this data and their accuracy was evaluated.

3.1.1 Decision Tree Classifier on the WDBC Dataset

A model is created by passing the training data to the decision tree classifier, and
predictions are made by passing the test data to it. Accuracy, precision, and recall are
calculated for the predicted model.
The recall percentage is 98%, so 98% of malignant records are classified correctly.
The precision percentage is 78%, so 78% of the records predicted positive are truly positive.

The accuracy is 89%, so 89% of all records (TP + TN out of the total) are classified correctly.


Performance metrics are calculated using the confusion matrix, which is used to
evaluate the performance of a classifier against the test data. As shown in Fig. 3, a
confusion matrix contains two rows and two columns and 4 values: the
numbers of false positives (FP), false negatives (FN), true positives (TP), and true
negatives (TN). It allows a complete analysis of the classification, that is, accuracy,
precision, and recall. The confusion matrix for the decision tree classifier in Fig. 4 shows
that 76 true positive values and 50 true negative values are classified correctly, while
14 positive records and 3 negative records are classified improperly.
The ROC curve plots the FP rate on the X-axis and the TP rate on the Y-axis.
ROC AUC stands for 'Area under the ROC Curve'; the decision tree ROC AUC (Fig. 5)
measures the entire two-dimensional area underneath the ROC curve. Figure 6 shows
the precision-recall curve, which indicates the trade-off between precision and recall.
A high area under the curve represents both high recall and high precision, where high
precision relates to a low false positive rate and high recall relates to a low false
negative rate.

Fig. 3 Confusion matrix

Fig. 4 Confusion matrix for decision tree classifier

Fig. 5 Receiver operating characteristics curve (ROC AUC) for decision tree

Fig. 6 Precision-recall curve for decision tree classifier

3.1.2 Analysis of the WDBC Dataset After Checking Missing Values

It can be observed from Tables 1, 2, and 3 that the LGBM classifier achieves the best
accuracy and precision, and the decision tree achieves the best recall, compared to the
other models.

Table 1 Comparing accuracy of various models of WDBC

Model   Accuracy
LGBM    0.972028
RF      0.965035
LR      0.944056
SVC     0.937063
KNN     0.937063
DT      0.895105

Table 2 Comparing precision of various models of WDBC

Model   Precision
LGBM    1.000000
SVC     0.978261
RF      0.945455
KNN     0.907407
LR      0.896552
DT      0.776119

Table 3 Comparing recall of various models of WDBC

Model   Recall
DT      0.981132
LR      0.981132
RFC     0.981132
LGBM    0.924528
KNN     0.924528
SVC     0.849057

3.2 Analysis of WDBC After Removing Outliers

Box plots were constructed for attributes 1–11, 12–22, and 23–32, and it was
observed that outliers are present in the data. The outliers were removed from the
dataset, and models were constructed and evaluated on the cleaned data. Figure 7
depicts the box plot for attributes 1–11. Initially, the dataset contained 569 records;
after removing outliers it contains 544 records, i.e., 25 records were removed.
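Box-plot-based outlier removal typically uses Tukey's IQR rule; a sketch under that assumption (the paper does not state its exact rule) is:

```python
def iqr_bounds(values):
    # Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # simple quartile approximation
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 11, 95]       # 95 is an obvious outlier
lo, hi = iqr_bounds(data)
kept = [v for v in data if lo <= v <= hi]
assert 95 not in kept and len(kept) == 6
```

Applied column by column, records falling outside the bounds of any attribute would be dropped, which is consistent with the reduction from 569 to 544 records.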
It can be observed from Tables 4, 5, and 6 that the accuracy and precision of the
LGBM classifier and the recall of the decision tree are the best compared to the other models

Fig. 7 Box plot of the WDBC dataset after removing worst outliers

Table 4 Comparing accuracy of various models after removing outliers

Model   Accuracy
LGBM    0.975155
RF      0.962733
DT      0.950311
KNN     0.937888
SVC     0.931677
LR      0.919255

Table 5 Comparing precision of various models after removing outliers

Model   Precision
LGBM    0.977778
RF      0.936170
KNN     0.930233
SVC     0.928571
DT      0.867925
LR      0.854167

Table 6 Comparing recall of various models after removing outliers

Model   Recall
DT      0.978723
RF      0.936170
LGBM    0.936170
LR      0.872340
KNN     0.851064
SVC     0.829787

after removing outliers. It can also be observed that the accuracy and recall of the
decision tree classifier increased after removing outliers. Even though the precision
of LGBM decreased compared to the missing-value analysis, it is still the best
algorithm with respect to precision.

3.3 Analysis of WDBC Dataset After Correlation

Correlation is one of the preprocessing methods that can be used to learn the
relationship between features and to identify the relevant attributes; this method is
applicable to continuous data. The original dataset contains 32 attributes, and after
applying correlation, 21 attributes were selected. The above-said ML algorithms were
applied after correlation on the WDBC data. The performance results of the various

Table 7 Comparing accuracy of various models after correlation

Model   Accuracy
LGBM    0.972028
LR      0.972028
RF      0.972028
KNN     0.951049
DT      0.944056
SVC     0.930070

Table 8 Comparing precision of various models after correlation

Model   Precision
LGBM    0.962264
SVC     0.957447
LR      0.945455
RF      0.945455
KNN     0.925926
DT      0.894737

Table 9 Comparing recall of various models after correlation

Model   Recall
LR      0.981132
RF      0.981132
DT      0.962264
LGBM    0.962264
KNN     0.943396
SVC     0.849057

algorithms are mentioned in Tables 7, 8, and 9. It is observed that LGBM gave better
accuracy and precision, and LR gave better recall.
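A correlation filter of this kind rests on the Pearson coefficient, which can be sketched as follows; the 0.9 threshold mentioned in the comment is an illustrative choice, not the paper's:

```python
def pearson(x, y):
    # Pearson correlation coefficient of two equal-length feature columns
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]       # y is a linear function of x
assert abs(pearson(x, y) - 1.0) < 1e-9
# in a correlation filter, one feature of any pair with |r| above a
# chosen threshold (e.g. 0.9) would be dropped as redundant
```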

3.4 Analysis of WDBC Dataset After SMOTE

Most machine learning techniques ignore class imbalance, which in turn gives poor
performance on the minority class. Simply duplicating minority-class records would
balance the dataset but add no new information to the model; SMOTE instead balances
the dataset by synthesizing new minority-class examples through interpolation between
existing minority records. The WDBC dataset was balanced using the SMOTE technique.
Before applying SMOTE, class '0' contained 357 records and class '1' contained 212
records. Figure 8 shows the imbalanced dataset, and Fig. 9 represents the balanced
classes after applying SMOTE.
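A minimal sketch of the interpolation step is shown below; it picks an arbitrary pair of minority records, whereas full SMOTE interpolates between a record and one of its k nearest minority neighbours:

```python
import random

def smote(minority, n_new, rng=None):
    """Generate n_new synthetic minority samples by linear interpolation
    between two existing minority records (a simplification of SMOTE's
    k-nearest-neighbour selection)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()   # position along the segment from a to b
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote(minority, n_new=145)   # 212 + 145 = 357 would balance WDBC
assert len(new_points) == 145
```

Because each synthetic point lies on a segment between two minority records, it stays inside the minority region rather than merely duplicating existing rows.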

Fig. 8 WDBC dataset before applying SMOTE

Fig. 9 WDBC dataset after applying SMOTE

All the above-said algorithms were applied to the balanced dataset to measure the
performance of each algorithm. It can be observed from Tables 10, 11, and 12 that the
DT, RF, and LGBM classifiers gave better performance than after missing-value filling
and after outlier removal.

Table 10 Accuracy of various models after applying SMOTE

Model   Accuracy
LGBM    1
RF      1
DT      1
LR      0.938547
KNN     0.932961
SVC     0.893855

Table 11 Precision of various models after applying SMOTE

Model   Precision
LGBM    1
RF      1
DT      1
LR      0.916667
SVC     0.907895
KNN     0.905882

Table 12 Recall of various models after applying SMOTE

Model   Recall
LGBM    1
RF      1
DT      1
LR      0.950617
KNN     0.950617
SVC     0.851852

Fig. 10 Comparing accuracy of various ML algorithms (LGBM, RF, LR, SVC, KNN, DT) on the
WDBC dataset, after correlation, after removing outliers, and after SMOTE

4 Comparison of All Methods

By observing Figs. 10, 11, and 12, it can be concluded that the LGBM classifier gives
better accuracy and precision on the WDBC dataset, after removing outliers, after
correlation, and after SMOTE. DT gives a better recall percentage.

5 Conclusion

This paper presents a disease prediction model comprising missing-value filling,
outlier removal, SMOTE, feature selection, and classification algorithms

Fig. 11 Comparing precision of various ML algorithms (LGBM, RF, LR, SVC, KNN, DT) on the
WDBC dataset, after correlation, after removing outliers, and after SMOTE

Fig. 12 Comparing recall of various ML algorithms (LGBM, RF, LR, SVC, KNN, DT) on the
WDBC dataset, after correlation, after removing outliers, and after SMOTE

such as the decision tree classifier, random forest classifier, LGBM classifier, logistic
regression, KNN, and SVC. The accuracies vary from one algorithm to another. From
the analysis, it is noticed that the LGBM classifier gives better accuracy and precision
compared to the other classifiers because of its faster training speed, lower
memory usage, higher efficiency, and compatibility with large datasets. This model
will be helpful for physicians for quick and better decision making in the process of
disease diagnosis, enhancing patient safety.

Topology Dependent Ant Colony-Based
Routing Scheme for Software-Defined
Networking in Cloud

B. S. Shylaja, S. R. Deepu, and R. Bhaskar

Abstract The exponential growth of cloud computing has resulted in the rapid
creation of data center networks (DCNs). As data center networks have grown in
popularity, efficient routing has become a critical problem for maximizing network
performance, scalability, and reliability. Traditional link-state algorithms are widely
used in data center networks, but they take a longer convergence time. In DCN,
topology-aware routing algorithms were recently discovered to be efficient. This
paper proposes the ant colony-based shortest path routing algorithm (AC*) to illus-
trate typical topologies. To guarantee unified control of the whole network, this
approach decouples the control plane from the data forwarding plane, and topology
description language (TPDL) files are utilized as prerequisite information to estab-
lish initial topology in the software-defined network (SDN) controllers. Unlike other
topology-aware routing algorithms, the ant colony-based shortest path routing algo-
rithm (AC*) is designed to work on a variety of standard topologies. The AC*
algorithm outperforms traditional link-state routing methods according to the results
of the experiments.

Keywords Data center network · Topology aware · Ant colony · Software-defined


network

1 Introduction

Data centers (DCs) act as a central core in modern ICT ecosystems. The vast network
infrastructure of physical machines in data centers is called a data center network
[1], which enables online information services to run continuously from all over
the world. DC systems are rapidly expanding and redesigning themselves for high

B. S. Shylaja · S. R. Deepu (B)


Department of Information Science and Engineering, Dr. Ambedkar Institute of Technology,
Bangalore 560056, India
R. Bhaskar
Department of Computer Science and Engineering, Don Bosco Institute of Technology, Bangalore
560074, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 337
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_26

reliability and availability to avoid catastrophic failures and system outages [2, 3].
The reliability and availability of a server system in a data center are usually assumed to
depend on the reliability of the server systems involved in the system architecture,
as well as on the availability of each physical machine itself.
Software-defined networking (SDN) [4, 5] is a new networking model that has the
potential to overcome the shortcomings of today's network infrastructures. First, it breaks
vertical integration by separating the network's control logic (the control plane) from the primary routers and switches
that forward the traffic in a data center network (the data plane).
Second, by separating the data and control planes, network switches
can be reduced to basic forwarding units, and the control logic can be implemented
in a logically centralized controller (or network operating system), which makes
implementation, network evolution, and (re)configuration easier [6].
It is interesting to see how manipulating a system with similar components
yields different outcomes, since each computable node in a DCN interacts with other
nodes through the topology of the network. Even if the total number of components
remains unchanged, proper allocation and networking can greatly increase the
system's efficiency and availability. The impact of subsystem allocation and inter-
connection on overall device performance and availability in DCNs has received
little attention.
An efficient framework is required for communication among the physical
machines in a data center network, which is essential for data center agility
and reconfigurability. While maintaining high reliability/availability, capacity, and
throughput, the DCN must be able to respond to a wide range of application service
demands. In data centers, top-of-rack (ToR) switches are linked to end-of-rack
(EoR) switches, which are then linked to core switches.
While other aspects and characteristics of a data center network must be balanced,
using many small, similar commodity switches will dramatically diminish the
building cost of a new data center [6]. Furthermore, if the DC's size needs to be
scaled out, pod deployment in a fat-tree topology can proceed gradually with
no downtime or rewiring. Moreover, the greatest
benefit of the fat-tree topology [7] is that network software does not need to be written
to be network-aware. Cabling difficulty, on the other hand, is the most serious
deployment challenge of the fat-tree topology.
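As a concrete check of the fat-tree scaling discussed above: the canonical fat-tree built from k-port switches has k pods with k/2 edge and k/2 aggregation switches each, (k/2)^2 core switches, and k^3/4 servers in total. The following sketch (our own helper class, not from the paper) computes these counts:

```java
// Sizing a canonical fat-tree built from k-port switches (k even):
// k pods, each with k/2 edge and k/2 aggregation switches,
// (k/2)^2 core switches, and k^3/4 servers overall.
public class FatTreeSize {
    static int coreSwitches(int k) { return (k / 2) * (k / 2); }
    static int podSwitches(int k)  { return k * k; }     // k pods * (k/2 + k/2)
    static int servers(int k)      { return k * k * k / 4; }

    public static void main(String[] args) {
        int k = 4; // the pod-4 case of Fig. 2
        System.out.println(coreSwitches(k) + " " + podSwitches(k) + " " + servers(k));
        // 4 core switches, 16 pod switches, 16 servers
    }
}
```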
In several ways, fat-tree outperforms other DCN topologies. Fat-tree outperforms
BCube and DCell in terms of performance metrics like latency and throughput. Fat-
tree DCNs, unlike three-tier topologies, do not require high-end switches or
high-speed connections, lowering overall deployment costs substantially [6]. Scal-
ability, route diversity, latency, throughput, power usage, and cost are the most
common metrics used to evaluate a DCN in practice [8]. The ability of DCNs to withstand
multiple failures (of links, switches, and compute nodes) has recently become an
important feature for supporting long-running online services [8]. To increase
the reliability and availability of DCNs, stochastic models must be used to model
and evaluate fault-tolerance characteristics.
In a data center network, network topology and the routing protocol that goes
with it are important elements of application success. In the following areas, new

topologies and routing protocols have recently been researched to enhance network
performance.
(1) High-bandwidth: Many existing data center network applications, for
example MapReduce, Hadoop, and Dryad, are data-intensive and necessitate exten-
sive intra-network communication. In densely connected data center networks,
common features include high bisection bandwidth and multiple parallel
paths between any two servers. It is important to provide routing protocols that
can take advantage of the network's bandwidth and its variety of routes.
(2) Flexibility: According to a recent survey, the configuration of a data center network will change after it is set
up: 93% of data center operators in the United States
and 88% of data center operators in Europe plan to expand their facilities. As
a result, a data center network should be able to accommodate incremental expansion,
such as adding servers and network bandwidth, without removing or replacing
existing switches.
(3) Scalability: In a data center network, routing and forwarding should be based
on a small forwarding state in the switches and should scale to huge networks.
Since forwarding tables use costly memory that must keep up with increasingly fast line speeds, scala-
bility of the forwarding tables is highly desired in huge enterprise and data center
networks. If the forwarding state is small and does not grow with the size
of the network, we can build large data centers with relatively inexpensive
switches and eliminate the need for switch memory upgrades as the network
expands.

2 Literature Study

Numerous network topologies and routing methods have been proposed in recent
studies. Each has its own network design, routing algorithms, and failure prevention and
recovery mechanisms.
According to the DCN architecture classification in [9], DCNs are divided into three groups:
three-tier [5], fat-tree [6], Portland [7], and F2Tree [8] are switch-centric archi-
tectures; DCell [9], Ficonn [10], and MCube [11] are server-centric architectures;
and Helios [12] is a hybrid/enhanced architecture. Server networks in data centers
are usually built using two switch-centric topologies (three-tier, fat-tree) and two server-
centric topologies (BCube, DCell). Fat-tree (and its variants) may be a good fit for
DCN topologies in mass-produced DCs of companies like Google and Facebook.
BCube is a high-performance and reliable MDC network architecture. Demands
from data-intensive computing, recent technological developments, and special MDC
specifications all influenced the design and implementation of BCube. Rather than
a switch-oriented approach, the BCube network architecture uses a server-
centric approach: it relies on commodity switches and places the intelligence on the
MDC servers. Between any two servers, BCube provides several parallel short paths.
BSR is the source routing protocol used by BCube [10].

DCell can use a robust and inexpensive single-path unicast routing algorithm due to
its recursive design. DCell routing was designed with a divide-and-conquer strategy
in mind. To calculate the path from src to dest in a level-k DCell, this method first computes
the intermediate link (n1, n2) that connects the two level-(k-1) DCells. The next step is to
compute the two sub-paths, from src to n1 and from n2 to dest. In DCell
routing, the final route combines the two sub-paths with (n1, n2) [11]. A lot of
work has been done on the modernization of routing protocols. Pre-computed backup
paths, such as MPLS fast-reroute (see RFC 4090) and IP restoration [12], are not
a good fit for huge, scaled data center networks. FCP [13] is a modern routing
model that suggests finding a working route without knowing the entire topology.
Some studies examined how to improve efficiency by changing timing parameters;
this, however, is not a significant change that applies to SRP. No sufficient work
has been done on routing protocols to replace OSPF in data center networks. Three
simple steps make up Hedera's control loop. It starts by looking for massive flows at
the edge switches. It then calculates good paths for the large flows by estimating their
natural demand and using placement algorithms. Finally, these paths are installed on
the switches [14].

3 System Architecture

To achieve better efficiency with dynamic routing, the routing strategy is continuously
updated based on the current state of the network. However, this usually requires
several operations and is thus more costly. As shown in Fig. 1, data center networks typically
have regular coordinates, addressing, and connections between nodes that can be
expressed recursively. Designers must have a well-defined description method to fully
utilize them.
This section introduces a method for formally describing standard network topolo-
gies. As Fig. 2 shows, the regularities of a topology are more explicitly illustrated in a formal-
ized way with this approach. It is also important to have a domain-specific language.
A topology description language, which acts as a link between formalized formulas
and routing programmers, is used to obtain more intuitive and parseable forms of
topology description.
As shown in Fig. 3, the AC* algorithm is a TPDL-based SDN
routing scheme for data center networks. It focuses on lowering topology discovery overheads and replacing
the conventional shortest path algorithm with a more robust route selection scheme.
The use of TPDL as prior knowledge is one of the core concepts in AC*.
By leveraging the knowledge in TPDL, the controller creates a simple environment for ensuring that the entire network
works. AC* can accommodate topology
changes and failures with additional components.
Data center networks are comparatively stable and rarely change. Because
of these characteristics, only a small amount of variable information, such as link
failures, needs to be obtained at system startup, while the vast majority of static and rarely
changing topology information is provided before system startup. Compared
with a typical LLDP topology discovery mechanism, the total device overhead in SDN is
therefore significantly lower.

Fig. 1 System architecture

Fig. 2 Pod four architecture

Fig. 3 AC* controller architecture
The TPDL parser is responsible for processing TPDL input files and sending the results
to the topology manager. The topology manager creates a topology from the TPDL
data and updates it while the controller is running to keep it consistent with the current
condition of the network. As its name implies, the routing calculator selects routes
using the AC* algorithm. The routing calculator finds the best path for a flow when
the OpenFlow module delivers a packet-in request. Any topology changes are immediately
communicated to the fault processor by the topology manager.
The AC* switch architecture is given in Fig. 4. A fault detector (FD) module has
been added to the OpenFlow switches; it uses a hello message processor (HMP) to detect faults
between switches and their neighbors. The HMP periodically sends hello
messages carrying the switch identifier to all connected ports.
Furthermore, all received hello packets are forwarded to the HMP so that the
details of the neighbors can be saved. The FD watches for hello messages from
the neighbors to detect problems between them. If a hello message is
not received within a predetermined time, the FD reports a fault to the
controller through the OpenFlow module.
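A minimal sketch of this timeout logic follows. The class, method names, and the timeout value are our own illustrative assumptions; the paper does not give an implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the FD logic: a neighbor is declared faulty when no hello
// message has been seen from it within the timeout window.
public class FaultDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHello = new HashMap<>();

    FaultDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Called by the HMP whenever a hello packet arrives from a neighbor.
    void onHello(String neighborId, long nowMillis) {
        lastHello.put(neighborId, nowMillis);
    }

    // True if the neighbor is unknown or has been silent past the timeout.
    boolean isFaulty(String neighborId, long nowMillis) {
        Long seen = lastHello.get(neighborId);
        return seen == null || nowMillis - seen > timeoutMillis;
    }

    public static void main(String[] args) {
        FaultDetector fd = new FaultDetector(3000);
        fd.onHello("s2", 0);
        System.out.println(fd.isFaulty("s2", 2000) + " " + fd.isFaulty("s2", 5000));
        // false true
    }
}
```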
In the AC* implementation, normal Open vSwitch switches [15] are transformed into AC* switches using the
proposed architecture. Traditional SDN implementations
are compatible with AC* controllers and switches, allowing AC* to reuse all of SDN's
existing algorithms and infrastructure. For example, the traffic engineering methods described in
[16] and the QoS algorithms described in [17] can still be used on AC*
with little modification, and they can benefit from the AC* topology information.

Fig. 4 AC* switch architecture

3.1 Procedure 1. AC* Crossover Operation

Step 1: Assume P1's path is (l1, m1, n1) and P2's path is (l2, m2,
n2). A crossover operation is performed at m1 and m2, and the new route obtained is
P3: (l1, m2, m1, n1).
Step 2: The new route P3 is finalized by deleting any duplicated switch.
Step 3: Another new path, P4, is obtained in the same way.
Step 4: The fitness function is applied to P1, P2, P3, and P4, and the best
path is selected.
Step 5: Mutation. The mutation operation depends on a predetermined mutation proba-
bility. For the mutation process, two points of the
offspring's path are chosen at random.
For instance, in the case of g1,

g1 = (2, 4 | 7, 6, 5 | 8, 9, 3) (1)

After the mutation process swaps the ‘5’ and the ‘3’, g1 becomes

g1' = (2, 4 | 7, 6, 3 | 8, 9, 5) (2)
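Steps 1 and 2 can be sketched as follows. This is an illustrative Java sketch with names of our own choosing (switches are represented as strings); the paper's own pseudocode is given in Algorithm 1:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch of the path crossover of Procedure 1: splice the crossover switch
// of one parent in front of the other parent's crossover switch, then drop
// duplicated switches so the offspring remains a simple path.
public class PathCrossover {
    static List<String> crossover(List<String> p1, List<String> p2, int cut1, int cut2) {
        List<String> child = new ArrayList<>(p1.subList(0, cut1));
        child.add(p2.get(cut2));                  // switch borrowed from the other parent
        child.addAll(p1.subList(cut1, p1.size()));
        // LinkedHashSet removes duplicates while keeping the visiting order
        return new ArrayList<>(new LinkedHashSet<>(child));
    }

    public static void main(String[] args) {
        List<String> p1 = List.of("l1", "m1", "n1");
        List<String> p2 = List.of("l2", "m2", "n2");
        System.out.println(crossover(p1, p2, 1, 1)); // [l1, m2, m1, n1]
    }
}
```

With the cut points at m1 and m2, the offspring matches the route P3 = (l1, m2, m1, n1) of Step 1.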

3.2 Algorithm 1: Pseudo code of AC* Crossover Operation

void crossover() {
    Random rmd = new Random();
    // choose a random crossover point
    int crossoverpt = rmd.nextInt(population.individuals[0].geneLength);
    for (int k = 0; k < crossoverpt; k++) {
        // swap values among the parents
        int tmp = fittest.genes[k];
        fittest.genes[k] = secondFittest.genes[k];
        secondFittest.genes[k] = tmp;
    }
}

The procedure for the second mutation operation is outlined below.



3.3 Procedure 2. AC* Mutation Operation

1. The optimal path has m switches, and the mutation frequency is set in the
simulation part.
2. Two natural numbers, n1 and n2, are generated at random (n2 < n1 < m).
3. A new path, Pn, is obtained by swapping the switches at positions n1 and n2 of
the optimal path, P0.
4. Calculate the fitness of P0 and Pn and choose the path with the lowest value as
the best. After the AC* operations, the best route is determined, and packets are sent
along it.
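The swap mutation in steps 2 and 3 can be sketched as follows (an illustrative Java sketch with our own names; switches are represented by integer identifiers):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the swap mutation of Procedure 2: exchange the switches at two
// randomly drawn positions n1 and n2 of the current best path P0.
public class SwapMutation {
    static List<Integer> mutate(List<Integer> path, Random rnd) {
        List<Integer> copy = new ArrayList<>(path);
        int n1 = rnd.nextInt(copy.size());
        int n2 = rnd.nextInt(copy.size());
        Collections.swap(copy, n1, n2);
        return copy;
    }

    public static void main(String[] args) {
        // Reproduces the example of Eqs. (1)-(2) when positions 4 and 7 are drawn:
        List<Integer> g1 = new ArrayList<>(List.of(2, 4, 7, 6, 5, 8, 9, 3));
        Collections.swap(g1, 4, 7);       // swap the '5' and the '3'
        System.out.println(g1);           // [2, 4, 7, 6, 3, 8, 9, 5]
    }
}
```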

3.4 Algorithm 2: Pseudo Code of AC* Mutation Operation

void mutation1() {
    Random rmd = new Random();
    // choose a random mutation point
    int mutationpt = rmd.nextInt(population.individuals[0].geneLength);
    if (fittest.genes[mutationpt] == 0) {
        fittest.genes[mutationpt] = 1;
    } else {
        fittest.genes[mutationpt] = 0;
    }
    mutationpt = rmd.nextInt(population.individuals[0].geneLength);
    if (secondFittest.genes[mutationpt] == 0) {
        secondFittest.genes[mutationpt] = 1;
    } else {
        secondFittest.genes[mutationpt] = 0;
    }
}

4 Experiments

To explain how AC* works, we compare it with various routing schemes. First, OSPF
was chosen because it is one of the most widely used link-state routing algorithms in data
center networks. Since AC* is based on SDN, and SDN is increasingly being extended
to data center networks, Floodlight is used as a representative of conventional
SDN. Additionally, since DCell, like fat-tree and BCube, is a standard topology-aware
topology, we use its topology and routing algorithm [18, 19]. The controller is used
in our tests to verify that AC* is feasible and performs well. The Quagga [17] routing
suite was used to create the OSPF networks, and the standard SDN networks were based
on Open vSwitch and Floodlight.

Fig. 5 Comparison of path computation

A DCell implementation based on SDN is used as a reference. In terms of
network bandwidth and fault tolerance, this implementation, as seen in [9], produces
results that are very close to those of the original DCell [9] experiment. In terms of topology
structure and number of nodes, DCell uses a different topology than fat-tree. DCell, like
fat-tree and BCube, is well known as a standard topology-aware routing algorithm.
Performance metrics are used to measure the efficiency of the algorithms.

4.1 Effectiveness of Path Computation

Figure 5 compares the path computation effectiveness of the AC* algorithm with that of the tradi-
tional OSPF algorithm. Fat-tree network topologies of different scales,
with k = 4, k = 8, k = 12, and k = 16, are used for the path computation
efficiency tests.

4.2 Network Convergence Time

The AC* algorithm's network convergence time is defined as the time from network
initialization until the first packet arrives at the destination node. In fat-tree networks with k
= 4 and k = 8, the convergence time of AC* is shorter than that of the standard link-state
protocol OSPF and the LLDP-based SDN network, as shown in Fig. 6.

Fig. 6 Comparison of network convergence time

In the experiment, AC* completes the first packet transmission in less
than one second, whereas OSPF and Floodlight take tens
of seconds to complete network synchronization and detection. DCell converges
in the same time as AC* in a small network but, like
Floodlight, takes much longer in a larger network.

4.3 End-To-End Delay

First packet delay is used as the estimation criterion in the end-to-end delay evaluation
for AC*, Floodlight, and DCell. The experiment uses three traffic models (one-
to-all, one-to-one, and all-to-all), which represent typical data center traffic
scenarios. AC* and Floodlight use the fat-tree topology with k = 4, which
means the numbers of flows in the various traffic models are 1, 15, and 240, respectively.
DCell uses a DCell network of 20 servers, with flow numbers of 1, 19, and 380, as
shown in Fig. 7.
After convergence, the network nodes send out the first packets of
the flows at the same time. The end-to-end delay is obtained as half of
the measured round-trip time (RTT).

Fig. 7 End-to-end delay comparison

5 Conclusion

This paper proposes AC*, a topology-aware routing algorithm for standard network topologies
based on software-defined networks. The algorithm gener-
ates and implements a routing scheme using SDN and TPDL technologies, with
highly efficient routing calculation and fault-handling mechanisms.
The AC* algorithm incorporates discovery, crossover, and mutation
operations to increase path search speed and capability. Compared with OSPF, Flood-
light, and DCell, the AC* algorithm computes routes faster, and its network
convergence period is shorter. The AC* algorithm is therefore considered to better leverage the
capacity of data center networks (DCNs) and to increase network performance.

References

1. R. Sahba, A brief study of software-defined networking for cloud computing, in 2018 World
Automation Congress (WAC) (IEEE, 2018), pp. 1–5
2. A.A. Bahashwan, M. Anbar, N. Abdullah, New architecture design of cloud computing using
software-defined networking and network function virtualization technology, in International
Conference of Reliable Information and Communication Technology (Springer, Cham, 2019),
pp. 705–713. 22 Sep 2019
3. S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J.
Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, A. Vahdat, B4: experience with a globally-
deployed software defined wan, in Proceedings of the ACM SIGCOMM 2013 Conference on
SIGCOMM,ser.SIGCOMM”13 (ACM, New York, 2013), pp. 3–14

4. D. Kreutz et al., Software-defined networking: A comprehensive survey. Proc. IEEE 103(1),
14–76, Jan 2015
5. M. Shin, K. Nam, H. Kim, Software-defined networking (SDN): A reference architecture and
open APIs. Int. Conf. ICT Convergence (ICTC) 2012, 360–361 (2012)
6. S. Schenker, The future of networking, and the past of protocols. Oct 2011. [Online]. Available:
http://www.youtube.com/watch?v=YHeyuD89n1Y
7. H. Kim, N. Feamster, Improving network management with software-defined networking.
Commun. Mag. IEEE 51(2), 114–119 (2013)
8. T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata,
H. Inoue, T. Hama, S. Shenker, Onix: a distributed control platform for large- scale production
networks, in Proceedings of the 9th USENIX Conference on Operating Systems Design and
Implementation, ser. OSDI”10 (USENIX Association, Berkeley, 2010), pp. 1–6
9. J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun.
ACM 51, 1(January 2008), 107–113 (2008)
10. C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu, BCube: A high
performance, server-centric network architecture for modular data centers, in Proceedings of
the ACM SIGCOMM 2009 Conference on Data Communication (2009), pp.63–74
11. C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, Dcell: a scalable and fault-tolerant network
structure for data centers, in Proceedings of the ACM SIGCOMM 2008 Conference on Data
Communication (2008), pp. 75–86
12. I. Gianluca, C. Chen-Nee, B. Supratik, C. Diot, Feasibility of IP restoration in a tier 1 backbone.
Netw. IEEE 13–19 (2004)
13. K.L. Narayanan, M. Caesar, M. Rangan, T. Anderson, S. Shenker, I. Stoica, Achieving
convergence-free routing using failure-carrying packets, in Proceedings of the 2007 Conference
on Applications, Technologies, Architectures, and Protocols for Computer Communications
(2007), pp. 241–252
14. A. Greenberg, J. Hamilton, D.A. Maltz, P. Patel, The cost of a cloud: Research problems in
data center networks. SIGCOMM Comput. Commun. Rev. 39(1), 68–73 (2009)
15. M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, A. Vahdat, Hedera: Dynamic flow
scheduling for data center networks, in Nsdi, vol. 10(8) (AODV, 2010), pp. 89–92
16. L. Davis, Applying adaptive algorithm to Epistatic Domains, in Proceedings of the International
Joint Conference on Artificial Intelligence (Los Angeles, 1985), 18–23 Aug 1985
17. P. Zeng, Y. Shen, Z. Qiu, Z. Qiu, M. Guo, SRP: A routing protocol for data center networks,
in Proceeding of the 16th Asia-Pacific Network Operations and Management Symposium
(Hsinchu, 2014) Sep 2014, pp. [Online]. Available: http://ieeexplore.ieee.org/document/699
6564/
18. Mininet, An instant virtual network on your laptop (or Other PC). Accessed: 16 Nov 2019.
[Online]. Available: http://mininet.org/
19. Quagga Routing Suite. Accessed: 16 Nov 2019. [Online]. Available https://www.nongnu.org/
On Computational Complexity
of Transfer Learning Approaches
in Facial Analysis

Alexandra-Ștefania Moloiu, Grigore Albeanu,
and Florin Popențiu-Vlădicescu

Abstract Many practical experiments have demonstrated high performance rates due not only to
recent hardware technological advancements, but also to newly proposed preprocessing pipelines
inspired by domain-specific techniques, for high-speed image
recognition systems, image analysis applications, tracking
systems of increased quality, and decision-based applica-
tions close to human performance. The performance has been obtained, mainly, through learning from previous
experiments, integrating multi-step approaches, parallel processing, and transfer
learning. In this paper, we describe some computational complexity results based
on practical projects in facial image analysis and recognition.

Keywords Computational complexity · Facial analysis · Facial recognition ·


Transfer learning

1 Introduction

Recently, high performance rates in image analysis have been reported. Not only
do hardware technological advancements make a large contribution, but so do the newly
proposed preprocessing pipelines inspired by domain-specific techniques. In
this paper, we describe some computational complexity results both in theory and in
applications. Firstly, the framework of transfer learning is discussed, and computa-
tional complexity models in machine learning (ML) are examined. The most popular
convolutional neural networks (CNNs) are discussed from the point of view of static
complexity, and the usage of transfer learning is illustrated in the context of solving
A.-Ș. Moloiu
University ‘Politehnica’ of Bucharest, Bucharest, Romania
G. Albeanu
‘Spiru Haret’ University, Bucharest, Romania
e-mail: g.albeanu.mi@spiruharet.ro
F. Popențiu-Vlădicescu (B)
University ‘Politehnica’ of Bucharest and Academy of Romanian Scientists, Bucharest, Romania
e-mail: popentiu@imm.dtu.dk

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 349
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_27

facial analysis tasks. Some concepts and notations are well known in the literature [1–4],
and only short remarks will be given.

2 The Transfer Learning Theoretical Framework

Firstly, we establish the theoretical framework on transfer learning [5–7]. Let D be
a domain given by a pair (X, P), with X = {x[1], x[2], …, x[n]} a non-empty
set of features (n ≥ 1) in a suitable representation, and P a probability
distribution p[i] = p[x[i]], i = 1, 2, …, n, where the p[i] are non-negative, with p[1]
+ p[2] + … + p[n] = 1. Solving an instance problem on D requires at least one
task specification. Let Y be a non-empty set of labels, and let a task specification, as
requirements, be given by a pair (Y, F), where F is a mapping from X to Y discovered
from a training set of pairs (x[i], y[i]), i = 1, 2, …, n. F is called a hypothesis and
corresponds to a learning model obtained through minimization of a ‘loss function’
l, with performance ε. When F is a nonlinear model corresponding to a deep neural
network, the process is called deep learning. Sometimes, F(x[i]) = {P(y[k] | x[i]),
for all y[k], k = 1, 2, …, |Y|}, mainly in classification tasks.
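The minimization of the loss function mentioned above can be written explicitly as empirical risk minimization; the following restatement is ours (the hypothesis space H is an added assumption, not named in the text):

```latex
F \;=\; \arg\min_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} l\big(h(x[i]),\, y[i]\big),
\qquad \text{with } \frac{1}{n} \sum_{i=1}^{n} l\big(F(x[i]),\, y[i]\big) \le \varepsilon .
```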
Transfer learning is based on two structures: A_s = ((X_s, P_s), (Y_s, P_s(Y_s | X_s)), (F_s, l_s, ε_s)) for the source and A_t = ((X_t, P_t), (Y_t, P_t(Y_t | X_t)), (F_t, l_t, ε_t)) for the target. The aim of a transfer learning procedure is to improve the mapping F_t from X_t to Y_t based on information about A_s and A_t, assuming different instances for domains and tasks, by producing an effective model for a new task that uses little training data, leveraging knowledge from a different but connected source field to determine the correct class for any new instance. For instance, a deep neural network pre-trained to recognize various types of cars in images can be fine-tuned for another task, say detecting people's emotions [8–10] or faces [11], de-abstracting art [12], or industrial recognition [13].
A transfer learning strategy is necessary when the size of X_t is too small to identify a hypothesis with good performance. According to [5], the domain adaptation process tries to change the source domain in an attempt to bring the probability distribution of the source closer to that of the target. The transfer is homogeneous when X_s = X_t, otherwise it is heterogeneous [6]. Most heterogeneous transfer learning solutions focus on fitting the input spaces of the source and target domains while assuming the same domain distributions, whereas the majority of homogeneous transfer learning solutions work on correcting both the probability distribution of the features and the conditional probability related to the labels.
The following transfer learning approaches are most used [5]:
1. Transfer learning through objects instances (to correct probability distribution
by reweighting);
2. Transfer learning through feature transformations (symmetric or asymmetric
way [14, 15]);
3. Transfer learning directed by some relationships between source and target;
On Computational Complexity of Transfer Learning … 351

4. Transfer knowledge through shared parameters;


5. Mixed-based (instance and parameter) transfer learning.
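The first approach above, instance-based transfer, can be sketched as importance reweighting: each source sample x receives the weight w(x) = p_target(x)/p_source(x), so that expectations under the reweighted source distribution match the target. A minimal illustration with hypothetical discrete distributions:

```python
def importance_weights(xs, p_source, p_target):
    # instance-based transfer: reweight each source sample by
    # w(x) = p_target(x) / p_source(x)
    return [p_target(x) / p_source(x) for x in xs]

p_s = {0: 0.8, 1: 0.2}.get   # hypothetical source distribution over a feature
p_t = {0: 0.5, 1: 0.5}.get   # hypothetical target distribution
weights = importance_weights([0, 1], p_s, p_t)
```

Averaging any function of x over the source with these weights reproduces its target expectation; in particular, the weights themselves average to 1 under the source distribution.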
The task involved in a transfer learning process is based on a learning algorithm belonging to one of the following classes [16, 17]: unsupervised learning (clustering, i.e., discovery of data clusters, and dimension reduction, i.e., discovery of latent factors), supervised learning (through generative or discriminative models), semi-supervised learning, reinforcement learning, and Turing learning (learning by observing and doing). Deep learning applies specialized multi-layer neural networks which are trained and validated with data from a specific domain and involves both a reasonable number of preprocessing steps and hidden layers between the input features and the output results.
The similarity of the target and source domains, as well as the amount of data in the target domain, are influential indicators for choosing a transfer method. The transfer scenario can be differentiated by the degree of data similarity and the size of the dataset. When the fields are significantly different, but still related, a commonly used strategy to accommodate that difference is to minimize some inconsistency criterion between the empirical distributions of the source and target data.
In the general transfer learning framework, the target model is trained for a task other than the source model's, and it is possible that some intermediate features of the source model F_s are not useful for discovering the target model F_t [7]. In this case, a utility-based approach can be used. Moreover, a learnable parameter can be used to shift from source to target in order to increase the performance of the transfer.

3 Complexity Models

The learning objective in ML is to find the best hypothesis (if possible), otherwise only a probably approximately correct learning hypothesis [4]. In order to measure the learning capability, the empirical risk minimization (ERM) principle is used to estimate a hypothesis risk function. Depending on the nature of the labeling set (discrete or continuous), the risk measure is given by a summation or, respectively, an integration procedure. The so-called loss function, or cost function, used during risk minimization can be:
• The mean squared error (MSE: take the differences between the obtained predictions and the known values, square them, and average over the whole dataset);
• The likelihood function (the product, over input samples, of the probability the model assigns to the observed label; its logarithm turns the product into a sum);
• The cross-entropy loss (a straightforward modification of the likelihood function using the logarithm of the predicted probability).
Other loss functions can be used, such as the Huber loss, the mean absolute error loss, the Kullback–Leibler divergence loss [18], MSE with added penalty terms [1], and ArcFace [19].

Using the likelihood function to estimate the model parameters (weights) of neural networks implies that the model improves as the size of the training dataset increases. Current practice recommends MSE for regression problems and cross-entropy for both binary and multi-class classification problems. A natural choice for the loss function (when false positives and false negatives have similar costs) is the 0–1 loss, which is 0 if the predicted class equals the true class and 1 otherwise.
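The three losses above can be sketched in a few lines of plain Python (a didactic version, not the authors' implementation):

```python
import math

def mse(y_true, y_pred):
    # mean squared error over the whole dataset
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, p_pred):
    # binary cross-entropy: -log of the probability assigned to the true class
    return -sum(math.log(p if t == 1 else 1.0 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

def zero_one(y_true, y_pred):
    # 0-1 loss: 0 when the predicted class is correct, 1 otherwise
    return sum(0 if t == p else 1 for t, p in zip(y_true, y_pred)) / len(y_true)
```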
The computational complexity of learning (CCL) deals with the evaluation of learning algorithms that estimate a hypothesis (a model and a set of computed parameters) under a computational paradigm: a learning algorithm receives a training set and outputs a hypothesis, which is a program/model. According to Mitchell [1], CCL can answer questions about the sample complexity (the minimum size of the training pool for achieving convergence with high probability), the computational complexity (the training time or the effort required to find the hypothesis), the expressiveness (what kinds of models are best identified), and the mistake bound (the number of misclassified patterns before convergence is achieved). Depending on the family of hypotheses, such answers can be obtained quickly, while for learning algorithms in facial analysis it is difficult to provide such indicators.
In the statistical learning view, the basic assumption is that independent and identically distributed pairs are used both for training and for testing. The generalization error, which depends on the sample size and the model complexity, is measured by the gap between the training error and the test error. If a bound on the gap can be given for the source domain, the principal problem is to control the target error. Some basic estimators are outlined below, after describing some concepts and notations. The probably approximately correct (PAC) style of algorithm analysis is used.
Let a learning problem be given by a triple (Z, H, c), with input space Z, hypothesis set H, and cost function c. Let f : (0, 1) × (0, 1) → N. A learning algorithm A has computational complexity O(f) if, for all ε and δ in (0, 1), the complexity of A is below Kf(ε, δ), where K is a positive constant, and the output h_A of A is approximately correct, that is, h_A is the best hypothesis with probability 1 − δ: Prob(Error(h_A) ≤ ε) ≥ 1 − δ [3, 4, 20].
Bounds for the generalization error, the sample complexity, or the computational complexity have been identified for various learning problems, taking into account the size of Z, the size of H, ε, δ, and the Vapnik–Chervonenkis (VC) dimension [1, 2, 21]. Taking into account the size of the hypothesis space, the number of training samples necessary to achieve the best (ε, δ) hypothesis can be overestimated by (log(|H|) + log(1/δ))/ε.
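The finite-hypothesis-space bound above can be evaluated directly; the sketch below uses natural logarithms and hypothetical values for |H|, ε, and δ:

```python
import math

def sample_bound(h_size, epsilon, delta):
    # overestimate of the samples needed for the best (epsilon, delta)
    # hypothesis from a finite space H: (ln|H| + ln(1/delta)) / epsilon
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

m = sample_bound(10 ** 6, epsilon=0.01, delta=0.05)   # hypothetical |H|
```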
In the figures that follow, we show the evolution of the required number of training samples for statistical confidence levels of 95% (Fig. 1), 97% (Fig. 2), and 99% (Fig. 3), when the approximation error is 0.01, as a function of the size of the hypothesis space.
The figures above explain the necessity of large training samples relative to the hypothesis space of any modeling technique. Therefore, the modeling approach can affect the performance of the analysis framework.
Fig. 1 Acceptance level = 95%

Fig. 2 Acceptance level = 97%

Fig. 3 Acceptance level = 99%

The VC dimension of a learning model is the size of the largest subset of points from the input space Z that can be shattered by the model. In many cases, the VC dimension is related to the number of parameters of the learning model. The case of piecewise linear neural networks is described in [22]. Consider a multilayer perceptron (MLP) with n inputs and two layers: a hidden layer with m processing units and an output layer with one unit (m + 1 neurons in total). The number of parameters (weights) is (n + 1)m + m + 1, taken as the VC dimension. If the training error is ε, a bound for the sample complexity is μ ln(μ), where μ = 32((n + 1)m + m − 1)/ε.
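The weight count and the resulting sample-complexity bound for such an MLP can be computed directly (a sketch following the formulas above, with a hypothetical network size):

```python
import math

def mlp_vc_dimension(n, m):
    # weights of a two-layer MLP: (n + 1)*m into the hidden layer (with
    # biases) plus m + 1 into the single output unit, taken as the VC dim
    return (n + 1) * m + m + 1

def mlp_sample_bound(n, m, eps):
    # the bound mu * ln(mu) with mu = 32*((n + 1)*m + m - 1)/eps, as above
    mu = 32 * ((n + 1) * m + m - 1) / eps
    return mu * math.log(mu)

d = mlp_vc_dimension(10, 5)   # hypothetical MLP: 10 inputs, 5 hidden units
```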

If the hypothesis space H has VC dimension d, the sample size is n, and δ is given, then an upper bound on the generalization error is sqrt((d log(n) + log(2/δ))/n); that is, n should be increased enough to decrease the generalization error, which behaves like log(n)/n. Also, according to [1], a good bound on the required number of samples is (4 log(2/δ) + 8d log(13/ε))/ε. However, below (d − 1)/(32ε) samples, there is no chance of a small error. From the above relations it is clear that training requires a large amount of input data, because both ε and δ are small positive numbers near zero. In the figures that follow, we show the evolution of the number of training samples required for neural network training for statistical confidence levels of 95% (Fig. 4), 97% (Fig. 5), and 99% (Fig. 6), when the approximation error is 0.01, as a function of the size of the hypothesis space.
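These bounds are straightforward to evaluate numerically. In the sketch below, the generalization-gap bound uses natural logarithms as written above, while the bound from [1] is computed with base-2 logarithms (the usual convention in that source); the concrete d, n, ε, and δ values are hypothetical:

```python
import math

def generalization_gap(d, n, delta):
    # upper bound sqrt((d*log(n) + log(2/delta)) / n) on the train-test gap
    return math.sqrt((d * math.log(n) + math.log(2.0 / delta)) / n)

def mitchell_sample_bound(d, eps, delta):
    # (4*log2(2/delta) + 8*d*log2(13/eps)) / eps, the bound from [1]
    return (4 * math.log2(2.0 / delta) + 8 * d * math.log2(13.0 / eps)) / eps
```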
The above figures show the evolution of the number of training samples required to reach a given performance level, against n and m. This analysis is valuable for researchers who wish to estimate, through practical strategies, the number of inputs needed to obtain a high performance level. Figures 4, 5, and 6 describe a logarithmic evolution depending on the number of inputs and the number of internal neurons. This analysis is valid only for multi-layer neural networks.
Other measures not considered here, such as the Kolmogorov complexity [23] and the Rademacher complexity [24], can be used when neural networks serve as learning models.
In the following, the main interest goes to neural network (NN) learning models, taking other aspects into account. There are many types of NNs, and some tricks to accelerate NN training [25] include fast-to-compute activation functions (e.g., ReLU), over-specification, and regularization. NN learnability depends on the class of hypotheses, the number of layers, the activation function, and the number of neurons. In this respect, positive and negative results are given in [25].

Fig. 4 Acceptance level = 95%

Fig. 5 Acceptance level = 97%

Fig. 6 Acceptance level = 99%

According to Hu et al. [26], the complexity of neural models is sensitive to the values of the parameters (weights and biases). There are many complexity measures associated with NNs. The model structure influences NN complexity through the number of layers (the depth of the network, measuring depth efficiency), the types of layers (convolutional (ConV), fully connected (FC) hidden), and the layer width. Activation functions (piecewise linear ones like ReLU or LeakyReLU, differentiable curves like the sigmoid or the hyperbolic tangent) play an important role in NN performance (generalization error, convergence speed). The complexity measures proposed for deep neural networks (DNNs, sequences of many fully connected hidden layers) with curve activation functions are based on a regularization method. Hu et al. [26] have also obtained improved performance in overfitting detection.
The learnability of DNNs was estimated by Zhang et al. [21] in terms of the number of weights appearing in a polynomial expression describing the time complexity. Bounds on both the in-sample error and the generalization error can be assured in terms of (ε, δ) after a number of iterations of the BoostNet algorithm that depends on the sample size.
For some applications, sparsely connected layers (with many weights equal to zero) can be used if the accuracy is maintained. Such a DNN architecture, with a small number of weights to be found by the loss minimization algorithm (e.g., backpropagation), may learn in a short time. Due to the large static complexity, it follows from the above theoretical considerations that training CNN models requires not only computational resources but also various tricks to reduce the computational complexity, such as convolution decomposition, using fewer channels, controlling the number of parameters, using fast algorithms, and using high-speed hardware.

4 Facial Analysis Using Transfer Learning

Mainly in image understanding (analysis, recognition), the CNN is the commonly used form of DNN. These networks are composed of a large number of ConV layers of various types. A CNN architecture is built from learnable filters, each producing a feature map; the output is often produced by a dense/FC layer, together with further FC layers and optional layers (nonlinearity layers, normalization layers, pooling layers), as summarized by Sze et al. [27]. As LeCun et al. [28] describe, the role of the convolutional layer is to detect similar local features in the previous layer, while the role of the pooling layer is to merge semantically similar features. To ensure nonlinearity, activation functions such as ReLU and other ReLU-based variants are used. Both batch normalization and local response normalization can be applied.
A short discussion on the static complexity of four popular CNNs follows. More
details can be found in [27] and cited references.
LeNet-1 was the first CNN having an input layer (an image of 28 × 28), an
output layer (ten units corresponding to digits 0–9), and four hidden layers: H1 (four
features maps), H2 (four sub-sampling maps), H3 (12 convolution maps 8 × 8), and
H4 (12 sub-sampling maps 4 × 4). Due to the weight sharing, the total number of
parameters is 2578, while the number of processing units is 4635. On the MNIST
dataset, the LeNet-1 has performed with 1.7% error.
A more recent version is LeNet-5, with two ConV layers and two FC layers, a total of 60 K weights, and 341 K multiply-and-accumulate (MAC) operations per image [29]. LeNet-5 has achieved a 0.9% error when tested on MNIST.
The second well-known model is AlexNet, with five ConV layers, three max pooling layers, and three FC layers. It receives as input color images of size 227 × 227 cropped from 256 × 256 images, and the output layer has 1000 classes. The number of weights is 61 M, while the number of MACs is 724 M. Several techniques were used in AlexNet: ReLU nonlinearity, local response normalization, and splitting the weights into groups. An adapted AlexNet was used by Cîrlescu et al. [30] for emotion recognition (eight output classes).
The third important CNN is VGG-16 (there is also a VGG-19). VGG-16 has 13 ConV layers and three FC layers, requiring 138 M weights to be learnt and 15.5 G MACs.
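The weight and MAC counts quoted above follow from a simple per-layer formula. A sketch (not from the paper), applied to the standard first ConV layer of VGG-16:

```python
def conv_layer_cost(h_out, w_out, k, c_in, c_out):
    # static complexity of one ConV layer: weights (k*k*c_in per filter plus
    # one bias, c_out filters) and MAC operations for a single image
    weights = (k * k * c_in + 1) * c_out
    macs = h_out * w_out * k * k * c_in * c_out
    return weights, macs

# first ConV layer of VGG-16: 224 x 224 output, 3x3 kernels, 3 -> 64 channels
w16, m16 = conv_layer_cost(224, 224, 3, 3, 64)
```

Summing such per-layer costs over all layers yields the totals quoted for the full networks.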
Another CNN to be mentioned is ResNet-50, consisting of 53 ConV layers and one FC layer, with 25.5 M weights and 3.9 G total MACs. Consider also MobileNet, a class of small and less compute-intensive architectures for vision applications based on depthwise separable convolutions [31]. MobileNetV1 lightweight (486,784 parameters) and MobileNetV1 deep (1,088,576 parameters), architectures constructed from MobileNetV1 through an ablation process, were used by Issa et al. [32] for facial recognition (identification) and verification. The architectures were tested on the MSCeleb1M database in TFRecords format.
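The parameter savings behind depthwise separable convolutions can be illustrated by comparing the weight counts of one layer (a sketch with hypothetical channel sizes):

```python
def standard_conv_params(k, c_in, c_out):
    # weights of a standard k x k convolution (biases omitted)
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # one depthwise k x k filter per input channel, then a 1 x 1 pointwise mix
    return k * k * c_in + c_in * c_out

full = standard_conv_params(3, 64, 128)         # hypothetical layer shapes
small = depthwise_separable_params(3, 64, 128)  # roughly 8x fewer weights
```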
Considering the computational effort and the top-5 error indicators for ResNet-50, AlexNet, and VGG-16, a correlation of −0.99455 was found. This is consistent with practical experience: a high processing effort is required to obtain a reduced error.
Facial analysis with ML techniques requires solving tasks such as face detection, facial feature detection, and facial recognition. After detection, face alignment is an important preprocessing task consisting of geometric transformations such as translation, scaling, and rotation. Facial feature detection involves localization tasks for the eyes, mouth, and nose. Not only face recognition (for verification tasks with high accuracy) but also gender recognition and the affective state of the person (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) can be established with some accuracy [30].
In order to describe the performance of a face verification system, genuine and impostor pairs are considered and used for testing [32]. A genuine pair consists of two samples that belong to the same user, while an impostor pair is constructed from samples acquired from different users. The following four direct indicators belong to the confusion matrix: (1) true positives/true accepts (TP), the number of authorized individuals who claim access to the system and are classified correctly; (2) true negatives/true rejects (TN), unauthorized persons trying to impersonate a user and classified correctly; (3) false positives/false accepts (FP), unauthorized individuals who claim access to the system and are classified incorrectly; and (4) false negatives/false rejects (FN), persons who have the right of access but are rejected by the system. These are used to compute the false acceptance rate (FAR) and the false rejection rate (FRR). FAR is the probability that the biometric system authorizes an unauthorized person. FRR is the probability that the biometric system denies access to an authorized person. The work of Issa et al. [32] demonstrated good performance in user verification by discriminating between genuine and impostor pairs.
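From the four counts above, FAR and FRR follow directly (a sketch with hypothetical confusion-matrix counts):

```python
def far(fp, tn):
    # false acceptance rate: impostor attempts that were wrongly accepted
    return fp / (fp + tn)

def frr(fn, tp):
    # false rejection rate: genuine attempts that were wrongly rejected
    return fn / (fn + tp)

# hypothetical counts from a verification experiment
far_value = far(fp=5, tn=95)
frr_value = frr(fn=2, tp=98)
```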

Fig. 7 ROC AUC analysis

In order to test the generalization capabilities of an identification/recognition system on new classes, several tests have been performed using MobileNet-like architectures. The datasets used for training are subsets of MSCeleb1M, and the test set containing new identities is Labeled Faces in the Wild (LFW) [33].
The outputs of the last hidden layer of the identification networks are used as discriminating features in a verification system. The binary genuine/impostor classification is then computed using a similarity score.
In Fig. 7, it can be observed that increasing the number of identities used for training leads to better discrimination performance of the verification system, as measured by the area under the receiver operating characteristic curve (ROC AUC). Almost perfect recognition of new instances of the same identities is achieved with any training set, and despite the fact that in the first tests only 15, 100, and 1000 identities are used, respectively, the system still discriminates new identities well. This is attained by training with an adaptive ArcFace loss function and comparing new users via cosine similarity.
Figure 8 shows the genuine and impostor score distributions. The discriminating feature vectors are computed using the neural network trained on only 1000 classes, and the comparison is based on cosine similarity. The scores for genuine and impostor pairs are not perfectly separated, but a distinction between them can be observed. When the whole dataset is used, the overlap disappears as the distributions move further apart, confirming the generalization capability to new identities once the number of trained classes is large enough.
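The genuine/impostor scoring described above can be sketched as cosine similarity between feature vectors, with ROC AUC computed as the probability that a random genuine pair outscores a random impostor pair; the vectors and scores below are hypothetical, not taken from the experiments:

```python
import math

def cosine_similarity(u, v):
    # similarity score between two feature vectors (last hidden layer outputs)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def roc_auc(genuine, impostor):
    # probability that a random genuine pair outscores a random impostor pair
    wins = sum(1.0 if g > i else 0.5 if g == i else 0.0
               for g in genuine for i in impostor)
    return wins / (len(genuine) * len(impostor))

auc = roc_auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])   # hypothetical scores
```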
Fig. 8 Scores distribution analysis

As mentioned in the previous section, CNNs can be designed, trained, and tested to solve the mentioned tasks with a given level of accuracy. However, in order to speed up the process, both pre-trained networks and some transfer learning procedures can be used:
1. Freezing a pre-trained network for image understanding and adding two FC layers, the last one responsible for binary classification (gender recognition) or for multi-class discrimination (eight classes for emotion recognition, or n + 1 classes when n different image prototypes are available);
2. Adapting the loss function, e.g., to ArcFace;
3. Preparing auxiliary data and augmenting the dataset (adding a mustache, adding glasses, small zooming, small shifting, small rotating);
4. Oversampling small classes and under-sampling large classes.
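Point 4 above can be sketched as a simple resampling routine (illustrative only; the class sizes used in the example are hypothetical):

```python
import random

def rebalance(samples, labels, target_size, seed=0):
    # oversample small classes and under-sample large ones so that every
    # class contributes exactly target_size examples
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_size:
            chosen = rng.sample(xs, target_size)                    # under-sample
        else:
            chosen = xs + rng.choices(xs, k=target_size - len(xs))  # oversample
        out_x.extend(chosen)
        out_y.extend([y] * target_size)
    return out_x, out_y

# hypothetical imbalanced toy set: 8 samples of class 0, 2 of class 1
xs, ys = rebalance(list(range(10)), [0] * 8 + [1] * 2, target_size=4)
```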
Experiments were carried out using in-house applications written in Python with the TensorFlow library and GPU hardware. The following qualitative aspects have to be mentioned:
– The use of augmented data improved the recognition systems, which were able to generalize much better;
– The modified architectures are faster, consume less memory, and show performance comparable to state-of-the-art models.

5 Concluding Remarks

The transfer learning approaches used in machine learning projects have proved efficient for short-time implementation of new systems. In order to guarantee a performance rate, it is necessary to estimate the size of the training data both in the initial phase and in the transfer-based phase. In this paper, an analysis inspired by the PAC paradigm was conducted for image analysis and recognition tasks. The use of augmented data improved the recognition systems, which were able to generalize. Moreover, much better results were obtained through modified architectures inspired by transfer learning approaches.

References

1. T.M. Mitchell, Machine Learning (McGraw-Hill, Inc., New York, 1997), pp. 870–877
2. S. Shalev-Shwartz, S. Ben-David, Understanding Machine learning: From Theory to Algo-
rithms (Cambridge University Press, Cambridge, 2014). https://doi.org/10.1017/CBO978110
7298019
3. A. Engel, Complexity of learning in artificial neural networks. Theoret. Comput. Sci. 265,
285–306 (2001)
4. R. Gupta, T. Roughgarden, A PAC approach to application-specific algorithm selection. SIAM
J. Comput. 46(3), 992–1017 (2017)
5. K. Weiss, T.M. Khoshgoftaar, D. Wang, A survey of transfer learning. J. Big Data 3, 9 (2016).
https://doi.org/10.1186/s40537-016-0043-6
6. O. Day, T.M. Khoshgoftaar, A survey on heterogeneous transfer learning. J Big Data 4, 29
(2017). https://doi.org/10.1186/s40537-017-0089-0
7. Y. Jang, H. Lee, S.J. Hwang, J. Shin, Learning what and where to transfer, In Proceedings of
the 36th International Conference on Machine Learning. (PMLR 97, 2019), pp. 3030–3039
8. H.-W. Ng, V.D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition
on small datasets using transfer learning, In Proceedings of the International Conference on
Multimodal Interaction. (ICMI, ACM, 2015), pp. 443–449. https://doi.org/10.1145/2818346.
2830593
9. C. Florea, L. Florea, C. Vertan, M. Badea, A. Racoviteanu, Annealed label transfer for face
expression recognition. British Mach. Vis. Conf. (BMVC), Art.Id 321, 1–12 (2019). https://
bmvc2019.org/wp-content/uploads/papers/0321-paper.pdf
10. K. Feng, T. Chaspari, A review of generalizable transfer learning in automatic emotion
recognition. Front. Comput. Sci. (2020). https://doi.org/10.3389/fcomp.2020.00009
11. J. Luttrell, Z. Zhou, Y. Zhang, C. Zhang, P. Gong, B. Yang, R. Li, A deep transfer learning
approach to fine-tuning facial recognition models, in Proceedings of the 13th IEEE Conference
on Industrial Electronics and Applications (2018), pp. 2671–2676. https://doi.org/10.1109/
ICIEA.2018.8398162
12. M. Badea, C. Florea, L. Florea, C. Vertan, Can we teach computers to understand art? Domain
adaptation for enhancing deep networks capacity to de-abstract art. Image Vis. Comput. 77,
21–32 (2018). https://doi.org/10.1016/j.imavis.2018.06.009
13. B. Maschler, S. Kamm, M. Weyrich, Deep industrial transfer learning at runtime for image
recognition. Automatisierungstechnik 69(3), 211–220 (2021). https://doi.org/10.1515/auto-
2020-0119
14. W.M. Kouw, L.J.P. van der Maaten, J.H. Krijthe, M. Loog, Feature-level domain adaptation.
JMLR 17(171), 1–32 (2016)
15. J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural
networks? NIPS’14, in Proceedings of the 27th International Conference on Neural Information
Processing Systems, vol. 2 (2014), pp. 3320–3328
16. H.-W. Lee, N. Kim, J.H. Lee, Deep neural network self-training based on unsupervised learning
and dropout. Int. J. Fuzzy Logic Intell. Syst. 17(1), 1–9 (2017). https://doi.org/10.5391/IJFIS.
2017.17.1.1

17. X. Li, Q. Sun, Y. Liu, S. Zheng, Q. Zhou, T.-S. Chua, B. Schiele, Learning to self-train for semi-
supervised Few-Shot classification, in Advances in Neural Information Processing Systems 32:
Annual Conference on Neural Information Processing Systems. (NeurIPS, 2019), pp. 10276–
10286
18. A. Achille, G. Paolini, G. Mbeng, S. Soatto, The information complexity of learning tasks,
their structure and their distance. Inf. Infer. J. IMA 10(1), 51–72 (2021). https://doi.org/10.
1093/imaiai/iaaa033
19. J. Deng, J. Guo, N. Xue, S. Zafeiriou, ArcFace: Additive angular margin loss for deep face
recognition, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019),
pp 4685–4694. https://doi.org/10.1109/CVPR.2019.00482
20. J. Pan, Review of metric learning with transfer learning. AIP Conf. Proc. 1864, 1–9 (2017),
Art.Id 020040. https://doi.org/10.1063/1.4992857
21. Y. Zhang, J. Lee, M. Wainwright, M.I. Jordan, On the learnability of fully-connected neural
networks, in Proceedings of the 20th International Conference on Artificial Intelligence and
Statistics, (PMLR 54, 2017), pp. 83–91. http://proceedings.mlr.press/v54/zhang17a/zhang17a.
pdf
22. P.L. Bartlett, N. Harvey, C. Liaw, A. Mehrabian, Nearly-tight VC-dimension and pseudo dimen-
sion bounds for piecewise linear neural networks. JMLR 20(63), 1–17 (2019). https://jmlr.org/
papers/v20/17-612.html
23. J. Schmidhuber, Discovering neural nets with low Kolmogorov complexity and high general-
ization capability. Neural Netw. 10(5), 857–873 (1997)
24. W. Gao, Z.-H. Zhou, Dropout Rademacher complexity of deep neural networks. Sci. China
Inf. Sci. 59(7), 1–12 (2016), Art.Id 072104. https://doi.org/10.1007/s11432-015-5470-z
25. R. Livni, S. Shalev-Shwartz, O. Shamir, On the computational efficiency of training neural
networks, in Proceedings of the 27th International Conference on Neural Information
Processing Systems, vol. 1 (NIPS, 2014), pp. 855–863
26. X. Hu, W. Liu, J. Bian, J. Pei, Measuring model complexity of neural networks with curve
activation functions, in Proceedings of the 26th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining KDD (2020). https://dl.acm.org/doi/10.1145/3394486.3403203
27. V. Sze, Y.-H. Chen, T.-J. Yang, J.S. Emer, Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017). https://doi.org/10.1109/JPROC.
2017.2761740
28. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015). https://doi.org/
10.1038/nature14539
29. A.-S. Moloiu, Automatic Character Recognition. Licence Thesis in Informatics (under the
supervision of Dana-Mihaela Vilcu), “Spiru Haret” University, Bucharest, 2014
30. M.G. Cîrlescu, A.S. Moloiu, G. Albeanu, Improving Facial Analysis using Deep Learning.
Technical Report, “Spiru Haret” University, Bucharest, 2019
31. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H.
Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications.
https://arxiv.org/abs/1704.04861 (2017)
32. B. Issa, A.-S. Moloiu, G. Albeanu, Developing a robust facial verification system using deep
neural networks. Technical Report, “Spiru Haret” University, Bucharest, 2019
33. LFW, http://vis-www.cs.umass.edu/lfw/ (last accessed 30 July 2021)
Adaptive Classifier Using Extreme
Learning Machine for Classifying
Twitter Data Streams

M. Arun Manicka Raja, S. Swamynathan, and T. Sumitha

Abstract As online business advances exponentially over time, analyzing data on the fly is the need of the hour and is receiving high attention from researchers. The proliferation of the variety of data generated by different kinds of devices requires practically efficient algorithms for data stream analysis. However, most existing data mining algorithms, although widely used for analyzing streams of data as well, are better suited to static data mining and are not as efficient for data streams. To overcome this limitation, this work proposes an adaptive classifier (algorithm) for analyzing data streams instantly. The proposed adaptive classifier uses an extreme learning machine together with a rule database based on product attributes to identify the popularity of a product.

Keywords Twitter data streams · ELM classification · Rule-based classifier ·


Adaptive classifier

1 Introduction

Social media is prominently used as a medium for sharing information instantly. Views and opinions about products and services are also shared among Internet users in social media applications. Most companies analyze these social media data to evaluate their products or services. People trust the information

M. A. M. Raja (B)
Department of Computer Science and Engineering, RMK College of Engineering and
Technology, Chennai 601206, India
e-mail: arunmcse@rmkcet.ac.in
S. Swamynathan
Department of Information Science and Technology, College of Engineering Guindy, Anna
University, Chennai 600025, India
e-mail: swamyns@annauniv.edu
T. Sumitha
Department of Computer Science and Engineering, RMK Engineering College, Chennai 601206,
India
e-mail: sat.cse@rmkec.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 363
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_28
364 M. A. M. Raja et al.

about the products or services shared by other users on social media. Many social
media applications provide different services for knowing the trend about events or
popular personalities or products or services. It has become routine that people look
for trending topics or products. Twitter is one of the familiar and powerful social
media applications for sharing messages instantly with other Twitter users. Twitter
allows users to share information on many topics and also provides trending topics
to its users.
The main contribution of the work in this paper is an adaptive classifier using an extreme learning machine (ELM). The ELM is a neural-network-based classifier that uses a number of hidden nodes to learn from data in a short time. In this work, an adaptive classifier using an ELM provides a faster training phase for data streams. The performance of the ELM is further improved using a decision rule database.
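A minimal pure-Python sketch of the ELM idea (not the authors' implementation): the input-to-hidden weights are random and fixed, and only the output weights are solved in closed form, which is what makes the training phase fast. The ridge term and the XOR toy data are assumptions added for the illustration:

```python
import math
import random

def solve(A, c):
    # plain Gaussian elimination with partial pivoting for A x = c
    n = len(A)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def train_elm(X, y, hidden=30, reg=1e-4, seed=0):
    # random, fixed input-to-hidden weights: the defining trait of an ELM
    rng = random.Random(seed)
    n_in = len(X[0])
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b = [rng.uniform(-1, 1) for _ in range(hidden)]

    def hidden_out(x):
        # sigmoid activations of the random hidden layer
        return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + bi)))
                for row, bi in zip(W, b)]

    H = [hidden_out(x) for x in X]
    # only the output weights beta are learned, in closed form, from the
    # ridge-regularized normal equations (H^T H + reg*I) beta = H^T y
    A = [[sum(Hr[i] * Hr[j] for Hr in H) + (reg if i == j else 0.0)
          for j in range(hidden)] for i in range(hidden)]
    c = [sum(Hr[i] * yr for Hr, yr in zip(H, y)) for i in range(hidden)]
    beta = solve(A, c)
    return lambda x: sum(h * bt for h, bt in zip(hidden_out(x), beta))

# toy usage: XOR, a classification problem that genuinely needs the hidden layer
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
predict = train_elm(X, y)
labels = [1 if predict(x) > 0.5 else 0 for x in X]
```

Because no iterative gradient descent is involved, training reduces to one linear solve, which is why ELMs suit stream settings where the classifier must be retrained quickly.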
In Sect. 2, related work on Twitter trend analysis with different learning models and classification algorithms is discussed. In Sect. 3, the system architecture with its various functional components is described. In Sect. 4, the experiments and results of the proposed adaptive classifier are explained. Section 5 concludes with a comparative analysis of the proposed work against existing works.

2 Related Works

Many research works have been carried out on analyzing social media data to predict customer interest. Twitter data analysis is performed by many organizations to evaluate their products or services and to learn about their usage among users. Trend analysis includes the identification of popular events being discussed on Twitter. Qian et al. [1] used a multi-modal social event tracking framework for finding the topics of social events by modeling social media documents. It makes use of an incremental learning strategy to obtain updated event topics from social media. Shi et al. [2] applied a trend identification mechanism for gauging public interest. The authors used complex event processing methods for tracking public mood. These methods use microblogs, which are transformed into microblog events using sentiment analysis. After an event is found, a summarization technique is used to summarize the public mood at different periods.
Khater et al. [3] implemented a recommendation system in which the trend is identified based on the interests of the social media users. The trending user community is identified using the location of the social media users. The recommendation is provided to users who prefer the trending topic in their domain of interest. The suggestion about the trending topic is made using information gathered from the social media users along with the content recommended by the social media application. Lai et al. [4] combined time and space attributes into a spatio-temporal model, since time series play a vital role in microblog trend detection. It helps to find the relation among the topics discussed by different user communities. The
Adaptive Classifier Using Extreme Learning Machine … 365

abnormalities in the topic diversion are also found out by correlating the time and
space of the user content.
He and Yan [5] made a model to mine blogs. This model is used to understand the
use of social media in customer co-creation. The relevant posts and blogs are identi-
fied for finding the customer co-creation. This co-creation information is helpful for
organizations to promote their products or services easily. Masud et al. [6] designed
a class detection model which is used to identify new classes in the generated data.
This class detection model helps to identify the immediate change in the generation
of data. The new class topic detection is achieved by the ensemble classification
framework. It is effective in handling the drift among the data streams, but it requires
much time to give the newly evolved topic.
Kasiviswanathan et al. [7] used distributed dictionary learning method for
handling the evolution of new topics in the data streams. When compared to the
ensemble methods, this dictionary learning focuses on the specific domain with the
dictionary of data. This reduces the learning time required for identifying the new
topic in the data streams. Aiello et al. [8] designed topic modeling for sensing trending
topics on Twitter. It is required to focus on the trend of a specific topic in a particular
domain. Co-occurrence of related events and topic ranking are important methods for detecting trending topics on Twitter. Wang et al. [9] applied enhanced topic modeling
methods to focus on the volume as well as the time-based content generation in
social media. Xie et al. [10] implemented a bursty topic detection model to identify
bursty topics from Twitter with the time series of user posts. The trending topics are
identified with the scaling of data generation.
Fatma et al. [11] performed cloud-based data stream optimization to process data
streams immediately in the real-time environment. This requires optimization of the
methods to process the streams instantaneously. The continuous query optimization
is used with the multiple plans for the data streams in the cloud environment for the
cluster of data. Abdullatif et al. [12] used fuzzy methods to perform the analysis on
non-stationary data streams. This addresses the way to find out the evolving nature
in the data streams and outlier identification.
Many works have performed a trend analysis of the Twitter social networking
application. However, identifying the top tweet content about a product or service quickly remains a challenge, since most classification-based product reputation finders require much time to train the classifier. It is therefore necessary to identify top tweets with reduced training time. In this work, an adaptive classifier is proposed to handle Twitter data
streams for identifying the popularity of the products.

3 Adaptive Classifier System Design

The system design of the adaptive classifier is shown in Fig. 1. The components of the system are the tweet extractor, sliding window, tweet repository, reservoir sampler,
366 M. A. M. Raja et al.

Fig. 1 Adaptive classifier system design

adaptive ELM classifier, domain-specific seeds, polarity corpus, and popularity predictor.

3.1 System Explanation

Twitter is the most promising social media application, used by many Internet users.
It allows the users to share their views or opinions instantly with others through the
140-character microblog, called tweets. Recently, the tweet size has been increased
to 280 characters by Twitter [13]. The tweet extractor is designed to collect tweets from Twitter. The
input keywords are given to the extractor for a specific domain. The tweets related
to these keywords are collected using Twitter streaming API. Each tweet consists
of many parameters [14] such as user name, tweet content, user id, date and time,
geolocation, retweet, and favorite.

3.2 ELM Method

ELM is a feed-forward neural network, mostly used for classification, regression, and clustering. In neural networks, the hidden nodes generally need to be tuned according to the input weights. In ELM, the hidden nodes do not need to be updated; the output weights are learned in a single step. Given the single hidden layer of ELM, the output function of the ith hidden node is h_i(x) = G(a_i, b_i, x), where a_i and b_i are the parameters of the ith hidden node. The output of the ELM for a single-layer feed-forward neural network with L hidden nodes is

f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x)

where \beta_i is the output weight of the ith hidden node. The hidden layer output mapping function of ELM is h(x) = [h_1(x), \ldots, h_L(x)]. With N training samples, the hidden layer output matrix H of the ELM is

H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix}
  = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}
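The closed-form training step implied by these equations can be sketched in a few lines. The sigmoid activation (as the choice of G) and the pseudo-inverse solver below are common, illustrative assumptions, not necessarily the authors' exact implementation:

```python
import numpy as np

def train_elm(X, y, n_hidden=100, rng=None):
    """Single-hidden-layer ELM: hidden parameters (a_i, b_i) are random
    and never tuned; output weights beta are solved in closed form."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((X.shape[1], n_hidden))  # a_i, one column per node
    b = rng.standard_normal(n_hidden)                # b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))           # G(a_i, b_i, x) as a sigmoid
    beta = np.linalg.pinv(H) @ y                     # beta = H^+ y (least squares)
    return A, b, beta

def predict_elm(X, A, b, beta):
    """f_L(x) = sum_i beta_i h_i(x) for each row of X."""
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

With enough hidden nodes, a small training set is fitted exactly by the least-squares solution; this single non-iterative step is what makes the training phase fast.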

A total of 1907 positive words and 4750 negative words are used for finding the polarity of a tweet. The classifiers are first trained on the collected tweets using
these keywords and then tested on new tweets [15].
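A minimal sketch of this polarity feature extraction, with tiny hypothetical word lists standing in for the paper's 1907-word positive and 4750-word negative corpora:

```python
# Hypothetical stand-ins for the paper's polarity word corpus.
POSITIVE = {"good", "great", "improved", "fast"}
NEGATIVE = {"worse", "bad", "slow", "expensive"}

def polarity_features(tweet):
    """Count positive and negative words in a tweet; these two counts
    form the kind of polarity features a classifier can be trained on."""
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    return (sum(w in POSITIVE for w in words),
            sum(w in NEGATIVE for w in words))
```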

4 Results and Discussion

In this section, the experiment is explained with the result analysis, and also, the
results are compared with other data stream classifiers to show better use of the
proposed adaptive classifier. The dataset used for this work is collected from Twitter
social media. Twitter provides APIs for collecting tweets. In this work, the tweets
are collected using the Twitter streaming API. The tweets are received from the
streaming API either in JSON or XML format.
The proposed adaptive classifier makes use of the extreme learning machine algo-
rithm (ELM) [16, 17]. It is the single-layer feed-forward network [18] for the training
of data in a short time. This classifier requires a single training pass rather than multiple iterations. Although neural networks may have a varying number of hidden layers, in practice one hidden layer is enough [19] for estimating a nonlinear function. Even with a single hidden layer, the number of nodes can be kept as large as required to estimate the nonlinear function from the input values. In
ELM, the training data are converted as vectors that are considered as the features
which have the attribute values. In this work, the tweet with its positive and negative
polarities of words are considered as two features.
The ELM algorithm is shown in Fig. 2. The distinct input samples are given to
the ELM classifier as inputs. There are 100 hidden nodes used in ELM for experi-
mental purposes. 10,000 tweet samples are populated using the reservoir sampler. 10

Fig. 2 Adaptive classifier using ELM

rounds are conducted to evaluate the learning capabilities of the ELM. The learning
capability is measured using both training time and training accuracy.
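The reservoir sampler used to populate the 10,000 tweet samples can be sketched with the classic Algorithm R, which keeps a uniform k-item sample from a stream of unknown length in one pass. This standard formulation is an assumption, not necessarily the authors' exact variant:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: maintain a uniform random sample of k items
    from a stream in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                reservoir[j] = item   # keep item with probability k/(i+1)
    return reservoir
```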
The training of classifiers is performed on the collected tweets. It is inferred from
Table 1 that the training time required for the adaptive classifier using ELM is less than the time required by the decision tree-based and ensemble-based stream classifiers.
The error rate is also minimum since the classifier training makes use of domain-
specific keywords as well as the polarity word corpus. In addition, the error rate is
reduced with the increase of the number of hidden nodes in ELM.
The results in Table 1 show that the adaptive classifier using ELM performs better than the other stream classifiers.

Table 1 Performance comparison of adaptive ELM and other stream classifiers on training data
Stream classifier Training data Precision Recall F-measure
Positive Negative Positive Negative Positive Negative
Adaptive ELM Sample 1 0.998 0.995 0.970 0.998 0.996 0.994
stream classifier Sample 2 0.997 0.994 0.972 0.996 0.995 0.993
Sample 3 0.998 0.995 0.971 0.998 0.996 0.994
Sample 4 0.996 0.996 0.970 0.997 0.994 0.995
Average 0.997 0.995 0.970 0.997 0.995 0.994
Ensemble Sample 1 0.986 0.984 0.992 0.976 0.986 0.984
stream classifier Sample 2 0.985 0.983 0.991 0.975 0.985 0.983
Sample 3 0.983 0.985 0.992 0.976 0.986 0.982
Sample 4 0.985 0.984 0.993 0.974 0.983 0.984
Average 0.985 0.984 0.992 0.975 0.985 0.983
Decision Sample 1 0.946 0.930 0.956 0.958 0.946 0.953
tree-based Sample 2 0.936 0.925 0.952 0.946 0.942 0.952
stream classifier
Sample 3 0.932 0.918 0.947 0.942 0.945 0.942
Sample 4 0.928 0.914 0.942 0.938 0.938 0.944
Average 0.936 0.922 0.949 0.946 0.943 0.948

The comparison of the classifiers on the training data is shown in Fig. 3. The adaptive classifier on the training data streams achieves better precision,
recall, and f-measure when compared to ensemble and decision tree-based classifiers.
Classifiers are generally used to classify the data based on the available features.
But, hierarchical classifiers like ELM are used in many scenarios, wherein the dataset

Fig. 3 Comparison of classifiers on training data



Table 2 Types of smartphone features and their class labels
Feature Class label
Battery C1
Screen size C2
Price C3
Memory size C4
Camera pixel C5
Processor speed C6
Touch sensitivity C7
Durability C8
Compactness C9

has more features with a large volume of data. In these cases, rule-based classifiers help to increase the accuracy by classifying instances missed by other classifiers [20]. In this work, an ELM-based adaptive classifier is used to classify the tweets with polarity as features. There is a need for improving the
classification process for better product recommendations for two reasons. First,
the tweets need to be considered with the features available in the Twitter data for
smartphone brands. Second, the need for a rule-based classifier is to achieve the
highest classification accuracy with the minimum number of hidden nodes in ELM.
Sometimes, the increase in the number of hidden nodes may lead to an increase
in training time for the data streams. So, there is a necessity to keep the number of
hidden nodes minimum. To overcome these difficulties, the rule database is required
so that the data stream will be correctly classified according to the rules represented
in the rule database. The features of smartphones and their corresponding class labels
are shown in Table 2 to use in the rule database.
The rules are created based on the attributes of smartphones. Five smartphone attributes are taken into consideration, with 13 attribute values in total. Choosing five of these 13 values at a time gives 13C5 = 1287 possible rule combinations. The tweets are processed using the rule
database, and Table 3 shows how the tweets are processed, and the rules are evaluated.
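The combination count is easy to verify directly; the numeric stand-ins below are hypothetical, only the count matters:

```python
from itertools import combinations
from math import comb

attribute_values = range(13)                      # stand-ins for the 13 attribute values
rules = list(combinations(attribute_values, 5))   # 5 values chosen per rule, as the paper counts them
print(len(rules), comb(13, 5))                    # both equal 1287
```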
It is necessary to consider the attribute value present in the tweets for identifying
whether the particular smartphone model may be recommended or not to the user. The
sentiment is identified as associated with the attributes mentioned in the tweet. The
tweet is analyzed with the polarity present in the content, and also, the smartphone
model is identified from the tweet content. After every tweet is evaluated against the rule database, it is concluded whether the user recommends the product or not, and in addition, the attribute or feature of the particular smartphone model that the user is interested in is identified. The comparison of the classifiers on testing data streams
is shown in Fig. 4. The classifier performance mainly varies in the unclassified data

Table 3 Process of rule database on tweets

Tweet | Attribute | Sentiment word | Polarity category | Attribute rule | Rule consequent
Galaxy S9 battery replacement time gets worse | Battery | Worse | Negative | battery_time < 8 h | Insufficient antecedents
Back to iPhone with good memory. There are much flashier options and compactness with an improved battery | Memory, compact, battery | Good | Positive | battery_time > 8 h, memory = medium or memory = high, compact = yes | Recommended
The Apple iPhone is not beautiful as earlier in small screen display, and it does the job. Little wonder why it’s not my crush. Perhaps high price and no compactness | Screen, price, compact | Not beautiful, no compactness | Negative | price = high, compact = no, screen-size = small | Not recommended

Fig. 4 Comparison of classifiers on testing data streams



Table 4 Product reputation based on the classification results


Classifier Smartphone brands Tweet category Power consumption Display size Processor speed Price +ve tweet Accuracy
Adaptive ELM Brand A Positive 488 565 735 390 96.19
Negative 176 124 284 423
Positive 597 458 486 445 95.12
Negative 180 164 152 466
Brand B Positive 472 529 855 228 98.27
Negative 369 186 227 365
Positive 488 492 798 205 97.76
Negative 385 172 239 480
Brand C Positive 478 468 418 485 94.88
Negative 418 396 212 214
Positive 412 388 346 653 95.45
Negative 324 342 312 428

streams. The number of unclassified data streams in the ELM classifier is less when
compared to the other two data stream classifiers such as ensemble and decision tree.
The product reputation is identified using the adaptive classifier. The classification
results have been shown in Table 4. Four features of the smartphone are considered
for finding out which of the features are commented with more positive or negative
tweets. Two different sets of streams are used for checking the product’s reputation.

5 Conclusion and Future Enhancement

In this work, an adaptive classifier algorithm has been proposed using extreme
learning machine along with the product’s attribute-based rule database. This work
proposes a method for adapting the classifier to the changing nature of data. The
rule-based ELM classifier adapts to the continuously generated Twitter streams. The
performance of the classifier has improved, and the classifier model adapted to new
data streams and produced better classification accuracy up to 98.27%. The proposed
adaptive classifier algorithm is suitable for handling data streams with less training
time. This experimental study provides insights for developing a classifier model for
dynamic data streams.

References

1. S. Qian, T. Zhang, C. Xu, J. Shao, Multi-modal event topic model for social event analysis.
IEEE Trans. Multimedia 18(2), 233–246 (2016)
2. S. Shi, D. Jin, G. Tiong-Thye, Real-time public mood tracking of Chinese microblog streams
with complex event processing. IEEE Access 5, 421–431 (2017)
3. S. Khater, D. Gračanin, H.G. Elmongui, Personalized recommendation for online social
networks information: Personal preferences and location-based community trends. IEEE Trans.
Computat. Soc. Syst. 4(3), 104–120 (2017)
4. E.L. Lai, D. Moyer, B. Yuan, Topic time series analysis of microblogs. IMA J. Appl. Math.
81(3), 409–431 (2016)
5. W. He, G. Yan, Mining blogs and forums to understand the use of social media in customer
co-creation. Comput. J. 58(9), 1909–1920 (2015)
6. M.M. Masud, Q. Chen, L. Khan, Classification and adaptive novel class detection of feature-
evolving data streams. IEEE Trans. Knowl. Data Eng. 25(7), 1484–1497 (2013)
7. S.P. Kasiviswanathan, G. Cong, P. Melville, R.D. Lawrence, Novel document detection for
massive data streams using distributed dictionary learning. IBM J. Res. Dev. 57(3), 1–15
(2013)
8. L.M. Aiello, G. Petkos, C. Martin, Sensing trending topics in Twitter. IEEE Trans. Multimedia 15(6), 1268–1282 (2013)
9. Z. Wang, L. Shou, K. Chen, G. Chen, S. Mehrotra, On summarization and timeline generation
for evolutionary tweet streams. IEEE Trans. Knowl. Data Eng. 27(5), 1301–1315 (2015)
10. W. Xie, F. Zhu, J. Jiang, E.P. Lim, K. Wang, TopicSketch: Real-time bursty topic detection
from Twitter. IEEE Trans. Knowl. Data Eng. 28(8), 2216–2229 (2016)
11. M.N. Fatma, R.M. Ismail, N.L. Badr, M.F. Tolba, Cloud-based data streams optimization.
WIREs Data Min. Knowl. Discov. 8(8), e1247 (2018)

12. A. Abdullatif, F. Masulli, S. Rovetta, Clustering of nonstationary data streams: A survey of fuzzy partitional methods. WIREs Data Min. Knowl. Dis. e1258 (2018)
13. https://en.wikipedia.org/wiki/Twitter, Last accessed on 13 Sept 2021
14. M.A.M. Raja, S. Swamynathan, Tweet analyzer: Identifying interesting tweets based on the
polarity of tweets. Adv. Intell. Syst. Comput. 410, 307–316 (2015)
15. M.A.M. Raja, S. Swamynathan, Stratified and reservoir sampling algorithms for continuous
twitter data streams to improve the classification accuracy, in International Conference on
Knowledge and Computing Technology (2016)
16. X. Zhixin, M. Yao, A fast incremental method based on regularized extreme learning machine.
Proc. Adapt. Learn. Optim. 1, 15–30 (2014)
17. G.B. Huang, H. Zhou, X. Ding, Extreme learning machine for regression and multiclass
classification. IEEE Trans. Syst. Man Cybern. 42(2), 513–529 (2012)
18. G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications.
Neurocomputing 70(1), 489–501 (2006)
19. J. Wang, Z. Deng, S. Wang, Q. Gao, Training generalized feedforward kernelized neural
networks on very large datasets for regression using minimal-enclosing-ball approximation, in
Proceedings in Adaptation, Learning and Optimization, vol. 3 (2014), pp. 203–214
20. F. Fabris, A.A. Freitas, J.M.A. Tullet, An extensive empirical comparison of probabilistic
hierarchical classifiers in datasets of ageing-related genes. IEEE/ACM Trans. Comput. Biol.
Bioinf. 13(6), 1045–1058 (2016)
Decision Making on Covid-19
Containment Zones’ Lockdown Exit
Process Using Fuzzy Soft Set Model

R. K. Mohanty, B. K. Tripathy, and Sudam Ch. Parida

Abstract Covid-19 is one of the biggest pandemics in the history of mankind. It has kept the modern world hostage for more than one and a half years now. Strict
lockdown is creating more havoc in the minds of everybody. The decision of exiting
from a lockdown and deciding the kind of strictness needed as per the scenario is not
an easy task for any administration. This paper provides a new approach to estimate
the seriousness of the situation to strategize the exit from lockdown. There are many
mathematical models to handle uncertainty such as fuzzy set, rough set, soft set,
generalizations of these models, and their hybrid models. In this paper, a decision-
making application using the notion of a fuzzy soft set is provided to assess the
seriousness of the corona situation, which helps to decide on lockdown relaxations.

Keywords COVID-19 · Decision making · Soft set · Fuzzy soft set · Lockdown

1 Introduction

COVID-19 is a contagious disease caused by one of the strains from the family of
Coronavirus named 2019-nCoV. The disease started in Wuhan, China at the end of the
year 2019 and gradually spread to the whole world leading to the ongoing coronavirus
pandemic. Viruses mutate easily to adapt to the surrounding environment, which
creates new variants of the same virus with more resistance to the medicines available.
Due to different mutated variants of the virus, the containment of the virus becomes
difficult. The difficulties grow more with the more spreading of the disease. The
symptoms of the coronavirus are shown in 2–14 days of infection. Due to its high
transmission rate the virus is creating havoc among all, even with a recovery rate

R. K. Mohanty (B)
SCOPE, VIT, Vellore, Tamil Nadu 632014, India
B. K. Tripathy
SITE, VIT, Vellore, Tamil Nadu 632014, India
e-mail: tripathybk@vit.ac.in
S. Ch. Parida
KBV Mahavidyalaya, Kabisurya Nagar, Odisha, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 375
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_29

of greater than 98% and fatality rate of less than 2%. Another dangerous feature of
the disease is that it shows different symptoms in different persons. The presence of
many variants of the virus adds to the list of symptoms even further.
To design a model for handling any uncertain situation like this pandemic, there
are many uncertainty-based models available in the literature, such as fuzzy set (FS) [18], rough set, and intuitionistic fuzzy set [1]. Molodtsov [11] proposed a soft set (SS) model
to handle uncertainty-based problems with multiple parameters. Every parameter
is associated with a subset in a SS. In [15] characteristic functions are introduced
to redefine the definitions and notions of SS. SS is applied for decision making in
[9]. Mohanty et al. [10], Sooraj et al. [14], and Tripathy et al. [16, 17] redefined
FSS through the use of characteristic functions and made the definitions more useful
for decision-making applications. Later, several other hybrid models related to SS were redefined using the characteristic-function approach, and several decision-making algorithms were proposed using these models. There
are several articles about uncertainty-based decision making applications to handle
different situations during Covid-19 pandemic [2–5, 12, 13]. This paper proposes
an algorithm to strategize the unlock process by assessing the seriousness of the
coronavirus pandemic.

2 Definitions and Notions

Definition 1 Let W be a set of all elements under consideration and let C be a parameter set. A soft set (over W) is denoted by (F, C) and is defined as shown in Eq. 1:

F : C → Poset(W)     (1)

where Poset(W) denotes the set of all subsets of W.


(F, C) can also be defined through characteristic functions [15], as shown in Eq. 2. The set of parametric characteristic functions defined over (F, C) is χ_(F,C) = {χ^t_(F,C) : t ∈ C}, where for each t ∈ C, χ^t_(F,C) : W → {0, 1} and, ∀s ∈ W,

χ^t_(F,C)(s) = 1, if s ∈ F(t); 0, otherwise.     (2)

Definition 2 Let W be a set of all elements under consideration. A FS G drawn from W is given by its membership function μ_G, where μ_G : W → [0, 1], such that ∀s ∈ W, μ_G(s) is called the membership grade of s in G. A FS reduces to a crisp set when μ_G : W → {0, 1}.

Definition 3 (FSS) Let W be a set of all elements under consideration and C be a parameter set. A FSS (over W) is represented by (F, C), as shown in Eq. 3, such that

F : C → Poset(W)     (3)

where Poset(W) is the fuzzy power set of W.

Tripathy et al. [17] described the set of parametric membership functions of (F, C) as μ_(F,C) = {μ^a_(F,C) : a ∈ C}. For each a ∈ C, the membership function is defined as shown in Eq. 4:

μ^a_(F,C)(x) = α,  α ∈ [0, 1]     (4)
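As a concrete illustration of Definition 3 and Eq. 4, a FSS can be stored as a parameter-to-fuzzy-subset mapping; the parameters and grades below are made up for illustration only:

```python
# Universe W and parameter set C (illustrative values).
W = ["s1", "s2", "s3"]
C = ["infected", "recovered"]

# (F, C) stored as {parameter: {element: membership grade in [0, 1]}}.
fss = {
    "infected":  {"s1": 0.20, "s2": 0.24, "s3": 0.13},
    "recovered": {"s1": 0.17, "s2": 0.16, "s3": 0.15},
}

def membership(fss, a, x):
    """mu^a_(F,C)(x): membership grade of element x under parameter a."""
    return fss[a].get(x, 0.0)
```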

3 Decision Making on Covid-19 Containment Zones’ Lockdown Exit Process Using FSS

In this section, let us discuss a decision-making application to strategize the exit process from lockdown in the ongoing Covid-19 pandemic using the FSS model.

3.1 Background of Application

The coronavirus pandemic has been creating havoc all over the world since the beginning of 2020. To save people from coronavirus and contain the spread of the virus, governments impose forceful lockdowns in their respective areas. This application is about assessing the seriousness of the pandemic in an area to exit from lockdown.
The rate of spread of a viral pandemic like Covid-19 depends on many parameters, such as test positivity rate, fatality rate, healthcare index, active cases, recovery rate, average new cases per day, vaccinated population percentage, active ratio, doubling rate, homeless population, population density, air quality, traffic mobility, vulnerable age group population, transmission rate, and asymptomatic positivity rate.
To decide on more relaxations from lockdown, administrators need to consider as many of these safety parameters as possible within their constraints. The decision will be more accurate if it is taken by assessing all these parameters over areas as small as possible, because the parameters will not be uniform across all parts of a larger area.
7-day moving average: Some parameter values fluctuate a lot. For example, the number of new COVID-19 cases per day is difficult to assess, as it depends on many other parameters such as reporting time, testing time, and exposure to infection. Because of this, it is very difficult to apply those values directly in decision-making applications. The “7-day moving average” approach handles these types of parameters nicely. An n-day moving average

Fig. 1 Daily covid-19 case increase with the 7-day moving average approach

takes the mean value of the past n days, and the result is saved as the value of that day. For example, the value for May 7 is the mean of the values from May 1 to May 7; for May 8, it is the mean of the values from May 2 to May 8, and so on. This approach adjusts for delays in case confirmation. In most cases, coronavirus disease shows its symptoms in 3–10 days, so the 7-day moving average is an appropriate approach in this case. Parameters like test positivity rate, fatality rate, recovery rate, average new cases per day, transmission rate, and asymptomatic positivity rate can be handled using the “7-day moving average” approach.
Figure 1 shows the real COVID-19 cases growth values and daily growth values
plotted as per the 7-day moving average approach.
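The smoothing described above reduces to a short sliding-window mean:

```python
def moving_average(values, window=7):
    """n-day moving average: each output is the mean of the current day
    and the preceding window-1 days."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]
```

For instance, moving_average([1, 2, 3, 4, 5, 6, 7, 8]) yields [4.0, 5.0]: the means of days 1–7 and 2–8.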

3.2 Parameter Description

In this section, the definitions of some important parameters used to assess the coron-
avirus situation are discussed. Due to space constraints, all the discussed parameters
are not used in the application provided in the next section.
i. Test Positivity Rate = (Total number of positive cases found) / (Total number of tests done)
ii. Transmission Rate (R0): Average number of persons getting infected from a
single person.
iii. Doubling Rate: It is the number of days in which the number of infected
people will be doubled.
iv. Asymptomatic positivity rate: The number of asymptomatic positive cases
identified, divided by total covid positive cases confirmed.
v. Fatality: Percentage of deaths among total covid positive cases identified.

vi. Recovery Ratio: Percentage of people recovered among total covid positive
cases identified.
vii. Active Ratio: Percentage of people currently suffering from covid among total covid positive cases identified.
Fuzzification of collected data can be done as follows.
Infected: Number of persons infected divided by population of the area considered.
Values for the parameters Active, Recovered, Vaccination can also be computed
similarly, that is, dividing the value by the total population of the area considered.
The values which are expressed in percentages can be fuzzified by dividing the
value by 100.
Doubling Rate (R_d): In this application, the doubling rate value is fuzzified using the formula in Eq. 5:

Fuzzified R_d = min(1 − 1/R_d, 1)     (5)

In decision-making applications, values come from discrete sources and vary a lot among the alternatives. Such values can be normalized using Eq. 6:

Normalized value a'_i = a_i / Σ_{j=1}^{n} a_j,  i = 1, 2, ..., n     (6)

where a'_i is the normalized value of a_i.
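Eqs. 5 and 6 translate directly into code; a minimal sketch:

```python
def fuzzify_doubling_rate(rd):
    """Eq. 5: min(1 - 1/Rd, 1); larger doubling rates map closer to 1."""
    return min(1 - 1 / rd, 1)

def normalize(column):
    """Eq. 6: divide each value in a parameter column by the column sum."""
    total = sum(column)
    return [v / total for v in column]
```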


As mentioned in Tripathy et al. [17], parameters fall into two categories. Parameters which impact the result positively are called positive parameters, and parameters which impact the result negatively are called negative parameters. Every parameter should be given a priority value to represent it correctly in a decision-making application. The priority values of positive parameters lie in [0, 1]. The actual priority of a negative parameter is multiplied by (−1) for computation, to showcase its negative impact on the result, so the priority values of negative parameters lie in [−1, 0].
The priority of a parameter in this application can be computed by counting the number of other parameters impacted by that parameter and then fuzzifying the value:

P(t_i) = (Number of other parameters impacted by the parameter t_i / Total number of parameters) × C_{t_i}     (7)

where P(t_i) denotes the priority of the parameter t_i, and C_{t_i} denotes the type of the parameter: C_{t_i} = −1 if the parameter t_i is negative, otherwise C_{t_i} = 1.
To compare the scores of multiple competitors, the following formula is used:

Cscore(x_i) = Score(x_i) × n / Σ_{j=1}^{n} Score(x_j)     (8)

where Cscore(x_i) denotes the comparison score of the competitor x_i and n denotes the number of competitors.

3.3 Proposed Algorithm

Step 1: Get the Covid-19 data table and fuzzify the data as mentioned in the previous section (Table 1).
Step 2: Normalize all values as mentioned in Eq. 6 to get the required data in FSS
format (Table 2).
Step 3: Compute the priority of all parameters as mentioned in Eq. 7.
Step 4: Compute the priority data table by multiplying respective parameter
priorities to FSS values.
Step 5: Compute the comparison table by using the formula given in Eq. 8.
Step 6: The decision table can be constructed by computing the total score of
every competitor in the comparison table.
Step 7: Rank the scores obtained in the comparison table. The place with a higher
score should get a better rank. Better-ranked places are better suitable for more
relaxations in lockdown.

Table 1 Covid-19 data represented using FSS


e1 e2 e3 e4 e5 e6 e7 e8 e9
s1 0.04 0.01 0.03 0.17 0.02 0.84 0.14 0.13 0.98
s2 0.04 0.01 0.04 0.10 0 0.80 0.19 0.21 0.97
s3 0.02 0.01 0.02 0.06 0.01 0.74 0.25 0.14 0.95
s4 0.01 0 0.00 0.03 0.01 0.74 0.25 0.05 0.95
s5 0.02 0 0.01 0.05 0.01 0.89 0.10 0.08 0.98
s6 0.06 0 0.05 0.07 0.01 0.90 0.09 0.16 0.97

Table 2 Normalized data


e1 e2 e3 e4 e5 e6 e7 e8 e9
s1 0.20 0.20 0.21 0.36 0.23 0.17 0.14 0.17 0.17
s2 0.24 0.31 0.23 0.21 0.05 0.16 0.19 0.27 0.17
s3 0.13 0.21 0.11 0.13 0.16 0.15 0.25 0.18 0.16
s4 0.03 0.05 0.03 0.06 0.16 0.15 0.24 0.07 0.16
s5 0.08 0.05 0.09 0.11 0.19 0.18 0.10 0.10 0.17
s6 0.31 0.18 0.34 0.14 0.22 0.18 0.08 0.21 0.17

3.4 Application of the Proposed Decision-Making Algorithm


for a Lockdown Exit Strategy

Let E be a parameter set denoted by E = {e1, e2, e3, e4, e5, e6, e7, e8, e9}. The elements of this set represent all the parameters considered in the following application: Infected, Active, Recovered, Test Positivity Rate, Fatality, Recovery Ratio, Active Ratio, Doubling Rate, and Vaccination.
To make the application more precise, let us take only six states of India for comparison. Let X be the set of those six states, denoted by X = {s1, s2, s3, s4, s5, s6}. The elements of this set represent the states Maharashtra, Kerala, Karnataka, Uttar Pradesh, Tamil Nadu, and Delhi, respectively. All data are collected from [6–8]. Table 1 is constructed by fuzzifying the Covid-19 data as mentioned in the previous section.
Normalize the Table 1 values as mentioned in Eq. 6 to compute the normalized data table (Table 2). Normalization helps when most of the values in a column have very small or very high fuzzy values.
Table 3 shows the impact of each parameter on the others: the value is 1 if a
parameter impacts another parameter in any way, and 0 otherwise. A parameter that
impacts more parameters should be given correspondingly higher priority. Table 3
computes the priority of each parameter by taking the row sum and dividing it by
the total number of parameters.
Negative parameters in this application are Infected (1), Active (2), Test
Positivity Case (4), Fatality (5), and Active Ratio (7), so the priorities of these
parameters are multiplied by (−1) for further computation. The new parameter priorities
are −0.11, −0.89, 0.67, −0.78, −0.88, 0.78, −0.78, 0.78, 0.89 for the parameters
1, 2, …, 9, respectively.
Table 4 is computed by multiplying the values of Table 2 by the respective
parameter priorities.
Values in Table 5 are computed by comparing each value with all other values in
the respective parameter column.

Table 3 Absolute parameter priority


1 2 3 4 5 6 7 8 9 Priority
1 1 0 0 0 0 0 0 0 0 0.11
2 1 1 1 1 1 1 1 0 1 0.89
3 0 1 1 0 1 1 1 0 1 0.67
4 1 1 0 1 1 1 1 0 1 0.78
5 1 1 1 1 1 1 1 0 0 0.78
6 0 1 1 1 1 1 1 0 1 0.78
7 1 1 0 1 1 1 1 0 1 0.78
8 0 1 0 1 1 1 1 1 1 0.78
9 1 1 1 1 1 1 1 0 1 0.89
382 R. K. Mohanty et al.

Table 4 Data prioritization


1 2 3 4 5 6 7 8 9
s1 −0.02 −0.17 0.14 −0.28 −0.21 0.13 −0.11 0.13 0.15
s2 −0.03 −0.28 0.16 −0.16 −0.04 0.13 −0.15 0.21 0.15
s3 −0.01 −0.19 0.07 −0.10 −0.14 0.12 −0.19 0.14 0.15
s4 0 −0.04 0.02 −0.05 −0.14 0.12 −0.19 0.05 0.15
s5 −0.01 −0.05 0.06 −0.08 −0.16 0.14 −0.08 0.08 0.15
s6 −0.03 −0.16 0.22 −0.11 −0.19 0.14 −0.07 0.16 0.15

Table 5 Parameter wise comparison


1 2 3 4 5 6 7 8 9
s1 −0.03 −0.15 0.16 −0.88 −0.36 0.02 0.12 0.02 0.01
s2 −0.05 −0.78 0.27 −0.19 0.63 −0.02 −0.10 0.47 0
s3 0.03 −0.22 −0.22 0.19 0.05 −0.08 −0.37 0.08 −0.01
s4 0.09 0.62 −0.56 0.48 0.05 −0.07 −0.36 −0.46 −0.01
s5 0.06 0.60 −0.31 0.28 −0.11 0.07 0.33 −0.31 0.01
s6 −0.10 −0.07 0.67 0.12 −0.27 0.08 0.38 0.20 0

Table 6 Score table


States s1 s2 s3 s4 s5 s6
Covid safe score −1.09 0.23 −0.55 −0.21 0.60 1.02

Table 6 is constructed by computing the sum of the scores across all parameters
for each state.
A state with a higher score can be given more relaxations in lockdown. The
seriousness of the pandemic situation in each state can be gauged from the
scores in Table 6; a higher score is better.
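The pipeline from the prioritized data (Table 4) through the comparison table (Table 5) to the score and rank (Table 6) can be sketched in Python. The comparison rule used below, in which each entry is the sum of differences between a state's value and every competitor's value in that column and the score is the row sum, is an assumption reconstructed from the published tables, since Eq. 8 itself is not reproduced in this excerpt.

```python
# Hedged sketch: reconstructs Tables 5-6 from the prioritized data in Table 4,
# assuming the comparison entry for state i under parameter j is the sum of
# differences v_ij - v_kj over all competitors k (a common fuzzy-soft-set
# comparison rule); the Covid-safe score is then the row sum.

PARAMS = 9
# Rows s1..s6 of Table 4 (prioritized data).
T4 = [
    [-0.02, -0.17, 0.14, -0.28, -0.21, 0.13, -0.11, 0.13, 0.15],  # s1
    [-0.03, -0.28, 0.16, -0.16, -0.04, 0.13, -0.15, 0.21, 0.15],  # s2
    [-0.01, -0.19, 0.07, -0.10, -0.14, 0.12, -0.19, 0.14, 0.15],  # s3
    [ 0.00, -0.04, 0.02, -0.05, -0.14, 0.12, -0.19, 0.05, 0.15],  # s4
    [-0.01, -0.05, 0.06, -0.08, -0.16, 0.14, -0.08, 0.08, 0.15],  # s5
    [-0.03, -0.16, 0.22, -0.11, -0.19, 0.14, -0.07, 0.16, 0.15],  # s6
]

def comparison_table(data):
    n = len(data)
    col_sum = [sum(row[j] for row in data) for j in range(PARAMS)]
    # c[i][j] = sum_k (v_ij - v_kj) = n * v_ij - column_sum_j
    return [[n * data[i][j] - col_sum[j] for j in range(PARAMS)]
            for i in range(n)]

def scores(data):
    return [sum(row) for row in comparison_table(data)]

def rank(data):
    # Higher score -> better rank (more lockdown relaxation).
    s = scores(data)
    return sorted(range(len(s)), key=lambda i: s[i], reverse=True)

order = rank(T4)  # indices 0..5 correspond to states s1..s6
```

Because Table 4 is rounded to two decimals, the recomputed scores differ from Table 6 by a few hundredths, but the resulting order (s6 > s5 > s2 > s4 > s3 > s1) matches the published score table.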

4 Conclusions

This paper proposes an algorithm, based on fuzzy soft sets, to help devise
strategies for exiting lockdown in the ongoing pandemic, and provides an
application of the proposed algorithm to make the process easier to understand.
The application here used state-level data, but the approach can also be applied
to smaller areas to make the decisions even more effective. The approach could be
further improved by adopting other uncertainty-based models such as rough sets
and interval-valued mathematics.

References

1. K. Atanassov, Intuitionistic fuzzy sets. Fuzzy Set Syst. 20, 87–96 (1986)
2. A. Ghosh, S. Roy, H. Mondal, S. Biswas, R. Bose, Mathematical modelling for decision making
of lockdown during COVID-19. Appl. Intell. 1–17 (2021)
3. A. Gulia, N. Salins, Ethics-based decision-making in a COVID-19 pandemic crisis. Indian J.
Med. Sci. 72(2), 39 (2020)
4. M. Gupta, S.S. Mohanta, A. Rao, G.G. Parameswaran, M. Agarwal, M. Arora, S. Bhatnagar,
Transmission dynamics of the COVID-19 epidemic in India and modeling optimal lockdown
exit strategies. Int. J. Infect. Dis. 103, 579–589 (2021)
5. M. Herle, A.D. Smith, F. Bu, A. Steptoe, D. Fancourt, Trajectories of eating behavior during
COVID-19 lockdown: longitudinal analyses of 22,374 adults. Clin. Nutrit. ESPEN 42, 158–165
(2021)
6. https://www.who.int/publications/m/item/weekly-epidemiological-update-on-covid-19---27-april-2021. Last accessed 25 May 2021
7. https://www.mohfw.gov.in/. Last accessed 25 May 2021
8. https://www.covid19india.org/. Last accessed 25 May 2021
9. P.K. Maji, R. Biswas, A.R. Roy, An application of soft sets in a decision making problem.
Comput. Math. Appl. 44, 1077–1083 (2002)
10. R.K. Mohanty, T.R. Sooraj, B.K. Tripathy, IVIFS and decision-making. Adv. Intell. Syst.
Comput. 468, 319–330 (2017)
11. D. Molodtsov, Soft set theory—first results. Comput. Math. Appl. 37, 19–31 (1999)
12. A. Smirnova, L. DeCamp, G. Chowell, Mathematical and statistical analysis of doubling
times to investigate the early spread of epidemics: application to the COVID-19 pandemic.
Mathematics 9(6), 625 (2021)
13. S. Snuggs, S. McGregor, Food & meal decision making in lockdown: how and who has Covid-19
affected? Food Qual. Prefer. 89(104145), 1–6 (2021)
14. T.R. Sooraj, R.K. Mohanty, B.K. Tripathy, Improved decision making through IFSS. Smart
Innov. Syst. Technol. 77, 213–219 (2018)
15. B.K. Tripathy, K.R. Arun, A new approach to soft sets, soft multisets and their properties. Int.
J. Reason. Based Intell. Syst. 7(3/4), 244–253 (2015)
16. B.K. Tripathy, R.K. Mohanty, T.R. Sooraj, A. Tripathy, A modified representation of IFSS and
its usage in GDM. Smart Innov. Syst. Technol. 50, 365–375 (2016)
17. B.K. Tripathy, T.R. Sooraj, R.K. Mohanty, A new approach to fuzzy soft set theory and its
application in decision making. Adv. Intell. Syst. Comput. 411, 305–313 (2016)
18. L.A. Zadeh, Fuzzy sets. Inf. Control 8, 338–353 (1965)
Deep Learning on Landslides:
An Examination of the Potential
Commitment an Expectation of Danger
Evaluation in Sloping Situations

J. Aruna Jasmine and C. Heltin Genitha

Abstract Landslides are recurring geological hazards in the rainy season that
bring fatalities, property damage, and economic losses. Landslides are
responsible for at least 17% of all fatalities from natural hazards worldwide,
and for nearly 25% of the annual losses caused by natural hazards. Owing to
global climate change, the frequency of landslide events has increased, and the
losses and damages associated with landslides have increased accordingly.
Consequently, accurate forecasting of landslide occurrence, together with ground
monitoring and advance warning, needs to be carried out. These developments are
important tasks for reducing the damages and losses caused by landslides.
Landslides are becoming a worldwide problem: the frequency and magnitude of
landslides threatening large populations and the environment are increasing
across the world. Remote sensing plays a vital role in landslide prediction. In
particular, satellite remote sensing is effective at covering a large territory
for capturing images, which in turn are used as input for training a system,
based on neural networks, to predict landslides about two weeks in advance.

Keywords Displacement prediction · Gated recurrent unit · Landslide monitoring ·
Early warning · Deep learning

1 Introduction

Natural hazards such as landslides are among the most dangerous geological
disasters in many regions throughout the world. Landslides can cause huge property

J. Aruna Jasmine (B)


Department of Information Technology, Sri Sairam Engineering College, Chennai 600 044, India
Anna University, Chennai 600025, India
C. Heltin Genitha
Department of Computer Science and Engineering, St. Joseph’s College of Engineering, Chennai
600 119, India
e-mail: heltingenitha@stjosephs.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 385
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_30
386 J. Aruna Jasmine and C. Heltin Genitha

damage and human losses in mountainous regions. A recent world disaster report
shows that debris flows and landslides account for 42% of the worldwide occurrence
of catastrophic events, with typical annual economic losses due to landslides
amounting to billions of US dollars. According to data from the Centre for
Research on the Epidemiology of Disasters (CRED) in Brussels, Belgium, landslides
are responsible for nearly 17% of all deaths caused by catastrophic events around
the world [1].
Reliable early warning systems are a sensible approach to landslide risk
reduction. Such methods can be implemented successfully if landslide movement
can be forecast. For example, during the 1985 Xintan landslide, which occurred
26 km upstream of the Three Gorges Dam (TGD), accurate forecasting of the
landslide displacement significantly reduced the financial losses and casualties [2].
Landslide susceptibility assessment techniques can be grouped into qualitative
(knowledge-driven) and quantitative (data-driven and physically based)
approaches, depending on how they treat landslide conditioning factors and
models (Fig. 1). In general, knowledge-driven methods rely entirely on the expert
judgment of those performing the susceptibility assessment [3]. Data-driven
methods, in turn, are widely used for susceptibility analysis over large regions,
even though they lack a sound physical representation of slope failure [4].
Quantitative assessments may be divided into two kinds: data-driven and
physically based methods. In data-driven methods, the statistical relationships
between the locations of past
Fig. 1 Detection of landslides using remote sensing techniques


Deep Learning on Landslides: An Examination … 387

landslides and the landslide-triggering factors are assessed, after which
quantitative predictions are made for landslide-free zones with nearly comparable
conditions. These methods use data from previous landslides to determine the
relative importance of each factor, on the assumption that conditions which have
triggered landslides in the past will do so again [5]. The main techniques used
to predict landslides from past patterns are bivariate statistical methods,
multivariate approaches, and artificial neural network analysis. In bivariate
statistical analyses, each factor layer, for instance slope or lithology, is
considered individually. In multivariate statistical methods, the joint
relationship between landslides and geo-environmental variables such as land
development is considered. An artificial neural network offers a computational
mechanism that can acquire, represent, and compute a mapping from one
multivariate space of data to another, given a set of data representing the
associations [6]. An artificial neural network is trained using a set of
associated input and output values. Since data-driven approaches are used by
most researchers, their individual properties are not detailed in this review;
these techniques have proved suitable for practically all such works, their most
common application being the analysis of landslide distribution over large
regions and the corresponding influencing factors [7]. Additionally, these
landslide susceptibility assessment procedures consider only the associations
between landslides and the related factors, not the failure mechanism [8].
Moreover, such models generally leave out the temporal aspects of landslides and
cannot predict changes over time in the factors that bring a slope to a
threshold condition (e.g., variation in the water table and change in land
use) [9].

2 Remote Sensing for Detecting and Forecasting Landslides

In an earlier study on landslide occurrence prediction through remote sensing
and its aftermath in Europe, multispectral images were used by a group of
researchers to evaluate landslide development. The data collected from such
multispectral images show that this technology is suitable for the mapping and
monitoring of landslide characteristics, although the extent and detail of
interpretation vary considerably. Figure 1 shows the landslides detected using
remote sensing techniques.
Roy and Islam [3] identified effective methods for detecting landslides on planes
or surfaces using remotely sensed data. The most basic methods rely on qualitative
properties, for example the number, distribution, type, and character of debris
flows. The data for this kind of analysis are collected either from
high-resolution satellite images or from airborne photography. Stereo data such
as aerial photography are required to predict landslides more effectively. It also

includes qualitative description, with measurements taken along and across the
mass movement events.

3 Research on Landslide Dams Around the World

The severe disruption, disaster, loss of property and wealth, and suffering caused
by landslides have always been a matter of serious concern and discussion.
Attempts to predict landslides date back as early as the 1930s. To advance this
discussion, Schuster and Costa published a book volume on landslide dams; their
approach has proved particularly effective for stream blockages. Figure 2 shows a
landslide dam in New Zealand.
Despite the extensive study done thus far, the multimodal geomorphic
characteristics seen in large landslide dams continue to pose enormous challenges
for defining an assessment criterion. It is unclear, for example, whether the
landslide volume or the volume of the impounded reservoir would serve as a proxy
measure for assessing geomorphic criticality or impact.

Fig. 2 Landslide on dams in New Zealand



4 Artificial Neural Networks for Landslide Susceptibility

Artificial Neural Networks (ANNs) have much in common with statistical
approaches. In the application of the Multilayer Perceptron (MLP) and the
Probabilistic Neural Network (PNN) to landslide susceptibility, both approaches
may be classified as prediction models, and a few ANNs have been built on a
statistical basis. These two lines of development have, however, proceeded
largely independently, so it is difficult to arrive at a widely accepted
description that captures the relationship between ANNs and statistical methods.
Most importantly, ANNs offer a different perspective on problems that cannot be
addressed by (separate) exact methods [3].
The use of an artificial neural network begins by identifying the type of
problem to be modeled. Generally speaking, ANNs are applied to classification or
regression problems. Typical classification problems are credit assignment (is an
individual a good or a bad credit risk) and signature verification (forged,
genuine), while in regression problems the objective is to predict the value of a
mostly continuous variable (e.g., tomorrow's stock market price). For general
applications, the susceptibility of an area to landsliding can be viewed as a
classification problem. Figure 3 shows landslide prediction using Artificial
Neural Networks. In this way, the ANN outputs can be regarded as a degree of
membership of each terrain unit in the class "landslide". The Trajan network
simulator (Trajan Software, 2001) can set up the various types of ANNs available
for classification problems [10].
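As a minimal illustration of the classification view described above (terrain-unit factors in, degree of membership in the class "landslide" out), here is a sketch using a plain logistic model trained by gradient descent on synthetic data. The factor names, the synthetic labeling rule, and the learning rate are all illustrative assumptions; the paper's own experiments use the Trajan MLP/PNN setup, not this model.

```python
import numpy as np

# Hedged sketch: landslide susceptibility framed as binary classification.
# A logistic model maps factor values (e.g., slope, wetness) of each terrain
# unit to a degree of membership in the class "landslide". The factor names,
# synthetic data, and learning rate below are illustrative assumptions.

rng = np.random.default_rng(0)

# Synthetic terrain units: steeper, wetter units are labeled landslide-prone.
X = rng.uniform(0.0, 1.0, size=(200, 2))           # columns: slope, wetness
y = (0.7 * X[:, 0] + 0.3 * X[:, 1] > 0.55).astype(float)

w = np.zeros(2)
b = 0.0
for _ in range(2000):                               # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # membership degree in [0, 1]
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 1.0 * grad_w
    b -= 1.0 * grad_b

membership = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # per-unit landslide membership
accuracy = np.mean((membership > 0.5) == y)
```

The sigmoid output plays the role the text describes: a graded degree of membership of each terrain unit in the class "landslide", rather than a hard yes/no label.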

Fig. 3 Prediction of landslides using an Artificial neural network



5 Landslide Early Warning

Landslide early warning is essential for the advance recognition of landslide
indicators so that residents can be evacuated from potential landslide zones to
reduce the damage caused by landslides. Early detection of landslides across a
wide natural landscape can be accomplished by monitoring precipitation and
changes in the earth's physical characteristics on a continuous or
near-continuous basis. Most landslide-triggering systems have a triggering
threshold that is determined by precipitation and the earth's physical
characteristics [11].
Precipitation thresholds for the initiation of landslides have been defined at
global, regional, or local scales. Figure 4 shows the early warning framework for
landslide prediction. It shows the various parameters that contribute to the
occurrence of a landslide, such as the subsurface model, groundwater tables,
rainfall scenarios, soil parameters, and the slope profile. It also demonstrates
how the stability of a slope is calculated using hydrology and stability models.
The results could be presented as simulation details, result summaries, or a
map [12].
They compared the new single and composite thresholds with five widely

Fig. 4 Early warning system for landslide prediction



applied identification thresholds. Using RTI estimates of recorded precipitation
events, they provided a methodology for selecting the lower critical RTI value and
the supercritical RTI value. Once the two critical RTI values have been
established for a precipitation event, a graph with the RTI value at time t on
the ordinate and time t on the abscissa may be used to estimate the probability
of an upcoming debris-flow event [13]. The system has a spatial resolution of a
few square kilometers: Emilia Romagna's hilly and mountainous region is divided
into 19 territorial units, each with its own reference rain gauge and
precipitation thresholds.
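The two-threshold RTI logic described above can be sketched as follows. The definition RTI = hourly intensity × accumulated event rainfall, and the two critical values, are illustrative assumptions for this sketch; in [13] the thresholds are calibrated from recorded precipitation events rather than fixed by hand.

```python
# Hedged sketch of a two-threshold RTI warning scheme. Assumed definition:
# RTI_t = I_t * R_t, with I_t the hourly rainfall intensity (mm/h) and R_t
# the rainfall accumulated so far in the ongoing event (mm). The critical
# values below are illustrative placeholders, not calibrated thresholds.

RTI_LOW = 200.0    # lower critical RTI value  (assumed)
RTI_HIGH = 500.0   # supercritical RTI value   (assumed)

def warning_levels(hourly_rain):
    """Map an hourly rainfall series (mm/h) to per-hour warning levels."""
    levels, accumulated = [], 0.0
    for intensity in hourly_rain:
        accumulated += intensity
        rti = intensity * accumulated
        if rti >= RTI_HIGH:
            levels.append("evacuate")   # debris-flow occurrence likely
        elif rti >= RTI_LOW:
            levels.append("alert")      # probability rising, keep monitoring
        else:
            levels.append("normal")
    return levels

storm = [2, 5, 10, 12, 25, 5]           # mm/h, an illustrative event
levels = warning_levels(storm)
# -> ['normal', 'normal', 'normal', 'alert', 'evacuate', 'alert']
```

The last hour illustrates the point made in the text: a modest intensity late in an event can still exceed the lower threshold because the accumulated rainfall is already large.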
Precipitation, as previously stated, is the most critical input for landslide
early warning. Nevertheless, landslide triggering can differ even when the
precipitation conditions are nearly equal. Precipitation measured by rain gauges
in landslide-prone zones is required for the early warning of both small and
large landslides. In any case, the effect of precipitation is difficult to
estimate since it is influenced by a variety of factors, including soil
heterogeneity. The early warning framework of [14] turned landslide early warning
into a threshold-based model. It should address, in order, the standardization of
commonly used monitoring equipment, alarm measures, evacuation procedures, and
the various disaster-preparedness and response organizations. However much
flexibility is required during use, it must adhere to the requirements of the
standard.
Lissak et al. [15] developed a detailed standard for landslide early warning
systems comprising seven sub-systems, taking into account the four essential
components of a people-centered early warning framework (UN-ISDR, 2006) and the
hybrid socio-technical approach for disaster risk reduction [16] (Fig. 5). It
should be clarified that the monitoring and warning services, currently regarded
as the focal point of early warning systems, will remain an essential element of
the disaster management program.

Fig. 5 Landslide prediction early warning time point



Various additional factors co-influence landslide occurrence, for instance: slope
variation within the slope unit, the average curvature of the slope unit,
lithological variety, the average distance to the drainage network, and the
average distance to major structural elements.
A zone ZG118 on the Baishuihe landslide and a zone ZG111 on the Bazimen landslide
were chosen to train and test the prediction model, as follows:
• For zone ZG118, the training dataset spanned August 2003 to December 2012, and
the remaining data from January 2013 to December 2013 were used to test the
model.
• For zone ZG111, the training dataset spanned August 2003 to December 2011, and
the data from January 2012 to December 2012 were used to test the model.
Considering the accumulated displacement curve, the displacement during the flood
season displays a consistent step-wise progression. [15] demonstrated that the
long-term trend can be extracted by removing the influence of the periodic
component from the accumulated displacement using a moving-average approach. The
TGR's water level fluctuates between 145 and 175 m at different times, and one
year, matching the TGR's scheduling cycle, was chosen as the moving-average
period [17].
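The decomposition just described (trend extracted with a one-year moving average, periodic term as the residual) can be sketched in NumPy. The synthetic monthly series below, a linear creep plus an annual flood-season oscillation, is an illustrative stand-in for the actual Baishuihe/Bazimen monitoring data.

```python
import numpy as np

# Hedged sketch of the decomposition described above: the cumulative
# displacement is split into a trend term (moving average over one annual
# reservoir-scheduling cycle) and a periodic term (the residual). The
# synthetic monthly series below is illustrative, not real monitoring data.

months = np.arange(120)                              # 10 years of monthly readings
trend_true = 8.0 * months                            # steadily creeping slope (mm)
seasonal = 40.0 * np.sin(2 * np.pi * months / 12)    # flood-season oscillation
displacement = trend_true + seasonal

PERIOD = 12                                          # one-year moving-average window

def moving_average(x, window):
    """Trailing moving average; the first window-1 points use a growing window."""
    out = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        lo = max(0, i - window + 1)
        out[i] = x[lo:i + 1].mean()
    return out

trend = moving_average(displacement, PERIOD)
periodic = displacement - trend                      # input to the recurrent predictor
```

Once the window spans a full year, the seasonal oscillation averages out of the trend (up to a constant lag offset), leaving a clean periodic residual for the recurrent network to model.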
The two models for zones ZG118 and ZG111 each had three layers: the first two
were LSTM layers and the third was a dense layer. The length of the input data
sequence was important, i.e., how many historical data points were fed as
input [18]; the grid-search process did not constrain the optimal length of the
data sequence.
The displacement time series of stations XD1, XD2, XD3, and XD4 began later or
were shorter than those of stations ZG118 and ZG93, even though stations ZG118
and ZG93 had comparable displacement patterns. Station ZG118 was chosen as the
example for evaluating displacement prediction with the new model because it has
a longer displacement time series.
Figure 6 presents counts for the different landslide types. Blue shading in
Fig. 6 represents the data gathered from Google Earth, while the orange shading
represents the data collected from field mapping. The chart depicts the number of
landslides at different elevations. The data come from the landslide inventory
map for the Chittagong Hilly Areas of Bangladesh, based on Google Earth and
field mapping.
Figure 7 shows the match rate between the hazard map produced by the ITC group
and the prediction maps for the various landslide types. Map 1 in Fig. 7 is the
geomorphologic hazard map made by the ITC group based on the interpretation of
aerial photographs and field assessment; it was produced by experts, with the
hazard levels classed as low, moderate, and high. Map 2 is the prediction map
produced by the quantitative prediction model for the various landslide types.
A data-driven support vector machine (SVM) was used to select the optimal
landslide conditioning factors. Factors that cause landslides, such as soil and
precipitation, are increasingly essential for predicting a landslide earlier and
with greater precision, with

Fig. 6 Landslide types investigation

Fig. 7 Landslide type-examination based on experts

the result that we gathered the data through Google Earth and plotted a chart
representing the weight of the factors causing landslides. Figure 8 shows the
landslide conditioning factors.

Fig. 8 Evaluation of landslide vulnerability

6 Conclusion

This research reviews landslide prediction approaches using various remote
sensing applications, such as GIS and artificial neural networks, and examines
the limitation of current predictive models that express only static
relationships. In the light of Gated Recurrent Unit (GRU) neural networks, which
have proved more successful at forecasting landslides than static models, and
making use of actual data, this study takes a thorough look at an effective
model for forecasting landslide displacement. The current prediction model trains
the framework with a long-term computation. We use the GRU to address the
disadvantages of LSTM, such as the extra time it takes to train the framework
and its high sensitivity to configuration; it is never easy to achieve the best
of all factors at the same time. Another disadvantage is that the LSTM model can
be highly sensitive to the size of the dataset: if the training samples are
insufficient, the neural networks will not be fully trained, and the model's
prediction accuracy will suffer.
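To make the GRU/LSTM comparison concrete, here is a minimal single-step GRU cell in NumPy: two gates (update and reset) and no separate cell state, versus the LSTM's three gates plus a cell state, which is where the GRU's saving in parameters and training time comes from. The weights here are random placeholders; a real displacement predictor would learn them from the monitoring time series.

```python
import numpy as np

# Hedged sketch: a single-step GRU cell in NumPy, illustrating the update and
# reset gates described in the text. Weights are random placeholders, not a
# trained landslide-displacement model.

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, n_in, n_hidden):
        s = 0.1
        # Three weight blocks: update gate z, reset gate r, candidate state.
        self.Wz = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.Wr = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.Wh = rng.normal(0, s, (n_hidden, n_in + n_hidden))
        self.bz = np.zeros(n_hidden)
        self.br = np.zeros(n_hidden)
        self.bh = np.zeros(n_hidden)

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh + self.bz)            # update gate
        r = sigmoid(self.Wr @ xh + self.br)            # reset gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h]) + self.bh)
        return (1.0 - z) * h + z * h_cand              # interpolated new state

cell = GRUCell(n_in=1, n_hidden=8)
h = np.zeros(8)
for value in [0.1, 0.4, 0.9, 0.3]:                     # toy displacement inputs
    h = cell.step(np.array([value]), h)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden activations stay bounded in (−1, 1) without the separate cell-state bookkeeping an LSTM needs.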

References

1. M. Baldonado, C.-C.K. Chang, L. Gravano, A. Paepcke, The stanford digital library metadata
architecture. Int. J. Digit. Libr. 1, 108–121 (1997)
2. K.B. Bruce, L. Cardelli, B.C. Pierce, Comparing object encodings, in Theoretical Aspects of
Computer Software. Lecture Notes in Computer Science, vol. 1281, ed. by M. Abadi, T. Ito
(Springer, Berlin, 1997), pp. 415–438

3. A.C. Roy, M.M. Islam, Predicting the probability of landslide using artificial neural network,
in 2019 5th International Conference on Advances in Electrical Engineering (ICAEE) (2019),
pp. 874–879. https://doi.org/10.1109/ICAEE48663.2019.8975696
4. J.A.V. Ortiz, A.M. Martínez-Graña, A neural network model applied to landslide susceptibility
analysis (Capitanejo, Colombia). Geomat. Nat. Hazards Risk 9(1), 1106–1128 (2018). https://
doi.org/10.1080/19475705.2018.1513083
5. Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is
difficult. Neural Netw. IEEE Trans. 5(2), 157–166 (1994)
6. Y. Cao, K. Yin, D.E. Alexander, C. Zhou, Using an extreme learning machine to predict the
displacement of step-like landslides in relation to controlling factors. Landslides 13(4), 725–736
(2016)
7. S.-Y. Chen, W.-Y. Chou, Short-term traffic flow prediction using EMD-based recurrent
Hermite neural network approach, in 2012 15th International IEEE Conference on Intelligent
Transportation Systems (IEEE, 2012). https://doi.org/10.1109/ITSC.2012.6338665
8. J. Corominas, et al., Prediction of ground displacements and velocities from groundwater level
changes at the Vallcebre landslide (Eastern Pyrenees, Spain). Landslides 2(2), 83–96 (2005)
9. J. Du, K. Yin, S. Lacasse, Displacement prediction in colluvial landslides, three Gorges
reservoir, China. Landslides 10(2), 203–218 (2013)
10. R. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in MHS’95. Proceedings
of the Sixth International Symposium on Micro Machine and Human Science (IEEE, 1995)
11. Y. Fan, et al., TTS synthesis with bidirectional LSTM based recurrent neural networks, in
Fifteenth Annual Conference of the International Speech Communication Association (2014)
12. X. Fan, et al., Failure mechanism and kinematics of the deadly June 24th 2017 Xinmo landslide,
Maoxian, Sichuan, China. Landslides 14(6), 2129–2146 (2017)
13. X. Fan, X. Qiang, G. Scaringi, Brief communication: post-seismic landslides, the tough lesson
of a catastrophe. Nat. Hazard. 18(1), 397–403 (2018)
14. Z. Ma, G. Mei, F. Piccialli, Machine learning for landslides prevention: a survey. Neural
Comput. Appl. 33, 10881–10907 (2021). https://doi.org/10.1007/s00521-020-05529-8
15. C. Lissak, A. Bartsch, M. De Michele, et al., Remote sensing for assessing landslides and
associated hazards. Surv. Geophys. 41, 1391–1435 (2020). https://doi.org/10.1007/s10712-
020-09609-1
16. F.A. Gers, J. Schmidhuber, Recurrent nets that time and count, in Proceedings of the
IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural
Computing: New Challenges and Perspectives for the New Millennium, vol. 3 (IEEE, 2000)
17. F. Paul, Remote sensing-based assessment of hazards from glacier lake outbursts: A case study
in the Swiss Alps. Can. Geotech. J. 39, 316–330 (2002)
18. J. Innes, Debris flows. Prog. Phys. Geogr. 7(4), 469–501 (1983); International Federation of
the Red Cross and Red Crescent Societies, World Disasters Report 2001 (2001)
The Good, The Bad, and The Missing:
A Comprehensive Study on the Rise
of Machine Learning for Binary Code
Analysis

S. Priyanga, Roopak Suresh, Sandeep Romana, and V. S. Shankar Sriram

Abstract Binary code analysis is an enabling technique for a wide range of
applications such as digital forensics, software reengineering, malware
detection, and hardening software against known vulnerabilities. Despite recent
advancements in compilers and runtime libraries, the lack of high-level semantics
remains a critical limitation in binary analysis. This challenge strongly affects
the performance of existing binary disassembly tools, which may produce
inaccurate mappings between source code and binary code. To address this problem,
machine learning techniques have attracted significant attention in recent times
owing to their effectiveness and automation in analyzing code at the binary
level. Hence, this article discusses the challenges in existing disassembly tools
and attempts to show the significance of machine learning approaches in binary
code analysis.

Keywords Binary program analysis · Function identification · Machine learning ·
Recurrent neural network · Disassembly tools

1 Introduction

Binary Code Analysis (BCA) is the process of deriving properties about the behavior
of computer programs which can be done by either dynamic program analysis or static

S. Priyanga · R. Suresh · V. S. Shankar Sriram (B)


Centre for Information Super Highway, School of Computing, SASTRA Deemed University,
Thanjavur 613401, India
e-mail: sriram@it.sastra.edu
S. Priyanga
e-mail: priyanga@sastra.ac.in
R. Suresh
e-mail: roopaksuresh@sastra.ac.in
S. Romana
Centre for Development of Advanced Computing (C-DAC), Hyderabad 501510, India
e-mail: sandeepr@cdac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 397
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_31
398 S. Priyanga et al.

program analysis [1]. It is primarily focused on two major areas: program
optimization and program correctness. The input to program analysis can be source
code, a binary file, byte-code, a memory dump, an abstract model, etc. Two
factors emphasize the importance of analyzing binary program files: (1) most of
the time, high-level source code is not available, and (2) there is mistrust in
the compilation chain.
BCA is currently evolving in fields such as defense, medicine, and other Internet
of Things [2–4] based environments where the source code is not readily
accessible. However, analyzing binaries is not an easy task, as stripped binaries
lack precise information such as function boundaries when compared to source
code. Function identification is an important step in binary analysis, as most
binary analysis approaches depend on function boundary information to identify
functions and other security vulnerabilities effectively. On the other hand,
compiler optimization makes the function identification process complex in binary
code [5]. Owing to the varying behavior of compiler optimizations, function
identification faces the following challenges [6]:
1. Not all bytes are functions: every byte in a binary is visible when reverse
engineered, but some bytes hold data, such as variables, that belong to no
function in the program.
2. Functions are non-contiguous: a function need not occupy contiguous memory;
it can interleave with, or share code with, other functions.
3. Function reachability: some functions in a binary program are reached only
under certain system conditions, such as high memory or CPU usage; other
functions are called only from other programs and are never reached along
any internal path.
4. Compilations are not the same: the performance of BCA is influenced by
compiler optimization, which varies from compiler to compiler. The compiler
version and optimization level are the major factors that reshape functions;
moreover, a statically linked binary disassembles differently from a
dynamically linked one.
5. Multiple entries in a function: under compiler optimization, a function may
have multiple entry points, i.e., a single function can be entered at
multiple locations.
To solve the aforementioned problems in function identification, several disas-
sembly tools and machine learning approaches have been developed. This article
presents a detailed study of BCA. Further, it puts forth the challenges and research
gaps in the existing disassembly tools and machine learning approaches for
function identification in binary program analysis. We also provide our insights on
future directions for machine learning approaches in BCA.
This article presents the background study in Sect. 2. Section 3 discusses the chal-
lenges in existing disassembly tools. Section 4 provides insights on machine learning
in binary program analysis. Section 5 discusses decision making without human
intervention, and Sect. 6 presents the scope and assumptions, followed by the
conclusions in Sect. 7.
The Good, The Bad, and The Missing: A Comprehensive Study … 399

2 Background

Most of the analysis techniques have been designed and developed for source code
analysis or byte-code analysis. The crude way of statically analyzing the binaries is
to disassemble them and then analyze the assembly code for the behavior of interest.
This process is time-consuming and completely dependent on the analyst's skill in
decoding the bits and pieces of the target logic that appear in the assembly
code. With the exponential increase in the rate of production of binary files, it is
necessary to automate or semi-automate the process of decoding the assembly code
[7].
Binary analysis can evaluate object and library files for software quality and find
bugs or vulnerabilities, which further helps in extending data flow analysis to
binary code. However, a few problems are specific to binary analysis: (i) type
information of data/addresses/code is missing, (ii) high-level structure is lost,
(iii) the compiler can introduce dynamic jumps on its own, (iv) code and data
share the same memory, and (v) all problems in source code analysis still exist
at the binary level.
Binary analysis includes the following stages: (i) disassembling the binary,
(ii) recovering semantics, (iii) translating to an intermediate representation (IR),
and (iv) compiling back to binary [8]. Among the four stages, the loss of semantic
information is the biggest hurdle in enabling the analysis of binaries, as compilers
strip the semantic information during the compilation process from source to
binary. Thus, the critical step in enabling the analysis of binaries is
recovering/constructing the semantic information from the binary code, i.e.,
recovering semantics.
Among global efforts, the Defense Advanced Research Projects Agency
(DARPA) launched the Cyber Grand Challenge, a competition to create auto-
matic defensive systems capable of reasoning about flaws, formulating patches, and
deploying them on a network in real time [9]. Under this program, academia and
industry have designed frameworks and technologies that enable binary analysis
and help identify vulnerabilities in binaries. A few frameworks and tools developed
as part of this initiative are MAYHEM, ANGR (FIRMALICE), and MC-SEMA.
These tools are available for x86-based architectures. Similar translation tools are
not currently available for other architectures, such as MIPS and ARM, which are
commonly used in many popular devices/systems.

3 Challenges in Existing Disassembly Tools

When a program evolves from code to artifact, it is easy to understand the logic
of the particular command that makes the program perform the specified operation,
since the design and code are already known. In the reverse engineering process,
where only the program is present and the code and design logic are unknown,
disassembling the binary poses the greatest hurdle for reverse engineers. This situa-
tion is further complicated by the plethora of reverse engineering tools available;
the capacity of a reverse engineer is decided by the ability to use the available
tools to their maximum potential [10]. Moreover, the existing disassembly tools
may not provide a particular set of advantages over one another despite being in
the same category, as listed in Table 1.
IDA Pro [11] allows disassembly and step-by-step instruction jumping to under-
stand the flow of a program; it also provides a Control Flow Graph (CFG), which
improves the visualization and understanding of the control flow inside the program.
Though Ollydbg [12] possesses both a debugger and a disassembler, it is not as
versatile as IDA Pro. Immunity Debugger [13] provides the same functionalities as
Ollydbg, and it restricts disassembling files that are not meant for static disassembly.
The disassembly output of Ollydbg and Immunity Debugger is far less functional
and understandable when compared to IDA Pro. Binary Ninja is entirely different
from both IDA Pro and Radare2; though it is a popular reverse engineering
platform, it offers fewer functionalities than IDA Pro [14].
Tools like Radare2 and GNU Debugger provide complex maneuverability inside
the disassembled code. However, both are command-line tools, so a reverse engineer
requires deep knowledge to utilize them. Cutter [15] addresses this drawback of
Radare2: it is built as a GUI-based tool for easier use and has an option to use Ghidra
functionalities as well [16, 17]. This tool has a promising future, with variable
resource utilization. However, it is not as robust as IDA Pro in functionality and
versatility, and it is not considered a viable option due to stability issues.
IDA Pro allows Python scripting, but it does not support multiple architectures
without a commercial license [18]. For a novice, scripting the process and purchasing
the commercial license would be daunting, as the tool is not cheap. Beyond these
considerations, such tools are difficult to learn and struggle to produce improved
results as the number of binaries increases [19].
Over the past few decades, research in function identification has grown signifi-
cantly and has produced remarkable disassembly tools. Each tool adopts a different
systematic strategy with its own strengths and weaknesses. However, the
performance of most existing tools relies on algorithms and heuristics to prove
their correctness. An extensive study of disassembly tools emphasizes the need to
address the following research gaps in function identification:
(i) Accuracy and efficiency
(ii) Varying results for different types of tools
(iii) Factors that affect the results of existing disassembly tools
These limitations signal to researchers that a research gap still exists in improving
the efficiency and effectiveness of function identification. Existing disassembly
tools employ function signatures to identify functions; however, such signatures can
be generated automatically by ML approaches. Hence, recent researchers focus on
designing ML approaches for function identification that can learn the key features
of binary code automatically [20].
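The signature-based strategy mentioned above can be made concrete with a small sketch. The prologue pattern (push rbp; mov rbp, rsp) and the helper below are illustrative assumptions, not the matching logic of any particular tool:

```python
# Sketch: signature-based function-start detection, the strategy most
# classical disassembly tools use.  The pattern below is the common
# x86-64 prologue "push rbp; mov rbp, rsp" (55 48 89 e5); real tools
# ship large databases of such signatures.
PROLOGUE = bytes.fromhex("554889e5")

def find_function_starts(binary: bytes, signature: bytes = PROLOGUE):
    """Return the offsets of every occurrence of the signature."""
    starts, pos = [], binary.find(signature)
    while pos != -1:
        starts.append(pos)
        pos = binary.find(signature, pos + 1)
    return starts

# Two fake "functions" embedded in padding bytes.
blob = b"\x90\x90" + PROLOGUE + b"\xc3" + b"\x00" * 4 + PROLOGUE + b"\xc3"
print(find_function_starts(blob))  # [2, 11]
```

As noted above, optimized code often omits or rewrites such prologues, which is exactly why signature-only approaches miss functions and why learned signatures are attractive.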
Table 1 Summary of disassembly tools

Features           | IDA Pro                       | Binary Ninja   | Ollydbg
License            | Commercial                    | Commercial     | Single developer
Static or dynamic  | Both                          | Static         | Both (more dynamic)
Speed              | Longest load time             | Long load time | Faster in dynamic debugger
Multiple instances | Requires heavy resources
Memory utilization | High                          | Medium         | Medium
Automation         | Requires a commercial license
Architectures      | x86 32-bit, x86 64-bit, ARMv7, ARMv8, Thumb2, PowerPC, MIPS, 6502, 8085, 8086, 8051, AMD K6-2 3-D, and many more (paid) | x86 32-bit, x86 64-bit, ARMv7, ARMv8, Thumb2, PowerPC, MIPS, 6502 (paid) | x86 32-bit, x86 64-bit, ARMv7, ARMv8, Thumb2, PowerPC, MIPS, 6502 (paid)
Load on user       | Requires working knowledge of the tool

4 Machine Learning in Binary Program Analysis

In general, machine learning-based BCA comprises two phases: (i) a data pre-
processing phase and (ii) a training and testing phase (Fig. 1). In the data pre-processing
phase, data is collected and pre-processed with feature selection and feature extrac-
tion techniques. In feature extraction, patterns of binary sequences are learned from
feature vectors to identify the distinct patterns in the binaries. Feature extraction
alone may not reduce the size of the original data, which opens room for feature
selection; this step helps in identifying informative features from binary code and
achieves dimensionality reduction [21]. The reduced data obtained from these techniques
acts as the input for training the machine learning model. Based on the training data,
the ML-based classifier identifies vulnerability paths, recognizes functions, and so
on.
The major challenge in binary analysis is the lack of high-level semantic
structures, since compilers discard them during source code compilation; this gives
intruders an opportunity to modify binary sequences in ways that have a huge
impact on the output. In general, functions are the building blocks of high-level
programs, but the binaries derived from them are mostly undifferentiated sequences.
It is hard to extract useful information and functional relation details from binary
sequences. Therefore, existing binary analysis techniques rely on function-
information repositories and initially attempt to recover the functions from the
binary sequences.

Fig. 1 A general machine learning framework for binary program analysis


In recent times, several ML approaches have been designed for function iden-
tification to automate the signature generation process, as is evident from recent
tools such as ByteWeight and the CMU Binary Analysis Platform (CMU-BAP)
[22]. However, these tools rely on predefined function signatures for the identifi-
cation of functions/constructs from the given binary sequences; they also require
learning for every compiler version and fail to act independently on unknown
compilers. Rosenblum et al. [23] initially paved the way for designing an ML approach
for function identification, with a supervised learning approach that considers
graphlet features to define the structure of the program. Bao et al. [22] employ
weighted prefix trees that learn CFG-based features to improve the efficiency of
function identification. Shin et al. [24] apply a Recurrent Neural Network (RNN),
which solves the function boundary identification problem efficiently by learning
tokens of byte sequences; it reduced computation time drastically while achieving
better accuracy on a prior test suite. FID proposes an ensemble learning approach
using LinearSVC, AdaBoost, and Gradient Boosting to recognize functions; this
method works well across different compilers and optimizations [25]. The complex,
undifferentiated, and ill-conditioned nature of binary instructions makes it difficult
for the above-said tools to achieve high accuracy (Table 2).
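The weighted-prefix-tree idea attributed to Bao et al. [22] can be illustrated with a minimal sketch. The training bytes, tree depth, and scoring rule below are toy assumptions, not ByteWeight's actual model:

```python
# Minimal sketch of the weighted-prefix-tree idea behind ByteWeight [22]:
# each node stores how often a byte prefix started a function vs. not,
# and a new sequence is scored by the deepest prefix seen in training.
from collections import defaultdict

class PrefixTree:
    def __init__(self, max_depth=4):
        self.counts = defaultdict(lambda: [0, 0])  # prefix -> [neg, pos]
        self.max_depth = max_depth

    def train(self, seq: bytes, is_function_start: bool):
        for d in range(1, self.max_depth + 1):
            self.counts[seq[:d]][is_function_start] += 1

    def score(self, seq: bytes) -> float:
        """Estimated P(function start) from the deepest trained prefix."""
        best = 0.5  # no evidence either way
        for d in range(1, self.max_depth + 1):
            neg, pos = self.counts.get(seq[:d], (0, 0))
            if neg + pos:
                best = pos / (neg + pos)
        return best

tree = PrefixTree()
tree.train(b"\x55\x48\x89\xe5", True)    # prologue-like function start
tree.train(b"\x00\x00\x00\x00", False)   # padding, not a function
print(tree.score(b"\x55\x48\x89\xe5\x90"))  # 1.0
```

The real system learns such trees from corpora of compiled binaries with known function starts, which is what removes the need for hand-written signatures.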
Based on the literature study, the following points summarize the need for machine
learning in function identification:
1. Analyzing multiple binaries manually is tedious and time-consuming.
2. Pre-processing without information loss results in a massive volume of
samples.
3. Feature engineering identifies informative samples, thereby achieving dimen-
sionality reduction.
4. Training the learning model with more informative and unique samples yields
a robust and reliable function identification tool.

Table 2 Performance evaluation of machine learning approaches

Tools                 | Function identification | Input                                                      | Merits
Rosenblum et al. [23] | ✓                       | Idiom features                                             | Identifies the source compiler of an unknown binary sequence
Bao et al. [22]       | ✓                       | Raw bytes and disassembled instructions                    | Prior information about the compiler is not required
Shin et al. [24]      | ✓                       | Fixed-length byte sequences, not an entire binary sequence | Performs well for trained patterns; not feasible for new patterns
Wang et al. [7]       | ✓                       | Semantic information                                       | Outperforms existing approaches on obfuscated code

5 Decision Making Without Human Intervention

In contrast to these ML approaches, Andriesse et al. proposed a novel function
identification approach called Nucleus [26]. This method is compiler-agnostic and
capable of identifying function boundaries without the help of predefined signatures
or learning, and it achieves better accuracy than other ML approaches. From the
existing works, we observed that applying machine learning algorithms to function
identification poses various challenges:
(i) Each byte is considered an input for the learning model; therefore, it is
difficult to design a classifier that operates over millions of input sequences
at a time
(ii) Each byte in a byte sequence may represent multiple pieces of information;
hence, knowledge extraction is a computationally intensive task
(iii) As assembly-level languages consist of multiple correlated functions via
call and jump commands, handling their binary representations and arbitrary
rearrangement is a time-intensive task

6 Scope and Assumptions

From our study, we observed that most machine learning-based function iden-
tification models construct a baseline profile of functions and conditional statements
and their corresponding binary sequences. Also, the performance of ML approaches
is limited by syntax and semantic feature extraction. Hence, we assume that the
function identification model can be envisaged as a multi-task knowledge discovery
framework that integrates phases such as the extraction and selection of informative
feature vectors from binary sequences (Fig. 2); thereby, a binary sequence given as
input to the ML model will yield the respective function as output. To this end, we
suggest a Recurrent Neural Network (RNN)-based, deep learning function
identification model, since an RNN can incorporate context and efficiently scale
computation and memory with the binary sequence length.
However, the performance of an RNN relies on various intrinsic parameters such
as the learning rate, weight decay, number of epochs, batch size, momentum, etc.
Identifying optimal values of these parameters enhances the performance of the
RNN in terms of maximum classification accuracy. The hyperparameters of an RNN
can be optimized using simple metaheuristic approaches such as swarm optimization,
genetic algorithms, etc., which are predominantly used for parameter tuning in deep
learning approaches.
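Such metaheuristic tuning can be sketched with a tiny (1+1)-style evolutionary search. The search space, mutation scale, and stand-in objective below are all assumptions; in practice each evaluation would train and validate the RNN:

```python
# Sketch of metaheuristic hyperparameter tuning: a (1+1)-style
# evolutionary search over RNN-like hyperparameters.  The objective is
# a stand-in for validation accuracy (peaking near lr = 1e-2,
# weight decay = 1e-4); real use would train the network per candidate.
import math
import random

SPACE = {
    "learning_rate": (1e-4, 1e-1),
    "weight_decay": (1e-6, 1e-2),
    "batch_size": (16, 256),  # kept continuous for simplicity of the sketch
}

def sample(rng):
    """Log-uniform initial candidate."""
    return {k: math.exp(rng.uniform(math.log(lo), math.log(hi)))
            for k, (lo, hi) in SPACE.items()}

def mutate(params, rng):
    """Perturb one hyperparameter multiplicatively in log space."""
    child = dict(params)
    k = rng.choice(list(SPACE))
    lo, hi = SPACE[k]
    child[k] = min(hi, max(lo, child[k] * math.exp(rng.gauss(0, 0.3))))
    return child

def objective(p):  # stand-in for validation accuracy
    return (-abs(math.log10(p["learning_rate"]) + 2)
            - abs(math.log10(p["weight_decay"]) + 4))

rng = random.Random(0)
best = sample(rng)
for _ in range(200):               # keep the better of parent and child
    cand = mutate(best, rng)
    if objective(cand) >= objective(best):
        best = cand
print("tuned learning rate ~", round(best["learning_rate"], 4))
```

A swarm or genetic algorithm follows the same pattern with a population instead of a single parent; the expensive part is always the objective evaluation.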
As a takeaway of this study, the following key aspects should be considered while
evaluating the performance of the ML-based function identification model:
(i) Accuracy of the ML approach towards the complete dataset
(ii) Run time performance
(iii) Stability of the results over different compiler architectures/optimization

Fig. 2 Deep learning-based semantic aware function identification model

7 Conclusions

The varying behavior of compiler optimizations has posed critical challenges to
function identification in BCA. This paper has discussed the key challenges of
function identification in BCA introduced by different compiler optimizations. From
our extensive study, we have witnessed a lack of disassembly tools and machine
learning approaches that recognize functions in binary code reliably. However, we
have also observed that ML approaches help to speed up the process of BCA
significantly. Most of these approaches showed their significance in malware
detection and software testing. Further, this inspires researchers to design and test
ML approaches for function identification with the known behavior of test binaries.

Acknowledgements This work is supported by the Ministry of Electronics and Information
Technology (MeitY), India (4(2)/2019-ITEA).

References

1. J. Lee, T. Avgerinos, D. Brumley, TIE: principled reverse engineering of types in binary
programs, in 18th Annual Network & Distributed System Security Symposium Proceedings,
NDSS’11 (2011)
2. M. Wagner et al., A survey of visualization systems for malware analysis. Eurographics
Conference on Visualization (2015), pp. 105–125
3. S.C. Satapathy, B.N. Biswal, S.K. Udgata, J.K. Mandal, Proceedings of the 3rd International
Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014,
Advances in Intelligent Systems and Computing, vol. 1 (2014)
4. L.C. Harris, B.P. Miller, Practical analysis of stripped binary code. ACM SIGARCH Comput.
Archit. News 33(5), 63–68 (2005)
5. X. Meng, B. P. Miller, Binary code is not easy, in ISSTA 2016—Proceedings of 25th
International Symposium on Software Testing and Analysis (2016), pp. 24–35
6. X. Meng, J.M. Anderson, J. Mellor-Crummey, M.W. Krentel, B.P. Miller, S. Milaković, Parallel
binary code analysis, in Proceedings of ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming, PPOPP (2021), pp. 76–89
7. S. Wang, P. Wang, D. Wu, Reassembleable disassembling, in 24th {USENIX} Security
Symposium ({USENIX} Security 15) (2015), pp. 627–642
8. DARPA, Cyber Grand Challenge (CGC) (2016). [Online]. Available: https://www.darpa.mil/
program/cyber-grand-challenge. Accessed: 02-July-2021
9. C. Pang et al., SoK: all you ever wanted to know about x86/x64 binary disassembly but were
afraid to ask (2020), pp. 833–851
10. Hex-Rays, IDA Pro. [Online]. Available: https://hex-rays.com/ida-pro/. Accessed: 02-July-2021
11. O. Yuschuk, OllyDbg. [Online]. Available: www.ollydbg.de. Accessed: 02-July-2021
12. Immunity, Immunity Debugger. [Online]. Available: https://www.immunityinc.com/products/
debugger/. Accessed: 02-July-2021
13. Binary Ninja. [Online]. Available: https://binary.ninja/. Accessed: 11-July-2021
14. Cutter. [Online]. Available: https://cutter.re/. Accessed: 02-July-2021
15. Radareorg, Radare2. [Online]. Available: https://github.com/radareorg/radare2. Accessed: 02-
July-2021
16. Ghidra, Ghidra. [Online]. Available: https://ghidra-sre.org/. Accessed: 02-July-2021
17. G. Sri Shaila, A. Darki, M. Faloutsos, N. Abu-Ghazaleh, M. Sridharan, IDAPro for IoT malware
analysis?, in 12th USENIX Workshop on Cyber Security Experimentation and Test (CSET ’19)
(2019)
18. A. Lee, A. Payne, T. Atkison, A review of popular reverse engineering tools from a novice
perspective, in Proceedings of the 2010 International Conference on Software Engineering
Research & Practice (2018), pp. 68–74
19. Z. L. Chua, S. Shen, P. Saxena, Z. Liang, Neural nets can learn function type signatures from
binaries this paper is included, in Proceedings of the USENIX Security Symposium (2017),
pp. 99–116
20. H. Xue, S. Sun, G. Venkataramani, T. Lan, Machine learning-based analysis of program
binaries: A comprehensive study. IEEE Access 7, 65889–65912 (2019)
21. T. Bao, J. Burket, M. Woo, R. Turner, D. Brumley, BYTEWEIGHT: Learning to recognize
functions in binary code, in Proceedings of 23rd USENIX security symposium (2014), pp. 845–
860
22. N. Rosenblum, Z. Xiaojin, B. Miller, K. Hunt, Learning to analyze binary computer code, in
Proceedings of the National Conference on Artificial Intelligence, vol. 2 (2008), pp. 798–804
23. E.C. Richard Shin, D. Song, R. Moazzezi, Recognizing functions in binaries with neural
networks, in Proceedings of 24th USENIX Security Symposium (2015), pp. 611–626
24. S. Wang, P. Wang, D. Wu, Semantics-aware machine learning for function recognition in
binary code, in Proceedings of 2017 IEEE International Conference on Software Maintenance
and Evolution, ICSME 2017 (2017), pp. 388–398
25. D. Andriesse, A. Slowinska, H. Bos, Compiler-agnostic function detection in binaries, in 2nd
IEEE European Symposium on Security and Privacy, EuroS P 2017 (2017), pp. 177–189
26. S. Schrittwieser, J. Kinder, G. Merzdovnik, E. Weippl, S. Katzenbeisser, Protecting software
through obfuscation: can it keep pace with progress in code analysis? ACM Comput. Surv.
49(4), 1–40 (2015)
Ensemble Machine Learning Approach
to Detect Various Attacks in a
Distributed Network of Vehicles

Aparna Pramanik and Asit Kumar Das

Abstract The Internet of vehicles is a kind of distributed network that consists of
smart cars, sensors, and microcomputers to acquire various travel information. They
communicate with each other via the controller area network (CAN) bus communication
protocol to obtain information about vehicle identification, accident detection, and danger
warnings. This information is transmitted from one node to another on the bus
without containing the source and destination addresses. Therefore, an attacker can
compromise the system by injecting wrong information during communication,
so detecting these attacks is important to protect the system.
We have proposed an ensemble machine learning algorithm using K-nearest neigh-
bors (KNN) and eXtreme Gradient Boosting (XGBoost) to detect this type of attack.
Initially, the KNN and XGBoost classifiers are combined to develop a base classifier,
and finally, an XGBoost-based meta-classifier is developed for the prediction of attacks.
In the case of KNN, we have chosen seven nearest neighbors to predict the class.
The datasets applied in our proposed model are class imbalanced, i.e., they have
a non-uniform distribution of non-attack data and attack data. This may create an
over-fitted model, which is ineffective and dangerous to use for attack prediction.
Here, we have applied an undersampling operation to handle the imbalanced nature
of the dataset, and a hashing operation has also been performed to transform the
dataset into a new form. The experimental results show that the prediction accuracy
of the model is improved by stacking the KNN classifier with the XGBoost classifier
for the DoS attack, Fuzzy attack, RPM attack, and Gear attack.

Keywords IoV · CAN bus · KNN · XGBoost · Imbalance dataset

A. Pramanik · A. K. Das (B)
Department of Computer Science and Technology, Indian Institute of Engineering Science and
Technology, Shibpur, Howrah 711103, India
e-mail: akdas@cs.iiests.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 407
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_32
408 A. Pramanik and A. K. Das

1 Introduction

An urban area can be upgraded to a smart city by allowing connections among
intelligent vehicles that work together to accomplish complex jobs [1]. Besides
this advantage, it raises cybersecurity challenges for the network of connected
vehicles. Nowadays, as cyberattacks increase, cybercriminals try to attack the
network to disrupt all types of communication on it [2]. These communications
are made through the Controller Area Network (CAN), which provides the protocol
for the communication [3]. The main motivation behind this network is to improve
traffic safety and driving efficiency.

1.1 Basics of Internet of Vehicles

In all industries, the impact of technology has been increasing, and many intelligent
devices are being introduced in all sectors; therefore, the use of IoT has been
growing. Among its many applications, one of the most important is the Internet of
vehicles (IoV) [4]. IoV is an intelligent system for transportation. It consists of
hardware and numerous networks that permit cars to share information with several
components in real time [5]. IoV enables five kinds of communication through the
network. These are as follows:
1. Vehicle to Sensor (V2S) [6] structure allows vehicles to send information to the
sensors. Then the sensors send that information to the microcontroller.
2. Vehicle to Vehicle (V2V) [7] structure helps to interchange data regarding the
position and speed of a vehicle to the other neighboring vehicles wirelessly.
3. Vehicle to Infrastructure (V2I) [8] structures allow vehicles to share information
with the supporting Roadside Units (RSUs) wirelessly.
4. Vehicle to Cloud (V2C) [9] structures give access to the vehicles to gain addi-
tional data via Application Program Interfaces (APIs) from the Internet.
5. Vehicle to Personal devices (V2P) [10] structures provide permission to share
information to any electronic device from vehicles.
The Controller Area Network (CAN) bus provides the protocol to establish these
types of communication between vehicles and the microcontroller component.
Figure 1 describes the vehicle communication through the Internet.

1.2 Controller Area Network Bus

A Controller Area Network (CAN) bus specifies the standard during communication
between the Electronic Control Units (ECUs) [12]. It provides a message which is
broadcasted to the destination from the transmitter. This message contains several
Ensemble Machine Learning Approach … 409

Fig. 1 Vehicle communication through the Internet [11]

Table 1 Standard CAN bus data frame format

Field | SOF | MID | RTR | R, DLC | Data | CRC | CRC DB | ACK | ACK DB | EOF
Bits  | 1   | 11  | 1   | 6      | 0–64 | 15  | 1      | 1   | 1      | 7

fields. The standard CAN bus data frame format [13] contains Start of Frame (SOF),
Message Identifier (MID), Remote Transmission Request (RTR), Reserved field (R),
Data Length Code (DLC), Data field, Cyclic Redundancy Check (CRC) sequence,
CRC Delimiter Bits (CRC DB), Acknowledgment (ACK) field, ACK Delimiter Bits
(ACK DB), and End Of Frame (EOF) field [13] as shown in Table 1.
The SOF and EOF fields denote the start and end of the remote and data frames.
The arbitration ID contains the message ID and the Remote Transmission Request
(RTR) bit to differentiate remote frames from data frames; it is used to recognize the
Electronic Control Unit (ECU) and also conveys the priority of the packet. The
Reserved field and DLC indicate the length of the message ID and the size of the
data. The Data field carries the payload of the data frame. The CRC field is used
during packet transmission to detect errors in the data packet, and the Acknowledgment
field confirms the arrival of an authentic CAN packet [14].

The data packet is broadcasted over the bus to all the ECUs, but it does not carry
any address information. As a packet contains an arbitration ID, devices can
transmit data packets to different devices. The CAN bus network allows the addition
of new nodes, and every node receives every packet, since messages carry no
sender address. Hence, harmful packets can easily be sent to the devices [15].
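The ID-based priority mentioned above can be sketched as follows. The bit-level rule (a dominant 0 beats a recessive 1 during arbitration) is standard CAN behavior, while the example IDs are hypothetical:

```python
# Sketch of CAN bus arbitration: nodes transmit the 11-bit arbitration
# ID bit by bit; a dominant 0 overwrites a recessive 1, so the lowest
# ID wins the bus.  This is why flooding ID 0x000 (a DoS attack)
# starves all legitimate traffic.
def arbitration_winner(pending_ids):
    """Return the ID that wins arbitration among simultaneously pending frames."""
    contenders = list(pending_ids)
    for bit in range(10, -1, -1):        # MSB first, 11-bit standard ID
        dominant = [i for i in contenders if not (i >> bit) & 1]
        if dominant:                      # a 0 bit beats all 1 bits
            contenders = dominant
    return contenders[0]

print(hex(arbitration_winner([0x123, 0x000, 0x7FF])))  # 0x0 wins
```

Bit-wise arbitration is thus equivalent to picking the numerically smallest pending ID, which directly explains the DoS attack described in the next subsection.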

1.3 Different Types of Attacks

In the CAN bus, any device can send data to or read data from any other device,
as devices do not verify the packets. Since there is no strict security protection, the
system can be attacked easily. The attacks that generally occur are as follows:
• Denial of Service (DoS) attack: In this attack, authorized users cannot access
the network facilities due to jams caused by unauthorized access to the network.
The attacker injects the highest-priority instruction (ID 0000) every 0.3
milliseconds; valid instructions do not get the chance to access the network in
time, as the priority of ‘0000’ is higher [16, 17].
• Fuzzy attack: In this attack, the attacker gathers information about IDs and then
sends packets with random IDs into the network; as a result, the vehicles start
behaving abnormally [16, 17].
• Spoofing attack: In this attack, the attacker steals information about particular
types of IDs and then injects messages with certain IDs, such as Revolutions Per
Minute (RPM) IDs or gear IDs, every 1 millisecond; the vehicles then start
behaving unexpectedly [16, 17].
Therefore, attack detection is important for in-vehicle communication to provide
security in this network so that the vehicles can communicate with each other without
any malicious attack and perform well.

1.4 Related Work

The attack detection of the CAN bus has become an important topic in academia as
well as in industry; therefore, many researchers have proposed different approaches
on this matter. Seo et al. [16] proposed a GAN-based machine learning technique
to detect intrusions in the system. They extracted useful information from CAN IDs
and converted those IDs into an image via one-hot encoding. In their proposed GAN
algorithm, they combined discriminators to differentiate attack information from
normal messages. Lee et al. [18] proposed a method of
sending the remote frame to a receiver having a particular identifier for detecting any
kind of attack on the CAN bus network. This strategy is based on CAN’s offset ratio
and time delay between request and reply messages. Each node has a constant offset
ratio of reply and time delay in a non-attack state; however, these parameters fluctuate
in an attack state. Xiao et al. [3] proposed the SIMATT-SECCU framework to detect
intrusion on the CAN bus and build a security control unit (SECCU) by combining
the benefits of LSTM units and GRUs, as well as the simplified attention model
(SIMATT) to reduce overhead in computation. Song et al. [17] proposed an attack
detection method using a deep convolutional neural network. Their proposed method
utilizes the Inception–ResNet model's structure while lowering the number and size
of layers in the architecture. For an IDS on the CAN bus, Tian et al. [19] employed
a machine learning approach called Gradient Boosting Decision Tree (GBDT) and
proposed a new entropy-based feature for the GBDT algorithm's feature construction.
Hu et al. [20] proposed robust anomaly detection using support vector machines. They
have applied their model to noisy data also. Hossain et al. [21] suggested an in-vehicle
CAN bus communications intrusion detection system based on LSTMs. They have
created new dataset by first extracting non-attack data from their experimental vehicle
and then adding attacks later. Barletta et al. [22] proposed an unsupervised Kohonen
Self-Organizing Map (SOM) network approach to detect intrusion for in-vehicle
communication networks. It is an Artificial Neural Network (ANN) that allows high-
dimensional data to be visualized on a two-dimensional map. Han et al. [23] presented
a method based on a survival analysis model to detect intrusions in vehicular
networks. Survival analysis is a statistical tool for determining which factors
influence an object's survival rate and duration. The suggested approach is
concerned with the survival rate of a single CAN ID in a chunk unit. Tariq et al.
[24] proposed an ensemble-based approach in which
they have used heuristics and Recurrent Neural Networks (RNNs) for predictions of
attacks. We have observed that these related works, which are quite complex, are
not able to achieve good accuracy for all types of attacks. In our work, we have
tried to achieve better accuracy for all types of attacks on Internet of vehicles
communication using a less complex model.

1.5 Contribution

The movement of autonomous vehicles has increased with the advancement of tech-
nology. Injection of wrong information into the CAN during vehicle communication
via the Internet and cloud can disrupt the movement of vehicles. Therefore, intrusion
detection in the CAN is an important task for a cybersecurity system. Thus, the main
objective of this paper is to propose an effective CAN bus attack detection algorithm
for maintaining the security of the vehicle network. Initially, the dataset has been
preprocessed; then the djb2 [25] hashing operation has been performed over the data,
transforming it into a new form by computing a hash value for each message. The
class-imbalanced nature of the dataset has also been checked, and an undersampling
operation has been applied to obtain a uniform distribution of non-attack and attack
data. Here, we have considered DoS attack, Fuzzy attack, Gear attack, and
412 A. Pramanik and A. K. Das

RPM attack. The datasets for these attacks are class imbalanced. Next, we have
applied our proposed method to the class-balanced dataset to classify normal data
from attacked data. The proposed method is an ensemble machine learning algorithm
that detects the various attacks with higher accuracy: it combines the KNN classifier
with the XGBoost classifier to predict attacks more accurately than the individual
classifiers can. We have performed ten-fold cross-validation over the dataset to
evaluate the performance of the proposed model. The model performs better than
other related works in detecting all types of attacks on in-vehicle communication,
and its architecture is also less complex.
The rest of the paper is organized as follows: Sect. 2 presents the preprocessing
of the dataset and the framework of our proposed model. The experimental results
for evaluating the model are given in Sect. 3, and finally, the conclusion and future
scope are discussed in Sect. 4.

2 Proposed Methodology

In the proposed method, an ensemble machine learning algorithm has been used.
The main purpose of using machine learning is to build, from a finite training dataset,
a generalized algorithm that can detect the attack on any new data. In the proposed
ensemble machine learning technique, we have stacked KNN with the XGBoost
algorithm. XGBoost was chosen because it supports regularization, which prevents
the model from overfitting, and performs gradient boosting by considering the
gradient in the loss function [26]. The algorithm also supports parallel processing.
XGBoost is then stacked with KNN for better performance of the proposed model
and a more accurate result. The motivation behind using KNN is that it requires no
training time and learns from the training data only at the time of making a
prediction [27].

2.1 Splitting and Mathematical Operation of XGBoost

This extreme gradient boosting algorithm performs classification by creating
decision trees from the data in the dataset. It creates branches of the tree by
calculating a similarity weight at each level. Based on the similarity weights, it
calculates a quality score for each branch separately by Eq. (7) [28] and builds the
decision tree by keeping the branches whose gain is higher than that of the
alternatives. It begins with an initial prediction, and the evaluation of the prediction
is made by introducing a loss function. Let us consider a dataset that contains m
training examples, each denoted by j. The trees are built by minimizing the
following regularized objective [28].
Ensemble Machine Learning Approach … 413


L(\theta) = \sum_{j=1}^{m} r(x_j, \hat{x}_j) + \sum_{n} \Omega(g_n)    (1)

where \Omega(g_n) = \alpha S + \frac{1}{2} \gamma \lVert v \rVert^{2}

The first term on the right-hand side of Eq. (1) is the loss function, generated by
the gap between the predicted value \hat{x}_j and the desired outcome x_j. The second
term is the regularization term \Omega, whose definition is written after Eq. (1): it is
used to avoid overfitting by smoothing the final learned weights, where v denotes
the vector of leaf weights, S represents the number of terminal nodes of a tree, and
\alpha indicates the user-definable penalty [28].
The model is instead trained in an additive fashion. Let \hat{x}_j^{(k)} denote the
prediction of the j-th sample at the k-th iteration; then g_k is added so as to minimize
the following objective [28]:

L^{(k)} = \sum_{j=1}^{m} r\left(x_j, \hat{x}_j^{(k-1)} + g_k(p_j)\right) + \Omega(g_k)    (2)
By a second-order Taylor expansion, Eq. (2) can be approximated as

L^{(k)} \approx \sum_{j=1}^{m} \left[ r\left(x_j, \hat{x}_j^{(k-1)}\right) + d_j g_k(p_j) + \frac{1}{2} f_j g_k^{2}(p_j) \right] + \Omega(g_k)    (3)

where d_j = \partial_{\hat{x}_j^{(k-1)}} r\left(x_j, \hat{x}_j^{(k-1)}\right) and
f_j = \partial^{2}_{\hat{x}_j^{(k-1)}} r\left(x_j, \hat{x}_j^{(k-1)}\right) are the first-
and second-order gradient statistics of the loss function [28]. In the next step, the
constant terms are removed from Eq. (3) to obtain the following reduced objective
(4) at step k:

L^{(k)} = \sum_{j=1}^{m} \left[ d_j g_k(p_j) + \frac{1}{2} f_j g_k^{2}(p_j) \right] + \Omega(g_k)    (4)

Equation (5) is obtained from Eq. (4) by substituting the definition of \Omega given
after Eq. (1):

L^{(k)} = \sum_{j=1}^{m} \left[ d_j g_k(p_j) + \frac{1}{2} f_j g_k^{2}(p_j) \right] + \alpha S + \frac{1}{2} \gamma \sum_{i=1}^{S} v_i^{2}    (5)

Now let J_i = \{\, j \mid p(q_j) = i \,\} represent the sample set of leaf i. For a fixed
tree structure p(q), the optimal weight v_i of leaf i is computed by the following
formula [28]:

v_i = -\frac{\sum_{j \in J_i} d_j}{\sum_{j \in J_i} f_j + \gamma}    (6)

Substituting Eq. (6) into Eq. (5) gives the optimal objective value, shown in Eq. (7):

-\frac{1}{2} \sum_{i=1}^{S} \frac{\left(\sum_{j \in J_i} d_j\right)^{2}}{\sum_{j \in J_i} f_j + \gamma} + \alpha S    (7)

Equation (7) is used to calculate the score that determines the structural quality of
a tree [28].
Now, let J_L and J_R be the sample sets of the left and right nodes after the split,
where J = J_L \cup J_R. Substituting into the gain [28], we get

L_{split} = \frac{1}{2} \left[ \frac{\left(\sum_{j \in J_L} d_j\right)^{2}}{\sum_{j \in J_L} f_j + \gamma} + \frac{\left(\sum_{j \in J_R} d_j\right)^{2}}{\sum_{j \in J_R} f_j + \gamma} - \frac{\left(\sum_{j \in J} d_j\right)^{2}}{\sum_{j \in J} f_j + \gamma} \right] - \alpha    (8)

Now, considering our dataset, we perform the construction of decision trees with
the XGBoost algorithm. The candidate trees are drawn on the DLC field of our
dataset, which takes the values 8, 5, and 2. Based on these values, the possible trees
are drawn: for example, one candidate tree rooted at DLC (8, 5, 2) splits into the
leaves (8, 5) and (2), while another splits into the leaves (5) and (8, 2). The
similarity weights are then calculated for every node of the left and right trees.
After calculating the scores by Eq. (7), the overall gain of each candidate split is
calculated by the formula of Eq. (8), and the split with the greater gain is kept. In
this way, the splitting operation is performed over the different attributes of the
dataset.
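As a concrete illustration (our own sketch, not code from the paper), the leaf quality score of Eq. (7) and the split gain of Eq. (8) can be evaluated directly once the per-sample gradients d_j and hessians f_j are known:

```python
def leaf_score(d, f, gamma):
    # Quality score of one leaf, the Eq. (7) term:
    # (sum of gradients)^2 / (sum of hessians + gamma)
    return sum(d) ** 2 / (sum(f) + gamma)

def split_gain(d_left, f_left, d_right, f_right, gamma, alpha):
    # Gain of a candidate split, Eq. (8):
    # 0.5 * (left score + right score - parent score) - alpha
    parent = leaf_score(d_left + d_right, f_left + f_right, gamma)
    return 0.5 * (leaf_score(d_left, f_left, gamma)
                  + leaf_score(d_right, f_right, gamma)
                  - parent) - alpha

# Example: four samples split two-and-two; gradients cancel in the parent,
# so separating them yields a positive gain
gain = split_gain([1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, 1.0],
                  gamma=0.0, alpha=0.0)
print(gain)  # 2.0
```

The split whose gain is largest is the one XGBoost keeps at each level of the tree.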

2.2 The Working Steps of Our Proposed Model

Data preprocessing: After collecting the data, we have performed preprocessing.
The dataset is tabular and contains the time stamp, message ID, DLC, and data
fields; the payload is stored in different columns according to the DLC size. In the
preprocessing step, the data fields are merged into a single column, and djb2 [25]
hashing is applied to that column. The main aim of this hashing operation is to
obtain a single hash value for the variable-size data of a particular message. As the
message ID is a hexadecimal number, it is converted into a decimal number. We
have then checked for missing values and removed the entire record of any
message ID containing a missing value. As the dataset is class imbalanced, an
undersampling operation has been applied over it to obtain a uniform distribution
of the data across classes.

Fig. 2 Work flow of the proposed methodology
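To make the hashing step concrete, a minimal sketch of the djb2 hash (Bernstein's classic string hash, h = h·33 + c seeded with 5381) applied to a merged data field might look as follows; the 32-bit mask and the string input format are our assumptions, not details given in the paper:

```python
def djb2(data: str) -> int:
    # Bernstein's djb2: start at 5381, then h = h*33 + character code
    h = 5381
    for ch in data:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # assumption: keep the value in 32 bits
    return h

# Merge the variable-length data bytes of one CAN message into one string,
# then hash it to a single numeric value (hypothetical payload shown)
payload = "".join(["05", "21", "68", "09", "21"])
hash_value = djb2(payload)
```

Replacing the variable-length payload with one numeric hash per message is what makes the data usable as a single feature column.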
Feature selection: In our dataset, the features are time stamps, arbitration ID, data
length code (DLC), and 8-byte data. The dataset contains labeled data. We have
selected all these features to train our model.
Building model: In our proposed work, we have used KNN stacked with the
XGBoost model to predict attacks during in-vehicle communication. Here, KNN
and XGBoost serve as the base (level-0) classifiers; in KNN, we have considered
seven neighboring training examples to predict the class. The combined predictions
of level 0 are then used at level 1 for the final prediction, where XGBoost is chosen
as the meta-classifier. Figure 2 represents the architecture of the proposed model.
We have performed ten-fold cross-validation on the dataset to obtain the
accuracy score of our proposed model. The data are split into ten folds; each fold
in turn serves as the test data, while the remaining nine folds are used for training.
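The stacking architecture above can be sketched with scikit-learn's `StackingClassifier`. Note the substitutions made for illustration: synthetic data stands in for the preprocessed CAN dataset, and sklearn's `GradientBoostingClassifier` stands in for XGBoost (the `xgboost` package's `XGBClassifier` would be the drop-in replacement); the seven-neighbor KNN and the ten-fold evaluation follow the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the hashed, undersampled CAN data
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Level 0: KNN (k = 7) and a boosted-tree model; level 1: boosted-tree meta-classifier
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=7)),
                ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0))],
    final_estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
)

# Ten-fold cross-validation, as in the paper
scores = cross_val_score(stack, X, y, cv=10)
print(scores.mean())
```

Each of the ten scores corresponds to one held-out fold; their mean is the reported cross-validation accuracy.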

3 Performance Evaluation

In this section, we present details of the performance of our proposed method.
We have applied it to the publicly available dataset1 and compared its performance
with other related work. The actual dataset has been transformed into a new form
by applying the djb2 hashing operation over the whole data. The hexadecimal data
of different lengths have been converted into numeric values, giving a single value
for the whole data of a particular

1 http://ocslab.hksecurity.net/Dataset/CAN-intrusion-dataset [16, 17].



Table 2 Comparison of accuracy (%) between our proposed method and other related work

Method                       | DoS attack | Fuzzy attack | Gear attack | RPM attack
DNN + Triplet                | 84         | 84           | 83          | 85
DNN + SVM                    | 78         | 75           | 78          | 78
DNN + Softmax                | 61         | 60           | 63          | 63
Reduced Inception–ResNet     | 86         | 85.6         | 86          | 86
XGBoost with KNN (proposed)  | 89         | 87           | 88          | 88

Table 3 Ten-fold cross-validation accuracy (%) of the proposed method for different types of attack on in-vehicle communication

Fold | DoS   | Fuzzy | Gear  | RPM
1    | 89.2  | 87.3  | 88    | 87.9
2    | 89    | 87    | 87.8  | 88.2
3    | 88.8  | 86.5  | 88.3  | 88.1
4    | 89.3  | 87    | 88    | 87.8
5    | 89    | 86.9  | 87.9  | 87.6
6    | 88.6  | 87.2  | 88.2  | 87.9
7    | 89.1  | 87.1  | 88.1  | 88.2
8    | 89    | 87.1  | 87.8  | 88.3
9    | 89    | 86.9  | 88    | 88.3
10   | 89.1  | 87.2  | 88.1  | 88
Mean | 89.01 | 87.02 | 88.02 | 88.03

message. The accuracy of the different methods has then been observed, as given
in Table 2.
In Table 2, it can be seen that the models built using a Deep Neural Network
(DNN) with triplet, SVM, and softmax classifiers have not performed well
compared to the reduced Inception–ResNet model and our proposed model for the
DoS, Gear, and RPM attacks. It can also be seen that the accuracy of the reduced
Inception–ResNet model is lower than that of our proposed model. Our proposed
method improves performance, with better accuracy for all types of attack in
vehicular communication: the proposed model has achieved 88% accuracy for the
Gear and RPM attacks, 87% for the Fuzzy attack, and 89% for the DoS attack.
Hence, we can say that our proposed model performs better than the mentioned
methods for the different types of attack that occur during message transmission
in a vehicular network.
Table 3 reports the accuracy scores of ten-fold cross-validation on the datasets
of the DoS attack, Fuzzy attack, Gear attack, and RPM attack, together with the
mean scores. We obtained mean accuracies of 89.01% for the DoS attack, 87.02%
for the Fuzzy attack, and about 88% each for the RPM and Gear attacks.
The drawback of our proposed method is that it takes a long time on large
datasets, although it performs well for all types of attacks during in-vehicle
communication. In future work, we aim to remove this drawback and apply the
method to other attacks, such as impersonation attacks on the CAN bus network,
and to other kinds of intrusion detection in the Internet of Things (IoT).

4 Conclusion

In this paper, we have discussed the vulnerability of the autonomous vehicle network
system and proposed an ensemble machine learning method to detect cyber attacks
on it. In our proposed method, we have used KNN and XGBoost as the base
classifiers, and we have chosen XGBoost as the meta-classifier. As the datasets for
the different attacks are class imbalanced, an undersampling operation has been
performed over them to create balanced datasets. We have compared our model's
performance with other researchers' work in this paper, and our model provides
better accuracy in detecting the attacks: the accuracy for the DoS, Gear, RPM, and
Fuzzy attacks reaches 89%, 88%, 88%, and 87%, respectively. In future work, the
method can be applied to detect other types of intrusion in the domain of
cybersecurity, and the computational time can also be reduced to detect attacks in
real time.

References

1. L.M. Ang, K.P. Seng, G.K. Ijemaru, A.M. Zungeru, Deployment of IOV for smart cities:
applications, architecture, and challenges. IEEE Access 7, 6473–6492 (2018)
2. J. Jang-Jaccard, S. Nepal, A survey of emerging threats in cybersecurity. J. Comput. Syst. Sci.
80(5), 973–993 (2014)
3. J. Xiao, H. Wu, X. Li, Internet of things meets vehicles: sheltering in-vehicle network through
lightweight machine learning. Symmetry 11(11), 1388 (2019)
4. W. Wu, Z. Yang, K. Li, Internet of vehicles and applications, in Internet of Things (Elsevier,
2016), pp. 299–317
5. L. Tuyisenge, M. Ayaida, S. Tohme, L.E. Afilal, Network architectures in internet of vehi-
cles (IOV): Review, protocols analysis, challenges and issues, in International Conference on
Internet of Vehicles (Springer, 2018), pp. 3–13
6. B. Ji, X. Zhang, S. Mumtaz, C. Han, C. Li, H. Wen, D. Wang, Survey on the internet of vehicles:
network architectures and applications. IEEE Commun. Stand. Mag. 4(1), 34–41 (2020)
7. A. Demba, D.P. Möller, Vehicle-to-vehicle communication technology, in 2018 IEEE Interna-
tional Conference on Electro/Information Technology (EIT) (IEEE, 2018), pp. 0459–0464
8. C. Wietfeld, C. Ide, Vehicle-to-infrastructure communications, in Vehicular Communications
and Networks (Elsevier, 2015), pp. 3–28

9. S. Rangarajan, M. Verma, A. Kannan, A. Sharma, I. Schoen, V2c: a secure vehicle to cloud


framework for virtualized and on-demand service provisioning, in Proceedings of the Interna-
tional Conference on Advances in Computing, Communications and Informatics (2012), pp.
148–154
10. F. Arena, G. Pau, An overview of vehicular communications. Fut. Internet 11(2), 27 (2019)
11. J. Joy, V. Rabsatt, M. Gerla, Internet of vehicles: enabling safe, secure, and private vehicular
crowdsourcing. Internet Technol. Lett. 1(1), e16 (2018)
12. K. Ismail, A. Muharam, M. Pratama, Design of can bus for research applications purpose hybrid
electric vehicle using arm microcontroller. Energy Procedia 68, 288–296 (2015)
13. J. Cook, J. Freudenberg, Controller area network (CAN). EECS 461, 1–5 (2007)
14. S.C. HPL, Introduction to the controller area network (CAN). Application Report SLOA101
(2002), pp. 1–17
15. S. Hartzell, C. Stubel, Automobile can bus network security and vulnerabilities (Seattle, Wash-
ington, 2017)
16. E. Seo, H.M. Song, H.K. Kim, Gids: Gan based intrusion detection system for in-vehicle
network, in 2018 16th Annual Conference on Privacy, Security and Trust (PST) (Aug 2018) ,
pp. 1–6
17. H.M. Song, J. Woo, H.K. Kim, In-vehicle network intrusion detection using deep convolutional
neural network. Vehicular Commun. 21, 100198 (2020)
18. H. Lee, S.H. Jeong, H.K. Kim, OTIDS: a novel intrusion detection system for in-vehicle network
by using remote frame, in 2017 15th Annual Conference on Privacy, Security and Trust (PST)
(IEEE, 2017), 57–5709
19. D. Tian, Y. Li, Y. Wang, X. Duan, C. Wang, W. Wang, R. Hui, P. Guo, An intrusion detec-
tion system based on machine learning for can-bus, in International Conference on Industrial
Networks and Intelligent Systems (Springer, 2017), pp. 285–294
20. W. Hu, Y. Liao, V.R. Vemuri, Robust anomaly detection using support vector machines, in
Proceedings of the International Conference on Machine Learning (Citeseer, 2003), pp. 282–
289
21. M.D. Hossain, H. Inoue, H. Ochiai, D. Fall, Y. Kadobayashi, LSTM-based intrusion detection
system for in-vehicle can bus communications. IEEE Access 8, 185489–185502 (2020)
22. V.S. Barletta, D. Caivano, A. Nannavecchia, M. Scalera, Intrusion detection for in-vehicle
communication networks: an unsupervised Kohonen Som approach. Fut. Internet 12(7), 119
(2020)
23. M.L. Han, B.I. Kwak, H.K. Kim, Anomaly intrusion detection method for vehicular networks
based on survival analysis. Veh. Commun. 14, 52–63 (2018)
24. S. Tariq, S. Lee, H.K. Kim, S.S. Woo, Detecting in-vehicle can message attacks using heuristics
and RNNs, in International Workshop on Information and Operational Technology Security
Systems (Springer, 2018), pp. 39–45
25. F. Breitinger, H. Baier, Similarity preserving hashing: eligible properties and a new algorithm
MRSH-v2, in International Conference on Digital Forensics and Cyber Crime (Springer, 2012),
pp. 167–182
26. C. Bentéjac, A. Csörgö, G. Martínez-Muñoz, A comparative analysis of xgboost (2019)
27. S.D. Jadhav, H. Channe, Comparative study of K-NN, Naive Bayes and decision tree classifi-
cation techniques. Int. J. Sci. Res. (IJSR) 5(1), 1842–1845 (2016)
28. T. Chen, C. Guestrin, Xgboost: a scalable tree boosting system, in Proceedings of the 22nd
ACM SIGKDD international conference on knowledge discovery and data mining (2016), pp.
785–794
Predictive Analytics of Engineering
and Technology Admissions

Sachin Bhoite, Punam Nikam, and Ajit More

Abstract An immense career without a quality education is just a DREAM. Degree,


specialization, college/University, and knowledge are key factors to achieve a great
career. Career-related decisions are discussed after the 10th standard and mostly
concluded after the 12th. After successful completion of 12th, the next target of any
student is to get into a suitable college/university for appropriate course/program so
that he can get a better education, guidance, and placement for his future. In this
research, exploratory data analysis (EDA), feature selection, label encoding, feature
scaling, normalization, and standardization are rigorously implemented on the dataset
using various Python libraries to prepare the dataset ready to apply machine learning
(ML) algorithms. We have built the ML models for the prediction of the college
by a testing suite of ML classification algorithms using the K-fold cross-validation
method and ensemble learning (EL) method. The suit of ML algorithms covers
logistic regression, K-nearest neighbors, decision tree classifier, random forest clas-
sifier, naïve Bayes, and support vector machine classifiers. Under EL, we have tested
adaptive boosting, gradient boosting, and grid search CV methods. Although EL
methods are popular and often give the best performance on predictive modeling
projects, we obtained the opposite result here: after comparison, we found that the
decision tree ML technique has the highest accuracy for the college prediction task.
In the end, the researchers have suggested a 'Free guide to engineering admission
aspirant parent and student (FGEAAPS)' Web module through which engineering
aspirant parents and students can take guidance for college selection.

Keywords Exploratory data analysis · K-fold cross-validation · Machine learning


algorithms · Ensemble learning

S. Bhoite (B) · P. Nikam


School of Computer Science, MIT WPU, Pune, Maharashtra, India
e-mail: sachin.bhoite@mitwpu.edu.in
P. Nikam
e-mail: punam.nikam@mitwpu.edu.in
A. More
IMED, Bharati Vidyapeeth, Pune, Maharashtra, India
e-mail: ajit.more@bharatividyapeeth.edu

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 419
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_33
420 S. Bhoite et al.

1 Introduction

Data mining and ML researchers have studied classification problems most


frequently [1], in which the value of a (categorical) attribute (the class) can be
predicted based on the values of other attributes (the predicting attributes) [2].
This paper aims to determine the features impacting the selection of engineering
colleges and guide the students to select engineering colleges for their first-year
admission. Most students and parents are spending unnecessary efforts, time, and
money on selecting the right engineering college for first-year admission. Hence,
here, the researchers have built a predictive model to guide the students about their
admissibility in the desired college and also suggest the college where they will get
admission.
This model will help to save time, money, and mainly confusion of predicting
the right alternatives for engineering education after 12th, which will also help in
arranging or planning future expenses. As our objective is to predict a list of possible
colleges where the student will get admission, it is a major classification problem.
Therefore, we used the logistic regression, K-nearest neighbors, decision tree,
random forest, naïve Bayes, and support vector machine supervised machine
learning algorithms, with both the K-fold cross-validation and train–test data
splitting techniques. The value of K was tested for the best result, though it is most
commonly set to 10. We also used EL techniques, which are comparatively faster
and often give better accuracy on classification projects.

2 Related Work

The researchers have studied various related national and international research
papers and thesis to understand objectives, type of algorithms used, datasets, data
pre-processing methods, features selection methods, etc. Kalathiya et al. [3] used
different machine learning algorithms to analyze students’ admission preferences.
They found random forest classifier is a good classifier as its accuracy is very high.
Khandale and Bhoite [4] used different machine learning models to analyze students’
placement in the early stage of their academics. They found an AdaBoost classifier
along with the bagging and decision tree as the base classifier as its accuracy is very
high. Nie et al. [2] used logistic regression (LR), SVM, random forest (RF), and
decision tree (DT) to propose a system for advanced forecasting of career choices
for college students based on campus big data. Random forest performs compara-
tively better than the other methods. Roy et al. [5] used advanced machine learning
algorithms like SVM, random forest decision tree, OneHot encoding, XGboost to
predict students’ careers. Out of all, SVM gave more accuracy than the XGBoost.
Waghmode and Jamsandekar [6] designed a framework for the expert system useful
for career selection. Machine learning algorithms ID3, PRISM, and PART give 100%
Predictive Analytics of Engineering and Technology Admissions 421

accuracy in classification along with rules. Gorad et al. [7] developed a Web
application based on a student's personality traits, interests, and capacity to take up
a course, which would help students with their careers. The prediction is done using
a decision tree algorithm, the C5.0 package. Borgavakar and Shrivastava [8] used a
k-means clustering algorithm to classify students' grades, on the basis of class tests,
mid-test, and final test, into three categories: 'High', 'Medium', and 'Low'.
Saikuswanth et al. [9] developed a system based on student marks in mathematics
and physics together with a set of questions; an expert system and an artificial
neural network are applied for student career assessment to analyze whether their
capabilities fit the job. Padmapriya [10] recommended the decision tree induction
data mining algorithm as the best compared to a naïve Bayesian classifier,
according to classification accuracy, misclassification rate, speed, and size, on
students' personal data, pre-college data, and undergraduate data to predict higher
education admissibility. Bibodi et al. [11] worked on predicting the university when
students apply to specific universities; random forest provided better accuracy than
the other algorithms, i.e., 90% accuracy. Sonawane and Dondio [12] explored
the system to help the students to find the best foreign universities/colleges based
on their performance. He used three algorithms: KNN (76% accuracy), decision tree
(80% accuracy), and logistic regression (68% accuracy). Aljasmi et al. [13] experi-
mented with machine learning models multiple linear regression, K-nearest neighbor,
random forest, and multilayer perceptron to predict the opportunity of a student to
get admitted to a master's program. Experiments showed that the multilayer
perceptron model surpasses the other models. Apoorva et al. [14] helped students
by providing an open-source machine learning model to know their chance of
admission into a particular university in the USA with high accuracy. Naveen et al.
[15] used different regression models and chose the best among them to advise
students in planning and choosing a career and joining a university of their choice.
The random forest gave good results among all the models.

3 Research Methodology

The proposed work was carried out by performing exploratory analysis and
experiments on various machine learning algorithms.

3.1 Exploratory Analysis

When we start to develop an ML model, it is essential to analyze the data first,
which has been done using statistical and visualization techniques. This brings our
focus to the important aspects of the data for further analysis. The process helps in
the following ways.

• Get an understanding of the statistical properties of the data, the schema of the
data,
• Get an understanding of the missing values and inconsistent data types.
• And also get an understanding of the predictive power of the data, such as the
correlation of features against target variable.
In the process of EDA, we have drawn and analyzed the following graphs [16].

Table 1 Plots for target variable as college name

S. No. | Plot name | Plot details
1      | Heatmap   | Heatmap with all the features
2      |           | Heatmap with positively related features
3      | Countplot | College name and gender-wise admissions
4      |           | College name and category-wise admissions
5      |           | College name and HSC marks-wise groups
6      |           | College name and merit marks-wise groups
7      |           | College name and home university-wise admissions
8      |           | College name and branch-wise admissions
9      | Boxplot   | Merit marks for college prediction
10     |           | HSC eligibility for college prediction
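The statistics behind these plots can be computed with pandas before visualization. The sketch below uses a hypothetical slice of the admission data (column names follow the dataset, but the values are illustrative only, and the seaborn heatmap/countplot/boxplot calls themselves are omitted):

```python
import pandas as pd

# Hypothetical pre-admission records (illustrative values only)
df = pd.DataFrame({
    "College Name": ["A", "A", "B", "B", "B"],
    "Gender": ["M", "F", "M", "M", "F"],
    "Merit Marks": [88.5, 72.0, 91.2, 65.0, 70.5],
    "HSC Eligibility": [82.0, 69.5, 88.0, 61.0, 66.0],
})

# Basis of the heatmap: correlation between the numeric features
corr = df[["Merit Marks", "HSC Eligibility"]].corr()

# Basis of the countplot: college name and gender-wise admission counts
counts = df.groupby(["College Name", "Gender"]).size()

# Basis of the boxplot: distribution of merit marks per college
summary = df.groupby("College Name")["Merit Marks"].describe()
```

Passing `corr`, `counts`, or the grouped marks to seaborn's `heatmap`, `countplot`, or `boxplot` produces the plots listed in Table 1.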

3.2 Algorithms Used

As per the objective of the research, we have used classification methods,
specifically the following classification algorithms of ML [17].
• Logistic Regression
• K-Nearest Neighbors
• Decision Tree
• Random Forest
• Support Vector Machine
• Naïve Bayes
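A sketch of how such a suite can be compared under K-fold cross-validation with scikit-learn (synthetic data stands in for the admission dataset; the hyperparameters shown are illustrative defaults, not the tuned values from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-admission dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
}

# 10-fold cross-validation accuracy for each candidate model
results = {name: cross_val_score(m, X, y, cv=10).mean()
           for name, m in models.items()}
```

Comparing the entries of `results` is how the best-performing classifier is chosen.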

4 Steps in Building Predictive Models Using Machine


Learning

We followed the cross-industry standard process (CRISP) methodology.



Understanding of problem and objectives of the research: It is important to
understand the students' datasets regarding the admission process and to select the
appropriate features for the college prediction objective.
Data Understanding: The students' pre-admission data required for the research
were collected. The different features of the dataset were analyzed based on their
importance and relevance. The dataset used for this research work is explained in
detail in Sect. 5.
Feature Engineering: In this phase, the data from multiple data sources were
integrated into two datasets. The data were then cleaned by removing unwanted
columns, handling missing values, creating unique classes, performing
transformations of numerical data, and carrying out all other cleaning activities.
The feature engineering used in this research work is explained in detail in Sect. 6.
Experimentation: Several machine learning algorithms were tested and experi-
mented with parameter tuning mentioned in Tables 4 and 5. Its results are discussed
in Sect. 9.
Evaluation: All the developed models were evaluated based on their performance
on the accuracy metric [15]. More information about the evaluation is presented in Sect. 8.
Result and Discussion: It is discussed in detail in Sect. 9.
Implementation: Once the model is finalized, it is used to evaluate unseen data,
which is discussed in Sect. 10.

5 About the Dataset

The researchers have collected the pre-admission hard-copy data of 27 engineering
colleges from the Joint Director, Technical Education, Pune, Maharashtra, India.
These were then converted into a CSV file with 14,766 records of 20 attributes,
ready to be read by Python code implementing the machine learning algorithms.
The dataset has the following 20 attributes.

Table 2 Columns of the pre-admission dataset

S. No. | Name of the feature | S. No. | Name of the feature
1      | 'Main Serial No.'   | 11     | 'Home University'
2      | 'Sr. No.'           | 12     | 'PH Type'
3      | 'College Name'      | 13     | 'Defence Type'
4      | 'College Code'      | 14     | 'HSC Eligibility'
5      | 'Merit No'          | 15     | 'Seat Type'
6      | 'Merit Marks'       | 16     | 'Fees Paid'
7      | 'Candidate Name'    | 17     | 'CAP Round'
8      | 'Gender'            | 18     | 'AdmittedLate'
9      | 'Candidate Type'    | 19     | 'BRANCH'
10     | 'Category'          | 20     | 'NATIONALITY'

6 Feature Engineering

In general, every machine learning algorithm takes some input data to generate
desired outputs. These input data are called features, which are usually presented as
structured columns. Algorithms require input features with specific characteristics
to produce the desired output, hence the need for feature engineering. Feature
engineering has two main goals:
• Generating the proper input dataset, as per the requirement of the machine learning
algorithm.
• Improving the performance of machine learning models.
In a machine learning project, data preparation is very important. The following
steps were carried out to make the dataset ready.
• Missing Values
• Handling categorical data (Label Encoder)
• Change in data type
• Drop columns
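These preparation steps can be sketched with pandas and scikit-learn; the miniature dataframe below uses the dataset's column names with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "College Name": ["A", "B", None, "A"],            # one missing value
    "Gender": ["M", "F", "F", "M"],                   # categorical data
    "Merit Marks": ["88.5", "72.0", "91.2", "65.0"],  # stored as strings
    "Sr. No.": [1, 2, 3, 4],                          # identifier, not predictive
})

df = df.dropna()                                           # handle missing values
df["Merit Marks"] = df["Merit Marks"].astype(float)        # change data type
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])  # label-encode categorical
df = df.drop(columns=["Sr. No."])                          # drop unwanted columns
```

After these steps, every remaining column is numeric and the frame can be fed directly to the classifiers.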

7 Feature Selection

Domain experts may not always be available to decide which independent features
predict the category of the target feature. Hence, before fitting the model, we must
make sure that all the selected features contribute properly to the model and that
the weights assigned to them are good enough for the model to give satisfactory
accuracy. For this, we have used three feature selection techniques: univariate
selection, recursive feature importance, and feature importance [18]. We used the
Python scikit-learn library to implement them.
The univariate selection method shows the highest scores for the following features.

Table 3 Univariate feature selection for college selection

Feature index | Feature name       | Feature score
9             | H.S.C. eligibility | 1.726485e+02
2             | Merit marks        | 1.345437e+02

While using the recursive feature importance method, the following features are
selected and the remaining are rejected.
Selected Features: [‘Candidate Type’, ‘Category’, ‘PH Type’, ‘Defense Type’,
‘HSC Eligibility’, ‘BRANCH’].
Built-in feature importance comes with tree-based classifiers; we used the
extra-trees classifier from the Python scikit-learn library to extract the top 7
features of the dataset (Fig. 1).
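A sketch of the three feature selection techniques on synthetic data (the paper's exact score function and feature counts may differ; `f_classif` is used here for the univariate step because the chi-square scorer requires non-negative inputs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

# 1. Univariate selection: score each feature independently against the target
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# 2. Recursive feature elimination: repeatedly drop the weakest feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# 3. Tree-based feature importance: rank features by an extra-trees classifier
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
top7 = np.argsort(forest.feature_importances_)[::-1][:7]
```

`kbest.scores_` gives the per-feature scores (as in Table 3), `rfe.support_` marks the retained features, and `top7` gives the indices plotted in Fig. 1.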

Fig. 1 Feature selection using feature importance for college selection

8 Experimentation

The models were experimented with using the K-fold cross-validation (K-FCV)
and train–test split (T–TS) methods, tuning different parameters. In this process,
the Python sklearn library has played a very important role; the details are given in
Table 4 [19].
Apart from the above methods, while doing parameter tuning, we have used the
ensemble algorithms listed in Table 5 [20].
After the discussion of the accuracy, a Web module named 'Free guide to
engineering admission aspirant parent and student (FGEAAPS)' has been
suggested, through which engineering aspirant parents and students can take
guidance for college prediction.

Table 4 List of experiments with model combinations

S. No. | Name of algorithm | Data splitting method used | Folds/ratio | Parameter tuned | Number of parameters tested
1 | Logistic regression | K-FCV | 3, 5, 10 | Label encoding; one-hot encoding | 7 to 15
1 | Logistic regression | T–TS | 70:30, 80:20, 90:10 | Label encoding; one-hot encoding | 7 to 15
2 | Support vector machine (SVC) | K-FCV | 3, 5, 10 | Estimator; param_grid | 7 to 15
2 | Support vector machine (SVC) | T–TS | 70:30, 80:20, 90:10 | Estimator; param_grid | 7 to 15
(continued)
426 S. Bhoite et al.

Table 4 (continued)

S. No. | Name of algorithm | Data splitting method used | Folds/ratio | Parameter tuned | Number of parameters tested
3 | Decision tree | K-FCV | 3, 5, 10 | max_depth; min_impurity_decrease; max_leaf_nodes; min_leaf_nodes; max_features | 7 to 15
3 | Decision tree | T–TS | 70:30, 80:20, 90:10 | max_depth; min_impurity_decrease; max_leaf_nodes; min_leaf_nodes; max_features | 7 to 15
4 | Random forest | K-FCV | 3, 5, 10 | max_depth; min_impurity_decrease; max_leaf_nodes; min_leaf_nodes; max_features | 7 to 15
4 | Random forest | T–TS | 70:30, 80:20, 90:10 | max_depth; min_impurity_decrease; max_leaf_nodes; min_leaf_nodes; max_features | 7 to 15
5 | Gaussian NB | K-FCV | 3, 5, 10 | – | 7 to 15
5 | Gaussian NB | T–TS | 70:30, 80:20, 90:10 | – | 7 to 15
6 | K neighbors classifier | K-FCV | 3, 5, 10 | leaf_size; n_neighbors | 7 to 15
6 | K neighbors classifier | T–TS | 70:30, 80:20, 90:10 | leaf_size; n_neighbors | 7 to 15

Table 5 List of experiments with advanced algorithms

S. No | Name of the algorithm | Data splitting method used
1 | AdaBoost classifier (DT) | T–TS
2 | Gradient boosting classifier | T–TS
3 | Grid search CV | T–TS

9 Result and Discussion

After cleaning all the data, removing all the noise, selecting the relevant features, and
encoding them into machine-learning-ready form, the next step is to build a predictive
model by applying various ML techniques to find the best model, i.e., the one that
gives the highest accuracy on both the training and test sets.
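The train/test accuracy comparison reported in Tables 6 and 7 can be sketched as a loop over the candidate classifiers; the synthetic data below is a stand-in for the admission dataset, so the numbers will not match the paper's.

```python
# Sketch of comparing candidate classifiers on train and test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "K neighbors": KNeighborsClassifier(),
}
# For each model: (train accuracy, test accuracy)
results = {name: (m.fit(X_tr, y_tr).score(X_tr, y_tr), m.score(X_te, y_te))
           for name, m in models.items()}
```

A large gap between the two accuracies in `results` is the overfitting signal the paper later uses to reject the gradient boosting model.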

Table 6 Results of ML techniques using K-fold cross-validation

S. No. | Algorithm | Train accuracy | Test accuracy
1 | Logistic regression | 0.5055861526357199 | 0.49155653450807635
2 | Support vector machine (SVC) | 0.7320220298977184 | 0.7173274596182085
3 | Decision tree | 0.9866247049567269 | 0.987885462555066
4 | Random forest | 0.32132179386309995 | 0.32929515418502203
5 | Gaussian NB | 0.9537372147915028 | 0.9610866372980911
6 | K neighbors | 0.7082612116443745 | 0.5895741556534508

Table 7 Ensemble learning results for college prediction objectives

S. No. | Algorithm | Train accuracy | Test accuracy
1 | AdaBoost classifier (DT) | 0.64 | 0.64
2 | Gradient boosting classifier | 1.00 | 0.99
3 | Grid search CV | 0.63 | 0.64

9.1 Model Selection for College Prediction

After implementing all the methods mentioned in Tables 6 and 7, we found the
decision tree to be the best classifier for predicting the probability list of top colleges
for any individual.
As we have seen, records from 27 colleges are present in our dataset, and with
college as the target variable there are 27 different classes. It is also observed that,
although each class has a different number of records in the dataset, the values of
the selected features are quite similar across classes. Hence, the dataset was treated
as balanced. This understanding is very important while selecting a model on the
basis of performance accuracy.
Using K-fold cross-validation, we found an accuracy of 0.98 for the decision tree
classifier for training as well as testing. Under EL, out of the 3 classifiers, the highest
accuracy was found for the AdaBoost classifier (DT), which is 0.64 for both train
and test, which is comparatively very low. The gradient boosting classifier model
was found to be overfitted, with 1.00 for training and 0.99 for testing accuracy. On
some datasets it is easy to obtain high accuracy because, although the difference
between classes is very high, the similarity among samples within a class is also
high. Hence, we have chosen the decision tree classifier to implement the model.

10 Implementation

This section describes the implementation of the module.

10.1 Free Guide to Engineering Admission Aspirant Parents
and Students (FGEAAPS)

To help with selecting a college for engineering admission, we have proposed the
following FGEAAPS Web module. The aspirant parent or student has to submit some
basic information such as HSC marks, merit marks, home university, branch, etc.
They then get a probability list of 3 colleges, ranked from higher to lower probability
of admission.

Fig. 2 College prediction Web module

11 Conclusion

In this research paper, to predict the college for an engineering admission, EDA,
feature selection, label encoding, feature scaling, normalization, and standardization
were rigorously applied to the dataset using various Python libraries to make the
dataset ready for the ML algorithms. For this study, 8 input features were selected
out of 20, namely ‘Merit Marks’, ‘Candidate Type’, ‘Category’, ‘Home University’,
‘PH Type’, ‘Defense Type’, ‘HSC Eligibility’ and ‘BRANCH’. These features are
very important according to the univariate selection, recursive feature importance
and Lasso feature selection methods. Massive EDA was performed by checking and
plotting the correlation of each input feature with the target feature. We built the ML
models for the prediction of the college by testing a suite of ML classification
algorithms and EL methods. The suite contains Logistic Regression, K Nearest
Neighbors, Decision Tree Classifier, Random Forest Classifier, Naive Bayes and
Support Vector Machine classifiers. Under EL we tested the Adaptive Boosting,
Gradient Boosting, and GridSearchCV methods. Although EL methods are popular
and often give the best performance on a predictive modeling project, we received
the opposite result. After comparison, the decision tree gave higher accuracy for
the college prediction project when compared with the other approaches. Finally,

the “A Free guide to engineering admission aspirant parent and student (FGEAAPS)”
Web module has been suggested, through which engineering aspirant parents and
students will get a probability list of 3 colleges, from higher to lower probability,
where they will get admission.
Also, it has been observed that the results improved further after feature
engineering.

References

1. D. Kabakchieva, K. Stefanova, V. Kisimov, Analyzing university data for determining student
profiles and predicting performance, in 4th International Conference on Educational Data
Mining (EDM 2011) (The Netherlands, 2011), pp. 347–348
2. M. Nie, L. Yang, J. Sun, H. Su, H. Xia, D. Lian, K. Yan, Advanced forecasting of career choices
for college students based on campus big data. Front. Comput. Sci.: Chin. Univ. 12(3), 494–503
(2018). https://doi.org/10.1007/s11704-017-6498-6
3. D. Kalathiya, R. Padalkar, R. Shah, S. Bhoite, Engineering college admission preferences based
on student performance. Int. J. Comput. Appl. Technol. Res. 8(09), 379–384 (2019). ISSN:
2319-8656. https://doi.org/10.7753/IJCATR0809.1009
4. S. Khandale, S. Bhoite, Campus placement analyzer: using supervised machine learning algo-
rithms. Int. J. Comput. Appl. Technol. Res. 8(09), 358–362 (2019). ISSN: 2319-8656.
https://doi.org/10.7753/IJCATR0809.1004
5. K. Sripath Roy, K. Roopkanth, V. Uday Teja, V. Bhavana, J. Priyanka, Student career prediction
using advanced machine learning techniques. Int. Mach. Learn. Algorithm J. Eng. Technol.
26–29 (2018)
6. M.L. Waghmode, P.P. Jamsandekar, Expert system for career selection: a classifier model. Int.
J. Adv. Res. Comput. Sci. Manage. Stud. (2016). ISSN: 2321-7782
7. N. Gorad, I. Zalte, A. Nandi, D. Nayak, Career counselling using data mining. Int. J. Innov.
Res. Comput. Commun. Eng. (2017). ISSN (Online): 2320-9801, ISSN (Print): 2320-9798
8. S.P. Borgavakar, A. Shrivastava, Evaluating student’s performance using k-means clustering.
Int. J. Eng. Res. Technol. (IJERT) (2017). ISSN: 2278-0181
9. V. Saikuswanth, G.K. Chaitanya, B. Sekhar Babu, U.L. Soundharya, Machine learning approach
for student’s career assessment in the modern world. Int. J. Recent Technol. Eng. (IJRTE) 7(6S)
(2019). ISSN: 2277-3878
10. A. Padmapriya, Prediction of higher education admissibility using classification algorithms.
Int. J. Adv. Res. Comput. Sci. Softw. Eng. (2012). ISSN: 2277128X
11. J. Bibodi, A. Vadodaria, A. Rawat, J. Patel (n.d.), Admission prediction system using machine
learning. California State University, Sacramento (2019). https://educationdocbox.com/735
72967-Homework_and_Study_Tips/Admission-prediction-system-using-machine-learning.
html
12. H. Sonawane, P. Dondio, Student admission predictor. School of computing, National College
of Ireland (2017). http://norma.ncirl.ie/3102/1/himanshumahadevsonawane.pdf
13. S. Aljasmi, A.B. Nassif, I. Shahin, A. Elnagar, Graduate admission prediction using machine
learning. Int. J. Comput. Commun. 14, 79–83 (2020). ISSN: 2074-1294
14. D.A. Chithra Apoorva, M.C. Nath, P. Rohith, S. Bindu Shree, S. Swaroop, Prediction for
university admission using machine learning. Int. J. Recent Technol. Eng. (IJRTE) 8(6) (2020).
ISSN: 2277-3878
15. N.S. Sapare, S.M. Beelagi, Comparison study of regression models for the prediction of post-
graduation admissions using machine learning techniques, in 11th International Conference on
Cloud Computing, Data Science and Engineering (Confluence), pp. 822–828 (2021). https://
doi.org/10.1109/Confluence51648.2021.9377162

16. M.L. Waskom, Seaborn: statistical data visualization. J. Open Source Softw. 6(60), 3021 (2021).
https://doi.org/10.21105/joss.03021
17. S. Bhoite, (2021). https://www.kaggle.com/drsachinbhoite/engineering-technology-students-
admission-data
18. P. Moulos, I. Kanaris, G. Bontempi, Stability of feature selection algorithms for classification
in high-throughput genomics datasets, in 2013 IEEE 13th International Conference on Bioin-
formatics and Bioengineering (BIBE), pp. 1–4 (2013). https://doi.org/10.1109/BIBE.2013.670
1677
19. T.O. Ayodele, Types of machine learning algorithms, in New advances in machine learning,
vol. 3, pp. 19–48 (2010)
20. X. Wang, G. Gong, N. Li, Automated recognition of epileptic EEG states using a combination
of symlet wavelet processing, gradient boosting machine, and grid search optimizer. Sensors
19(2), 219 (2019)
21. S. Ahmed, A. Zade, S. Gore, P. Gaikwad, M. Kolhal, Smart system for placement prediction
using data mining. Int. J. Res. Appl Sci. Eng. Technol. (IJRASET) 5 (2017). ISSN: 2321-9653.
www.ijraset.com
22. S. Bhoite, (2021). https://sbresearchproject.herokuapp.com/
Investigating the Impact of COVID-19
on Important Economic Indicators

Debanjan Banerjee, Arijit Ghosal, and Imon Mukherjee

Abstract COVID-19 has impacted the world unlike any other world event in our
recent memory; the entire humanity has been afflicted by this pandemic. As a conse-
quence of the pandemic, governments around the world decided to impose lockdowns
restricting economic interactions and relationships on a scale and in a form never
witnessed by modern man. The general assumption here is that growing COVID-19
patient and mortality counts give rise to a greater sense of uncertainty, and this
greatly impacts prices. It is thus imperative for the researcher community to observe
and investigate the influence of COVID-19 patient and mortality counts on geopolitical
and economic uncertainty indicators, as well as the influence of these COVID-19
indicators upon important economic indicators such as the gold price and stock
market prices. For this specific purpose, this work investigates the influence of
COVID-19 patient and monthly death counts on the economic indicators of gold
and stock market prices.

Keywords COVID-19 · Influence · Gold price · Geopolitical · Economic
uncertainty

1 Introduction

In the contemporary world, the COVID-19 pandemic has spread all over the world and
created a major health crisis, causing deaths and dislocation to millions of people
across the globe. This has led to severe lockdowns all over the world, resulting
in a near-complete halt of economic activity throughout the world. The current

D. Banerjee (B)
Sarva Siksha Mission, Kolkata, India
e-mail: debanjanbanerjee2009@gmail.com
A. Ghosal
St. Thomas’ College of Engineering and Technology, Kolkata, India
I. Mukherjee
Indian Institute of Information Technology, Kalyani, India
e-mail: imon@iiitkalyani.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 433
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_34
434 D. Banerjee et al.

work looks at investigating the impact of these severe economic disruptions on the
Indian economy. The current work makes the assumption that the rising monthly
COVID-19 positive and fatality counts influence geopolitical and economic
uncertainty, and that these COVID-19-related indicators also influence important
macroeconomic indicators such as the gold price, inflation and stock market prices. The
current work obtains monthly COVID-19 patient growth and monthly COVID-19
fatality counts from the Worldometer source, alongside geopolitical and economic
uncertainty indicators as well as gold, inflation and stock market prices from various
Internet sources. The current work performs linear regression between these
continuous variables to understand the influence of the COVID-19-related indicators on
the uncertainty and economic indicators.

2 Related Work

The researcher community has been investigating whether and how much the political
events of the day impact the macroeconomics of a country. Abdel-Latif et al. [1]
investigated the aspect of financial liquidity in the space of the primarily oil-producing
West Asian economies. Ahir et al. [2] came up with their own world uncertainty
index, based upon occurrences of the word "uncertainty" with respect to geopolitical
and economic considerations as described in the annually published International
Economic Unit country reports. Alqahtani et al. [3] worked upon the importance of
geopolitics with respect to the GCC countries. Antonakakis et al. [4] analyzed the
relation between geopolitics and its influence upon oil prices and stock prices.
Apergis et al. [5] discussed how geopolitical uncertainty impacts stock market
investment returns. Aysan et al. [6] discussed the impact of geopolitical uncertainty
on Bitcoin and other cryptocurrencies. Baker et al. [7] investigated the possible impact
of events such as US presidential elections or American foreign interventions
on US economic growth. Balcilar et al. [8] investigated the relation between
uncertainty and stock market risks among the BRICS countries. Banerjee
et al. [9] applied geopolitical risk to predicting the London Gold Price
fix prices.
Barro and Ursua [11], and similarly Barro [11] as well as Blum [13], have observed
the severe impact of unpredicted macroeconomic events on stock markets. Baur et
al. [12] analyzed the hedging relationship between geopolitical risk uncertainty and
precious metals. Caldara et al. [14] first came out with the geopolitical risk factor,
computed from the total number of references in news media.
Das et al. [15] utilized panel data techniques to understand how FDI impacts
labor productivity in the Indian banking sector. Jiaying and Stanislav [16] performed
panel data regression on multivariate data for econometric considerations. Saiz and
Simhonson [17] observed that specific newspaper items like terrorism and war
significantly impact Western economies, with the example of the USA.
Investigating the Impact of COVID-19 on Important Economic Indicators 435

3 Proposed Method

The present work investigates the influence of the COVID-19-related indicators, i.e.,
the monthly change in total COVID-19 positive counts and the monthly change in
total COVID-19 death counts, on geopolitical and economic uncertainty as well as
on the gold price and stock market prices.
• The work assumes that growth in the monthly COVID-19 positive and fatality counts
influences geopolitical and economic uncertainty as well as economic indicators
such as the gold price and stock market prices.
• The percentage change in the total monthly COVID-19 positive counts and
COVID-19 fatality counts is computed. These are continuous-type variables; thus,
it is possible to perform linear regression on these cross-sectional data. The
data are collected from the Worldometer online source.
• The geopolitical and economic uncertainty indicators are obtained from
online resources such as the Google search trends Web site and
policyuncertainty.com.
• The work also obtains important economic indicators such as the gold price and
stock prices.
• The work applies linear regression, first with the uncertainty indicators as the Y
variable and the COVID-19 indicators as the X variables, and then with the economic
indicators as the Y variable and the COVID-19 indicators as the X variables.
• The work operates on the null hypothesis that the COVID-19 indicators
do not have any influence over either the uncertainty or the economic indicators.
• Based upon the regression statistics, i.e., the p-values and F-statistic values, the null
hypothesis is either accepted or rejected: only if both the p-value and the F-statistic
value are less than 0.01 is the null hypothesis rejected; otherwise it is accepted.
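The decision rule in the last bullet can be sketched as a small helper; the 0.01 threshold is the one stated above, while the function name is ours:

```python
# Reject the "no influence" null hypothesis only when both the coefficient
# p-value and the F-statistic p-value fall below the 0.01 threshold.
def null_hypothesis_rejected(p_value, f_stat_p_value, alpha=0.01):
    return p_value < alpha and f_stat_p_value < alpha

print(null_hypothesis_rejected(0.004, 0.007))  # True: both below 0.01
print(null_hypothesis_rejected(0.04, 0.007))   # False: p-value too large
```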

3.1 COVID-19 Growth Counts Derivation

• The current work assumes that most of the uncertainty in the popular mind is created
by the rapid growth in COVID-19 patient and fatality counts.
• The COVID-19 monthly statistics have been derived from the Worldometer online
resource.
• The work obtains these data on a monthly basis from March 2020 to August 2021.
• This work utilizes the percentage change in the month-wise total COVID-19
patient and fatality counts.
• These computed variables are continuous-type data and are assumed to be normally
distributed.
• The continuous nature of these data allows linear regression.
• The current work computes the COVID-19 patient change percentage by the
general formula s * 100/t and the COVID-19 fatality change percentage as p * 100/r.
436 D. Banerjee et al.

• Here, s represents the total COVID-19 patient count in month (i + 1), and t represents
the total COVID-19 patient count in month i.
• Here, p represents the total COVID-19 fatality count in month (i + 1), and r repre-
sents the total COVID-19 fatality count in month i.

The equation to express COVID-19 positive monthly growth percentage is as follows:

mcpcp = (s ∗ 100)/t (1)

In the above Eq. 1, mcpcp represents the monthly COVID-19 positive change
percentage. The monthly COVID-19 death growth percentage can be expressed
as follows:

mcdcp = ( p ∗ 100)/r (2)

In Eq. 2, mcdcp represents the monthly COVID-19 fatality change percentage.
Equivalently, if covdeath_t denotes the total COVID-19 death count in month t and
covdeath_(t+1) the total COVID-19 death count in month (t + 1), then
covdeathpercentage, which represents the monthly growth percentage in total COVID-19
death cases, is derived from these two variables.
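Eqs. 1 and 2 can be sketched as a single helper applied to consecutive monthly totals; the numbers below are made up, not Worldometer data:

```python
# Implements s*100/t for each consecutive pair of monthly totals,
# as in Eqs. (1) and (2); works for both patient and death counts.
def monthly_change_percentage(totals):
    return [nxt * 100 / cur for cur, nxt in zip(totals, totals[1:])]

patients = [100, 150, 300]  # hypothetical monthly cumulative totals
print(monthly_change_percentage(patients))  # [150.0, 200.0]
```

A value above 100 means the total grew relative to the previous month, which is the sense in which the paper treats these percentages as growth indicators.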

4 Geopolitical and Economic Uncertainty Features

The present work utilizes monthly geopolitical uncertainty and economic uncertainty
indicators, applying the regression technique to determine causation between the
geopolitical and economic uncertainty factors and important economic indicators
such as the gold price, stock market prices and cryptocurrency prices.

• Geopolitical risk index: This work obtains the count of total utterances of the
term "geopolitical uncertainty" from January 2020 till May 2021 using the Google
Trends Web data analytics online resource. This count has been considered by the
current work as the geopolitical risk index. We calculate the monthly growth
percentage of the geopolitical risk index variable.
• Economic uncertainty index: This work utilizes the economic uncertainty index
by Baker et al. [7]. We calculate the monthly growth percentage of the economic
uncertainty index, since it is a continuous-type variable assumed to be normally
distributed, and is therefore important for causation analysis with respect to the
economic indicators: gold price, stock market and cryptocurrency prices.

4.1 Geopolitical Risk and Economic Risk Indicator Growth
Percentage Derivation

In this subsection, the formulae through which the geopolitical and economic risk
indicator growth percentages have been derived are discussed.

economic risk indicator change percentage = (economic rm(t+1) * 100) / economic rm(t)    (3)

• where economic rm(t) = economic risk in month t
• where economic rm(t+1) = economic risk in month (t + 1)

In Eq. 3, the economic risk change percentage is defined as a percentage relating
the economic risk in month (t + 1) to the economic risk in month t. Eq. 3 is valid
if the economic risk in month (t + 1) is greater than the economic risk in month t;
otherwise, we obtain the economic risk change percentage from the following Eq. 4:

economic risk change percentage = (economic rm(t) * 100) / economic rm(t+1)    (4)

In Eq. 4, the following hold:

• where economic rm(t) = economic risk in month t
• where economic rm(t+1) = economic risk in month (t + 1)

geopolitical risk indicator change percentage = (geopolitical rm(t+1) * 100) / geopolitical rm(t)    (5)

• where geopolitical rm(t) = geopolitical risk in month t
• where geopolitical rm(t+1) = geopolitical risk in month (t + 1)

In Eq. 5, the geopolitical risk change percentage is defined as a percentage relating
the geopolitical risk in month (t + 1) to the geopolitical risk in month t. Eq. 5 is valid
if the geopolitical risk in month (t + 1) is greater than the geopolitical risk in month t;
otherwise, we obtain the geopolitical risk change percentage from the following Eq. 6:

geopolitical risk change percentage = (geopolitical rm(t) * 100) / geopolitical rm(t+1)    (6)

• where geopolitical rm(t) = geopolitical risk in month t
• where geopolitical rm(t+1) = geopolitical risk in month (t + 1)
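The case split in Eqs. 3 to 6 amounts to always dividing the larger of the two consecutive monthly values by the smaller one, so it can be sketched as one function shared by the economic and geopolitical series (the input values below are illustrative):

```python
# Eqs. (3)/(5) when the month-(t+1) value is larger; Eqs. (4)/(6) otherwise.
def risk_change_percentage(rm_t, rm_t_plus_1):
    if rm_t_plus_1 > rm_t:
        return rm_t_plus_1 * 100 / rm_t
    return rm_t * 100 / rm_t_plus_1

print(risk_change_percentage(80, 120))  # 150.0 (risk rose month over month)
print(risk_change_percentage(120, 80))  # 150.0 (risk fell by the same ratio)
```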

5 Experimental Results for the COVID-19 Features

This work performs the regression techniques with the help of the R-programming
language library lm. The lm is utilized since all the four main independent and
influencing variables do have continuous type, and therefore, using linear regression
technique in this case will be very useful. The target variable that has been used in
these regression techniques is the Bombay Nifty Gold Price, BSE Sensex, Bombay
Stock Exchange cryptocurrency price. Table 1 depicts all the results obtained by
experimenting with the above mentioned techniques.
The work performs regression based upon the following criteria.
• The work begins the regression procedures with a null hypothesis, applicable to
all the features: the utilized feature does not influence the various target variables,
which in this case are the BSE gold price, BSE Sensex and BSE cryptocurrency
price.
• The null hypothesis can be rejected when two given conditions hold, i.e., the
probability value and the F-statistic probability value both have a value less than
0.01; in that case the alternate hypothesis can be accepted.
• The target variables in these regression procedures are the BSE gold price, BSE
Sensex and BSE cryptocurrency price, respectively.
• The work first considers the Y variable to be the Bombay Stock Exchange gold
price (monthly average value), and the X variables are the COVID-19 positive
case monthly growth percentage, COVID-19 death count monthly growth
percentage, monthly geopolitical risk indicator value and monthly economic policy
indicator value, respectively.
• The formula in this case is (Y ~ X1 + error), where X1 is the COVID-19 positive
case monthly growth percentage.
• The regression is fitted with lm, which comes from an R programming language
library.
• Once the above regression statistics are gathered, particularly the probability
p-value and the F-statistic value, the work applies the same formulae using the
other explanatory variables.
• The formulae in this case are (Y ~ X2 + error), (Y ~ X3 + error) and (Y ~ X4
+ error), where Y represents the Bombay Stock Exchange gold price. X2, X3
and X4 are the COVID-19 death count monthly growth percentage, monthly
geopolitical risk indicator value and monthly economic policy indicator value,
respectively. Similarly, for these variables also we obtain the probability p-value
and the F-statistic values.
• Once the regression statistics are obtained in the above-mentioned manner, the
work introduces the BSE Sensex and BSE cryptocurrency price as Y variables,
respectively: the Y variable is first the BSE Sensex and then the BSE
cryptocurrency price.
• The p-values and F-statistic values are always calculated.

Table 1 Correlation between COVID-19 factors and economic uncertainty factors

X-feature | Y-feature | Correlation value
COVID-19 patient count growth percentage | Geopolitical uncertainty index | 0.2134
COVID-19 death count growth percentage | Geopolitical uncertainty index | 0.1719
COVID-19 patient count growth percentage | Economic uncertainty index | 0.206273
COVID-19 death count growth percentage | Economic uncertainty index | 0.106273
COVID-19 patient count growth percentage | Gold price | 0.1038
COVID-19 death count growth percentage | Gold price | 0.0013

• All the above-explained data points have been depicted as observed during the
experiments with the aforementioned variables.
• The software tool R has been utilized for this purpose.
• All data utilized during this regression process are open-source data.
• Given the nature of the regression, this is clearly a case of cross-sectional data
being used for regression purposes.
• Cross-sectional regression has been performed since all the variables involved on
this occasion, dependent as well as independent, are continuous, and moreover no
time-related variables are involved in these experiments.
• All the geopolitical risk indicators have been computed from the complete monthly
search counts of the terms "geopolitical" and "uncertainty" as shown in Google
search trends for India as a country during the months from January 2020 till June
2021.
• All the cross-sectional regressions have been rechecked by performing the same
equations using Excel sheet-based regression formulae. Only after similar results
were found between the Excel formulae and the R programming tools did we
come to a conclusion and publish it.
• It can be derived from Table 1 that the COVID-19-related indicators are weakly
correlated with the geopolitical and economic uncertainty indexes, as in both cases
the correlation value is less than 0.3.
• The conclusion that can be drawn from Table 1 is that the COVID-19-related
indicators are also weakly correlated with the gold price, as in both cases the
correlation value is less than 0.3.
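The paper fits these regressions with R's lm(); for a single regressor, the same slope and p-value can be cross-checked in Python with scipy.stats.linregress. The series below are synthetic stand-ins for the monthly indicator data, and note that with one regressor the F-statistic p-value coincides with the coefficient p-value.

```python
# Sketch only: random series replace the actual monthly indicator data.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
covid_growth = rng.normal(size=18)                     # X: monthly growth %
gold_price = 0.1 * covid_growth + rng.normal(size=18)  # Y: weakly related

res = linregress(covid_growth, gold_price)
slope, p_value = res.slope, res.pvalue
r_squared = res.rvalue ** 2
# Decision rule from Sect. 3: reject "no influence" only if p-value < 0.01
influences_target = p_value < 0.01
```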

6 Discussion

The current work concludes from the observations depicted in Table 2 that the
following features do not influence the target variables, since:
• COVID-19 patient growth percentage has probability 1.0125 and F-statistic prob-
ability 3.2071 when the target variable is BSE gold price. Both of these values are
more than 0.01.
• COVID-19 death count growth percentage feature has probability 1.1192 and F-
statistic probability 7.1296 when the target variable is BSE gold price. Both of
these values are more than 0.01.

Table 2 Statistical relationship between COVID-19 factors and economic indexes

Feature | Slope | P-value | R² error | F-statistic p-value | Target
COVID-19 patient count growth percentage | 1.683 | 1.0125 | 0.107 | 3.2071 | BSE gold price
COVID-19 death count growth percentage | 3.605 | 1.1192 | 0.001 | 7.1296 | BSE gold price
COVID-19 patient count growth percentage | 3.683 | 7.0125 | 0.007 | 2.1271 | BSE Sensex price
COVID-19 death count growth percentage | 2.605 | 9.1742 | 0.711 | 6.0106 | BSE Sensex price
COVID-19 patient count growth percentage | 4.683 | 8.0692 | 0.507 | 7.4071 | BSE cryptocurrency price
COVID-19 death count growth percentage | 2.129 | 4.1802 | 0.238 | 7.6496 | BSE cryptocurrency price
COVID-19 patient count growth percentage | 1.683 | 6.0178 | 0.601 | 3.2671 | Inflation India
COVID-19 death count growth percentage | 3.014 | 5.9198 | 0.001 | 9.1071 | Inflation India

• COVID-19 patient growth percentage has probability 7.0125 and F-statistic prob-
ability 2.1271 when the target variable is BSE Sensex price. Both of these values
are more than 0.01.
• COVID-19 death count growth percentage feature has probability 9.1742 and F-
statistic probability 6.0106 when the target variable is BSE Sensex price. Both of
these values are more than 0.01.
• COVID-19 patient growth percentage has probability 8.0692 and F-statistic prob-
ability 7.4071 when the target variable is BSE cryptocurrency price. Both of these
values are more than 0.01.
• COVID-19 death count growth percentage has probability 4.1802 and F-statistic
probability 7.6496 when the target variable is the BSE cryptocurrency price. Both
of these values are more than 0.01.
• COVID-19 patient growth percentage has probability 6.0178 and F-statistic prob-
ability 3.2671 when the target variable is inflation India. Both of these values are
more than 0.01.
• COVID-19 death count growth percentage feature has probability 5.9198 and F-
statistic probability 9.1071 when the target variable is inflation India. Both of these
values are more than 0.01.

6.1 Geopolitical and Economic Uncertainty

The current work also concludes from the observations depicted in Table 3 that the
following features do not influence the target variables, since:

• COVID-19 patient growth percentage has probability 5.1984 and F-statistic prob-
ability 6.7812 when the target variable is the geopolitical uncertainty index. Both
of these values are more than 0.01.
• COVID-19 death count growth percentage has probability 8.1192 and F-statistic
probability 7.6957 when the target variable is the geopolitical uncertainty index.
Both of these values are more than 0.01.
• COVID-19 patient growth percentage has probability 2.0125 and F-statistic prob-
ability 2.6273 when the target variable is the economic uncertainty index. Both
of these values are more than 0.01.
• COVID-19 death count growth percentage has probability 4.1179 and F-statistic
probability 4.0109 when the target variable is the economic uncertainty index.
Both of these values are more than 0.01.

Table 3 Statistical relationship between COVID-19 factors and geopolitical and economic uncertainty

Feature | Slope | P-value | R² error | F-statistic p-value | Target
COVID-19 patient count growth percentage | 7.683 | 5.1984 | 0.916 | 6.7812 | Geopolitical uncertainty index
COVID-19 death count growth percentage | 3.605 | 8.1192 | 0.001 | 7.6957 | Geopolitical uncertainty index
COVID-19 patient count growth percentage | 3.683 | 2.0125 | 0.007 | 2.6273 | Economic uncertainty index
COVID-19 death count growth percentage | 2.605 | 4.1179 | 0.711 | 4.0109 | Economic uncertainty index

7 Conclusion

A very nuanced and careful observation of the impact of COVID-19 features such
as the total COVID-19 positive count and the total COVID-19 death count indicates
that they are not very influential when it comes to important target variables such
as geopolitical and economic uncertainty, as well as important economic indicators
such as the BSE Sensex price, gold and cryptocurrency prices and inflation.
It can be deduced that initially people were apprehensive and anxious about the
uncertain nature of the pandemic; however, people gradually adapted to this anxiety
as a new normal. Therefore, although initially the COVID-19 indicators did influence
the economic and uncertainty indicators, in the sense that as the COVID-19 indicators
increased the other indicators also increased, as people later adapted to the
uncertainty of the situation and the government came up with economic incentives,
the COVID-19 indicators began to lose influence over the uncertainty and economic
indicators.
However, since COVID-19 is still producing many variants according to the
World Health Organization, it is important that further information is collected from
around the world to ensure that a timely and accurate picture can be inferred from
the rapidly expanding base of COVID-19-related information, especially the total
patient counts and total death counts and their influence over important economic
indicators such as the BSE Sensex price, gold and cryptocurrency prices and
inflation.
Investigating the Impact of COVID-19 on Important Economic Indicators 443

8 Future Work

Future work would investigate additional COVID-19 indicators beyond those discussed in the present work, such as monthly growth of the R-factor and hospitalization rates, for their influence on geopolitical and economic uncertainty indicators in India. Considering the huge population of India, region-wise data collection and regression need to be considered. More diverse sources of data are also important for further investigation.

References

1. H. Abdel-Latif, M. El-Gamal, Financial liquidity, geopolitics, and oil prices. Energy Econ. 87,
104482 (2020)
2. H. Ahir, N. Bloom, D. Furceri, The world uncertainty index. The World Uncertainty, pp. 1–37 (2018)
3. A. Alqahtani, E. Bouri, X.V. Vo, Predictability of GCC stock returns: the role of geopolitical risk and crude oil returns. Econ. Anal. Policy 171–181 (2020)
4. N. Antonakakis, R. Gupta, C. Kollias, S. Papadamou, Geopolitical risks and the oil-stock nexus over 1899–2016. Fin. Res. Lett. 23, 165–173 (2017)
5. N. Apergis, M. Bonato, R. Gupta, C. Kyei, Does geopolitical risks predict stock returns and
volatility of leading defense companies? Evidence from a nonparametric approach. Defence
Peace Econ. 29(6), 684–696 (2018)
6. A.F. Aysan, E. Demir, G. Gozgor, C.K.M. Lau, Effects of the geopolitical risks on bitcoin
returns and volatility. Res. Int. Bus. Finance 47, 511–518 (2019)
7. S.R. Baker, N. Bloom, S.J. Davis, Measuring economic policy uncertainty. Q. J. Econ. 131(4), 1593 (2016)
8. M. Balcilar, M. Bonato, R. Demirer, R. Gupta, Geopolitical risks and stock market dynamics
of the BRICKs. Econ. Syst. 42(2), 295–306 (2018)
9. D. Banerjee, A. Ghosal, I. Mukherjee, Prediction of gold price movement using geopolitical
risk as a factor, in Emerging Technologies in Data Mining and Information Security (Springer,
Singapore, 2019), pp. 879–886
10. R.J. Barro, J.F. Ursua, Rare macroeconomic disasters. Ann. Rev. Econ. 4(1), 83–109 (2012)
11. R.J. Barro, Rare disasters and asset markets in the twentieth century. Q. J. Econ. 121(3), 823–
866 (2006)
12. D.G. Baur, L.A. Smales, Hedging geopolitical risk with precious metals. J. Bank. Fin. 117,
105823 (2020)
13. N. Bloom, The impact of uncertainty shocks. Econometrica 77(3), 623–685 (2009)
14. D. Caldara, M. Iacoviello, Measuring geopolitical risk. Board of Governors Working Paper (2016)
15. G. Das, B. Ray Chaudhuri, Impact of FDI on labour productivity of Indian IT firms: horizontal spillover effects. J. Soc. Manag. Sci. XLVII(2) (2018)
16. G. Jiaying, V. Stanislav, Panel data quantile regression with grouped fixed effects. J. Econ.
213(1), 68–91 (2019)
17. A. Saiz, U. Simonsohn, Proxying for unobservable variables with internet document frequency.
J. Eu. Econ. Assoc. 11(1), pp. 137–165 (2013)
Classification of Tumorous and Non-tumorous Brain MRI Images Based on a Deep-Convolution Neural Network Model

Debkumar Chowdhury, Sanjukta Mishra, Sonu Kumar, Shiwam Kumar Prasad, Sourav Kumar Mandal, Gourab Biswas, Devesh Sharma, Vishal Lohia, and Kartik Sau

Abstract A brain tumor is one of the most critical diseases, caused by unwanted cell division. The survival rate for a brain tumor is very high if it is identified at the grade I stage and early treatment is started. Different well-appreciated techniques, introduced by various researchers, are available to classify tumorous and non-tumorous MRI images. Still, these techniques are not sufficient to meet the current requirement. To satisfy the present requirement, we propose a novel reinforcement learning-based deep-convolution neural network (CNN) model to identify tumors in brain MRI images. In this technique, the region of interest is extracted, and labeling of each image is performed. This step is followed by a preprocessing (noise removal) technique. Then, we consider a standard dataset and segregate it into three classes: 70% for training, 15% for validation, and 15% for testing. Based on the layered architecture of the deep-convolution neural network model, the hyper-parameters of our proposed method are calculated for the training dataset. This step is followed by the validation and testing steps. The combination of reinforcement learning and deep learning while designing the CNN layered architecture makes our model unique. After considering all the cases, we obtained commendable results compared to other methods. The novelty and high-performance capability of this method can be appreciated. This method may be implemented as a mobile or Web application.

Keywords Deep learning · Convolution neural network (CNN) · Image classification

D. Chowdhury (B) · S. Kumar · S. K. Prasad · S. K. Mandal · G. Biswas · D. Sharma · V. Lohia · K. Sau
University of Engineering and Management, Kolkata 700040, India
K. Sau
e-mail: kartik.sau@uem.edu.in
S. Mishra
Brainware University, Kolkata, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 445
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_35

1 Introduction

The normal brain contains multiple brain tissues and segments [20] (shown in Fig. 1a). Abnormal growth of brain tissue is known as a brain tumor [3]. Brain tumors vary (shown in Fig. 1b) in type (primary, secondary, etc.) and grade (in terms of aggressiveness), which determine their viability. Proper treatment of a brain tumor can be started based on an understanding of its location, the generation point of the tumor cell, and its type. Damage to different areas of brain tissue is an adverse effect of brain tumors. Non-identification at the initial stage and carelessness in treatment may increase the mortality rate. Patients often experience symptoms such as headaches, loss of consciousness, and vomiting [3]. 4% of the world's cancer population suffers from brain and central nervous system (CNS) cancer. As per World Cancer Research Journal statistics, 12.7 males and 10.7 females per 100,000 people in the world [17] are affected by this type of cancer. It is a predominant issue for society, and an automated computer-based model is required to increase the precision of brain tumor classification and detection for doctors. As a result, it will help doctors start early and accurate treatment so that the survival rate of patients can be increased. We found that many of the previous methods are incomplete. Some of them operate up to the brain segmentation level only, some perform feature selection only, and some perform only classification. Hence, to provide a completely automated method for doctors, we use the concepts of reinforcement learning [1] and a deep learning-based CNN to build a model for image classification [2]. The paper is organized into several sections. Section 2 describes the main methodology, merits, and demerits of some previous research papers; Sect. 3 describes our proposed model in detail, explaining how it overcomes the difficulties mentioned in Sect. 2; and the experimental results and conclusion are presented in Sects. 4 and 5, respectively.

(a) Brain structure (b) Tumorous Brain MRI (c) Non-Tumorous Brain MRI

Fig. 1 Brain structure, brain tumorous, and non-tumorous MRI



2 Literature Survey

In this section, we present the different existing methods with their advantages and disadvantages in a chronological table (Table 1) that displays different brain tumor classification and detection strategies, so that we can compare them, identify their drawbacks, and design our method to overcome most of them.
In older methodologies, the tumor region is detected using manually developed features. Detection of gliomas or glioblastomas is rather difficult due to their rarely discerning features, and such generic features rarely work in reality. To tackle this problem, we propose an end-to-end reinforcement learning-based deep learning pipeline that learns the features iteratively for better performance on the given task.

3 Proposed Methodology

For brain MRI classification and detection, we propose a newly developed reinforcement learning-based deep-CNN strategy. Our proposed technique is divided into the following steps:

3.1 Image Extraction from the Dataset

To extract the images from the selected dataset, we perform the following steps: (a) setting the augmented path, (b) loading the data. We set the augmented path from the directory from which we receive the augmented data. The "yes" folder in the directory contains the tumorous brain MRI images, and the "no" folder contains the non-tumorous brain MRI images (shown in Figs. 1c and 5); both contain doctor-identified samples and non-identified samples.

Table 1 Brain MRI detection—chronological table

| Methodology | Advantages | Disadvantages |
| Artificial neural network (ANN) [4] | More applicable to astrocytoma type of cancer | Small dataset, limited feature selection, and inability to identify other types of tumors or cancers |
| Computer-aided system [5] | Identifies and segments the brain tumor | Cannot classify all categories of tumors |
| Watershed and edge detection algorithms [6] | Can segment brain tumor | Dataset consists of only 20 images, and performance accuracy is not calculated |
| Discrete cosine transform and probabilistic neural network (DCT-PNN) [7] | Faster and high recognition rate, high-speed processing, and low computational requirements | Dataset consists of only 20 images |
| Hidden Markov random fields and threshold techniques [8] | High segmentation accuracy | Dataset is limited and cannot calculate the size of the brain tumor |
| Fuzzy c-means and support vector machine (SVM) [9] | Provides accurate and more effective results with an average of 87.66% accuracy | The accuracy rate is not satisfactory; the dataset is limited; there is scope for improving data training |
| K-means and fuzzy C-means clustering algorithm [10] | Can detect tumors with multiple intensities | Not applicable to all types, formats, and sizes of brain MRI images; dataset is limited |
| Synergetic neuro-fuzzy [11] | Simultaneous selection of features and prediction of the tumor gradation is achieved | The accuracy score is very limited (approx. 86%), and the dataset size is very small |
| SVM [12] | Successfully classifies abnormal brain tumors in the LGG and HGG | Multiple features are missing for accurate classification, and the dataset size is small |
| Wavelet transforms and SVM [13] | The proposed method has significant computational advantages and can detect and classify brain tumors | Longer computational time and limited size of dataset |
| Naive classifier [14] | Brain tumors are predicted with accuracy | Very low accuracy score (approx. 84%), and dataset size is limited |
| K-means clustering and artificial neural network (ANN) approach [15] | Provides a good accuracy rate | Not applicable to all types, formats, and sizes of brain MRI images; dataset is limited |

To load the data, we use the loading_data_up algorithm based on Eqs. (1) and (2). This algorithm takes two arguments: the first is a list of directory paths for the folders "yes" and "no" that contain the image data, and the second is the image size. For every image in both directories, it does the following:

3.2 Loading Labels and Pre-Image Processing

This step starts with the cropping_brain_contour_generation algorithm (discussed below) based on Eqs. (3), (4), (5), and (6), which is responsible for cropping only the brain portion of the MRI. A cropping technique is applied to find the extreme bottom, left, top, and right points. This algorithm also preprocesses the image by removing noise. This step is followed by finding the largest contour and its extreme points, and then cropping the image.

3.3 Segregating Tumorous and Non-Tumorous Dataset and Data Partitioning

In this section, we segregate the tumorous and non-tumorous data. We place a cropped MRI image in the "yes" set if it has a tumor; otherwise, we keep it in the "no" set. Then, we split the X and y objects into training, validation (development), and testing sets. In the current method, we split the data as follows: 70% of the data for training, 15% for validation, and 15% for testing.
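One common way to realize this 70/15/15 partition is two successive splits; the sketch below assumes scikit-learn and is illustrative of the procedure, not taken from the paper's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_data(X, y, test_size=0.3):
    """Sketch of the 70/15/15 partition: hold out 30% first, then split
    that half-and-half into validation and test sets."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=test_size, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)
    return X_train, y_train, X_val, y_val, X_test, y_test
```

For 2000 samples this yields 1400 training, 300 validation, and 300 test samples, the same 70/15/15 proportions used in the text.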

3.4 Building the Convolution Neural Network Architecture

We are building the CNN architecture (Shown in Fig. 2) as follows:


Each input x of size (240, 240, 3) is used to fed into the CNN based on the concepts
of reinforcement and deep learning. The following layers may be observed:
(a) One zero-padding layer with a p_size of (2, 2).
(b) One CNN layer with no. of filters ←32, while each filter is 7 × 7 in size and
the stride-value is 1.
(c) One layer is for the normalization of batches for values of the pixel
normalization and to achieve computational speed.
(d) One activation function of rectified linear type.
(e) Two MAX_P_ layers with f value of 4 and s value of 4.
(f) One layer is required to flatten the 3D matrices into 1D vectors.

Fig. 2 Proposed architecture of reinforcement learning-based deep-CNN model



(g) One output unit with a fully connected layer; it holds one neuron and a sigmoid activation function for binary classification.
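The layer list (a)-(g) can be sketched in Keras as follows; the layer order, filter counts, and sizes follow the text, while the function and model names are illustrative.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(240, 240, 3)):
    """Sketch of the layered architecture (a)-(g); sizes follow the text,
    everything else is a plausible default."""
    x_in = layers.Input(input_shape)
    x = layers.ZeroPadding2D((2, 2))(x_in)            # (a) zero padding
    x = layers.Conv2D(32, (7, 7), strides=(1, 1))(x)  # (b) 32 filters, 7x7
    x = layers.BatchNormalization()(x)                # (c) batch norm
    x = layers.Activation('relu')(x)                  # (d) ReLU
    x = layers.MaxPooling2D(pool_size=(4, 4))(x)      # (e) max pool, f = s = 4
    x = layers.MaxPooling2D(pool_size=(4, 4))(x)      # (e) second max pool
    x = layers.Flatten()(x)                           # (f) flatten to 1D
    x_out = layers.Dense(1, activation='sigmoid')(x)  # (g) sigmoid output
    return models.Model(inputs=x_in, outputs=x_out, name='BrainTumorCNN')

model = build_model()
model.summary()
```

With a (240, 240, 3) input, the two pool layers reduce the 238 × 238 convolution output to 14 × 14 before the flatten layer and the single sigmoid neuron.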

3.5 Network Training and Performance

To perform the network training, we use the train_data_model algorithm based on Eq. (7), which works as follows.
The performance of the algorithm is calculated with the help of accuracy score and F1 score metrics on the validation dataset and test dataset. Our proposed method can also be represented using the following block diagram (shown in Fig. 3) to understand the algorithms at a glance.

Fig. 3 Block diagram representation of the model
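The scoring step of Sect. 3.5 can be sketched as below: thresholding the sigmoid output at 0.5 gives binary predictions, from which accuracy and F1 are computed. The function name and the toy probabilities are illustrative, not the paper's code or results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def score_predictions(probs, y_true, threshold=0.5):
    """Threshold sigmoid outputs into 0/1 labels, then report the accuracy
    score and F1 score used to assess the model."""
    y_pred = (np.asarray(probs) > threshold).astype(int)
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

# Toy example with made-up probabilities:
acc, f1 = score_predictions([0.9, 0.2, 0.8, 0.4], [1, 0, 1, 1])
```

The same two metrics are reported for the validation and test sets in Sect. 4.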

4 Experimental Results

To evaluate the performance of the proposed method, we consider the Kaggle brain tumor dataset [16], which consists of 1085 tumorous and 980 non-tumorous images. The size of the brain tumor images in the dataset may vary, and they can be of different types (color, gray, etc.), formats (.jpg, .jpeg, etc.), and resolutions. The proposed method is implemented in the Python environment, version 3.7, with the hardware configuration of an Intel Core i5 processor, 8 GB of RAM, and a 4 GB graphics card. We used Anaconda as the Python distribution. Through the Anaconda Navigator, we run the base environment, which gives access to the Jupyter Notebook as an open-source Web application to run our code. After calculating the F1 and accuracy scores, the method produces good results compared with earlier methods.
During the execution of the augmented_path algorithm (Sect. 3.1), we obtain 2065 samples, and X has the shape (2065, 240, 240, 3), where 2065 is the number of samples, 240 is the image width, 240 is the image height, and 3 denotes the number of channels. After running the loading_data_up (Sect. 3.1) and cropping_brain_contour_generation algorithms (Sect. 3.2), we obtain the cropping results for the MRI images (shown in Fig. 4).
After cropping, each image is segregated into two categories, "yes" (tumorous) and "no" (non-tumorous), and we obtain the following results (Fig. 5 shows a portion of the non-tumorous images, and Fig. 6 shows a portion of the tumorous images):

Fig. 4 Image cropping

Fig. 5 Non-tumorous images

Fig. 6 Tumorous images



Table 2 An example of a training phase

Train on 1445 samples, validate on 310 samples
Epoch 1/3
1445/1445 [==============================]—536s 371ms/step—loss: 0.1586—acc: 0.9453—val_loss: 0.4005—val_acc: 0.8548
Epoch 2/3
1445/1445 [==============================]—427s 296ms/step—loss: 0.1244—acc: 0.9647—val_loss: 0.3149—val_acc: 0.9000
Epoch 3/3
1445/1445 [==============================]—429s 297ms/step—loss: 0.1074—acc: 0.9668—val_loss: 0.3118—val_acc: 0.8935
Elapsed time: 0:23:11.9

We split the X and y objects into training, validation (development), and testing sets as follows: 70% for training, 15% for validation, and 15% for testing. After execution, we obtain no_of_training_samples = 1445, no_of_development_samples = 310, and no_of_test_samples = 310; the training shape values are (1445, 240, 240, 3) and (1445, 1), respectively; the development shape values are (310, 240, 240, 3) and (310, 1); and the test shape values are (310, 240, 240, 3) and (310, 1). After developing the proposed reinforcement-based deep-CNN multi-layered model, it is trained through the train_data_model algorithm (discussed in Sect. 3.5); it trains on n_no_of_samples and validates on m_no_of_samples, as may be observed from Table 2. The number of iterations/epochs was varied, and the total time consumed by the algorithm was calculated each time. The layered architecture and hyperparameters of this model can be observed in Fig. 7.
After calculating the loss (shown in Fig. 8) and accuracy (shown in Fig. 9) with respect to training and validation, the following two graphs are generated:
We experiment with our model to achieve the best validation accuracy. After the 23rd iteration, we achieve 91% validation accuracy. On the testing data, the best model gives the following: test loss = 0.33 and test accuracy = 0.89. The F1 score on the test data is 0.88; for the validation data, the model gives a 0.91 F1 score. The percentages of positive (tumorous samples, indicated by the pos variable or "positive" in Table 3) and negative (non-tumorous samples, indicated by the neg variable or "negative" in Table 3) examples/samples are given below.
In the end, we find that our model can be used for brain tumor detection and classification from MRI images with a test_set_accuracy of 88.7% and a test_set_F1_score of 0.88. As the data are balanced, we consider the results satisfactory. Comparing our algorithm with the rest of the classification-based algorithms (given in Table 4), we find a considerably good result.

Fig. 7 The value of hyperparameters used in training the model

Fig. 8 Loss graph

5 Conclusion

This paper illustrates the classification of tumorous and non-tumorous brain MRI images based on a reinforcement learning-based deep-CNN model. The combination of reinforcement learning and deep learning while designing the CNN layered architecture makes our model unique. The proposed method is tested on the Kaggle brain MRI dataset [16]. The experimental results show the effectiveness of the proposed method in terms of accuracy and F1 score. Some concluding observations regarding this paper are as follows. The proposed algorithm gives a better result in terms of accuracy and F1 score than other existing techniques. An accuracy score of 91% for

Fig. 9 Accuracy graph

Table 3 Final result—positive and negative samples

Training data:
Number of examples: 1445
Percentage of positive examples: 52.8719723183391%, number of pos examples: 764
Percentage of negative examples: 47.1280276816609%, number of neg examples: 681
Validation data:
Number of examples: 310
Percentage of positive examples: 54.83870967741935%, number of pos examples: 170
Percentage of negative examples: 45.16129032258065%, number of neg examples: 140
Testing data:
Number of examples: 310
Percentage of positive examples: 48.70967741935484%, number of pos examples: 151
Percentage of negative examples: 51.29032258064516%, number of neg examples: 159

validation data, 89% for test data, and an F1 score of 0.91 on validation data and 0.88 on test data are obtained for our proposed method, as shown in Table 4. It applies to large datasets, and we find it to be very simple and efficient. In the future, the accuracy and F1 score may be improved, and several other metrics, such as precision score, recall score, and ROC, can be included in our proposed technique to measure the performance. In the future, this model can also be utilized to categorize other kinds of cancer or diseases.

Table 4 Classification algorithm comparisons on the considered dataset

| Serial No | Classification algorithm | Accuracy_score (%) validation set | Accuracy_score (%) test set | F1_score validation set | F1_score test set |
| 1 | SVM [12] | 85.68 | 86 | 0.87 | 0.86 |
| 2 | Naïve Bayes [14] | 87 | 86 | 0.86 | 0.85 |
| 3 | Decision tree [7] | 84.61 | 85 | 0.84 | 0.82 |
| 4 | K-nearest neighbor [18] | 86.32 | 86 | 0.86 | 0.87 |
| 5 | ANN [4] | 83.10 | 88 | 0.87 | 0.88 |
| 6 | Fuzzy classifier [9] | 85.23 | 85 | 0.86 | 0.85 |
| 7 | Genetic algorithm [19] | 85.32 | 84 | 0.84 | 0.83 |
| 8 | Proposed method | 91 | 89 | 0.91 | 0.88 |

References

1. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT press, 2018), pp. 56–
58
2. M. Jain, P.S. Tomar, Review of image classification methods and techniques. Int. J. Eng. Res.
Technol. (IJERT) 2(8) (2013). ISSN: 2278-0181
3. D.G. Macenka, L. Hays, A. Varner, E. Weiss, P.Y. Wen, Frankly speaking about cancer: brain tumors. Cancer Support Community and the National Brain Tumor Society web resource. http://blog.braintumor.org/files/public-docs/frankly-speaking-about-cancer-brain-tumors.pdf. Accessed 30 January 2021
4. D.M. Joshi, N.K. Rana, V.M. Misra, Classification of Brain cancer using artificial neural
network, in 2nd International Conference on Electronic Computer Technology (IEEE, 2010),
pp. 112–116. https://doi.org/10.1109/ICECTECH.2010.5479975
5. M.U. Akram, A. Usman, Computer-aided system for brain tumor detection and segmentation,
in International Conference on Computer Networks and Information Technology (IEEE, 2011),
pp. 299–302. https://doi.org/10.1109/ICCNIT.2011.6020885
6. I. Maiti, M. Chakraborty, A new method for brain tumor segmentation based on watershed and edge detection algorithms in HSV color model, in 2012 National Conference on Computing and Communication Systems (IEEE, 2012), pp. 1–5. https://doi.org/10.1109/NCCCS.2012.6413020
7. D. Sridhar, I.M. Krishna, Brain tumor classification using discrete cosine transform and
probabilistic neural network, in 2013 International Conference on Signal Processing, Image
Processing & Pattern Recognition (IEEE, 2013), pp. 92–96. https://doi.org/10.1109/ICSIPR.
2013.6497966
8. H.S.M. Abdulbaqi, Z. Mat, A.F. Omar, I.S.B. Mustafa, L.K. Abood, Detecting brain tumor in
magnetic resonance images using hidden Markov random fields and threshold techniques, in
2014 IEEE Student Conference on Research and Development (IEEE, 2014), pp. 1–5. https://
doi.org/10.1109/SCORED.2014.7072963

9. Parveen, A. Singh, Detection of a brain tumor in MRI images, using a combination of fuzzy
c-means and SVM, in 2015 2nd International Conference on Signal Processing and Integrated
Networks (SPIN) (IEEE, 2015), pp. 98–102. https://doi.org/10.1109/SPIN.2015.7095308
10. R. Ahmmed, M.F. Hossain, Tumor detection in brain MRI image using template-based K-
means and fuzzy C-means clustering algorithm, in 2016 International Conference on Computer
Communication and Informatics (ICCCI) (IEEE, 2016), pp. 1–6. https://doi.org/10.1109/
ICCCI.2016.7479972
11. S. Banerjee, S. Mitra, B.U. Shankar, Synergetic neuro-fuzzy feature selection and classification
of brain tumors, in 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (IEEE,
2017), pp. 1–6. https://doi.org/10.1109/FUZZ-IEEE.2017.8015514
12. F.P. Polly, S.K. Shil, M.A. Hossain, A. Ayman, Y.M. Jang, Detection and classification of
HGG and LGG brain tumor using machine learning, in 2018 International Conference on
Information Networking (ICOIN) (IEEE, 2018), pp. 813–817. https://doi.org/10.1109/ICOIN.
2018.8343231
13. M. Gurbină, M. Lascu, D. Lascu, Tumor detection and classification of MRI brain image
using different wavelet transforms and support vector machines, in 2019 42nd International
Conference on Telecommunications and Signal Processing (TSP) (IEEE, 2019), pp. 505–508.
https://doi.org/10.1109/TSP.2019.8769040
14. D. Divyamary, S. Gopika, S. Pradeeba, M. Bhuvaneswari, Brain tumor detection from MRI
images using Naive classifier, in 2020 6th International Conference on Advanced Computing
and Communication Systems (ICACCS) (IEEE, 2020), pp. 620–622. https://doi.org/10.1109/
ICACCS48705.2020.9074213
15. A. Biswas, M.S. Islam, Brain tumor types classification using K-means clustering and ANN
approach, in 2021 2nd International Conference on Robotics, Electrical and Signal Processing
Techniques (ICREST) (IEEE, 2021), pp. 654–658. https://doi.org/10.1109/ICREST51555.
2021.9331115
16. N. Chakrabarty, Brain MRI images for brain tumor detection. Kaggle web resource. https://
www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection. Accessed 1 April
2021
17. K.H. Kalan Farmanfarma, M. Mohammadian, Z. Shahabinia, S. Hassanipour, H. Salehiniya,
Brain cancer in the world: an epidemiological review. World Cancer Res. J. (WCRJ) 7(1356)
(2019). https://doi.org/10.32113/wcrj_20197_1356
18. Aiwale, Ansari, Brain tumor detection using KNN. https://doi.org/10.13140/RG.2.2.35232.12800
19. G. Rajesh Chandra, K.R.H. Rao, Tumor detection in brain using genetic algorithm. Proc.
Comput. Sci. 79, 449–457 (2016). ISSN 1877-0509. https://doi.org/10.1016/j.procs.2016.
03.058
20. The Human Brain Atlas, Protein Atlas web resource. https://www.proteinatlas.org/humanproteome/brain. Accessed 5 Aug 2021
Social Distance Monitoring and Face
Mask Detection Using Deep Learning

K. Yagna Sai Surya, T. Geetha Rani, and B. K. Tripathy

Abstract As there is no evidence of COVID-19 slowing down in several parts of the world, maintaining "social distancing" (also referred to as physical distancing) between people indoors and outdoors is more vital than ever. It is recommended that two people keep a distance of 1.8 m (approximately 6 feet) apart. Python can be used to detect people and monitor social distancing. Deep learning, TensorFlow, Keras, and OpenCV are used to recognize masks; the system uses an effective computer vision-based technique that focuses on automatic real-time monitoring of individuals to detect safe social distance twenty-four hours a day, seven days a week, for inspectors in public places, commercial centers, and other locations.

Keywords Face mask · Social distance · Deep learning · COVID-19

1 Introduction

Monitoring social distancing regulations and manually checking people's masks are likely to lead to a scarcity of resources and to allow errors to creep in due to human intervention. There is an urgent need to curb virus transmission by studying ideal social distancing rules. This system includes the detection of people violating social distance regulations and the classification of masks. Citizens' safety is ensured with strict rules by checking whether sufficient distance and mask usage are observed. The availability of several cameras in public places has improved

K. Yagna Sai Surya · T. Geetha Rani · B. K. Tripathy (B)
SITE, VIT, Vellore, Tamil Nadu 632014, India
e-mail: tripathybk@vit.ac.in
K. Yagna Sai Surya
e-mail: kothayagna.saisurya2018@vitstudent.ac.in
T. Geetha Rani
e-mail: thotageetha.rani2018@vitstudent.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 461
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_36

the enormous applicability of the system, and the processing techniques involved can be used to keep an eye on many utilities. This piece of work discusses a lightweight system that is likely to help in the prevention of COVID-19, as it uses video surveillance to identify and confirm that social distancing is maintained by everyone, so that the spread can be reduced, which further helps slow down virus transmission.

2 Literature Survey

A virtual model of social distancing was evaluated to help remind people in public places, using a distance ruler to measure the gap. The method involves two-dimensional person recognition, person recognition from different angles, social distance monitoring, and mask recognition [1]. Another proposed method balances ResNet-50 against other standard machine learning classifiers (support vector machine (SVM), decision tree, and ensemble) to enhance the performance of the model [2], achieving reasonable accuracy (68%) for the classification of fake masks. As per the suggestions in [3], integration of inverse perspective mapping (IPM) technology with the proposed deep neural network (DNN) is more effective; the SORT tracking algorithm provides accurate and universal human tracking by detection, which can be used in other applications.
Context-attention RCNN is an architecture capable of detecting face mask-wearing conditions; it enlarges intra-class differences and extracts discriminating features for this purpose [4]. Another model, called SocialdistancingNet-19, can detect a person's frame and display a label; if the distance is below a certain value, it can help determine whether the situation is safe or dangerous, using the center of gravity to calculate the distance between people [5].
One of the accurate face mask detectors is the RetinaFace Mask [6]. It detects in a single stage and uses a feature pyramid network to fuse multiple feature maps with high-level semantic information, with a context-attention module for detecting face masks [7]. Other research proposed an intelligent system based on thermal imaging to classify people's social distance and implemented an algorithm to measure and classify the distance among people and automatically check and monitor whether social distance rules are followed or violated [8, 9].

A human identification system was developed to monitor social distancing and security breaches during the pandemic [10]. A pre-trained CNN model is used to identify the different subjects [11–13]; human recognition is done by segmenting the detected spots, and each location is then tracked. Using the single shot detector (SSD) deep learning method, MobileNetV2, and OpenCV, both social distance and masks can be determined [14, 15]. The YOLOv3 model, which identifies and detects objects, and the OpenCV image processing library are used to bootstrap such a project [16]. This kind of system plays a major role in areas where many people are expected, such as shopping malls, movie theaters, or airports, and helps ensure that people follow social distancing protocols.
A newer facial recognition approach combines principal component analysis (PCA) and convolutional neural networks, with Faster-RCNN used for detecting humans in images [17]. From a higher viewpoint, the shapes of people differ considerably; using transfer learning, a new training stage is integrated with the previously trained architecture [18], improving overall performance to 96% accuracy with a 0.6% false alarm rate. Considering factors such as rapid data collection, evaluation of policy accuracy, and adjustment of responsive policies, which indicate an understanding of the economic and public-health consequences, a cost–benefit analysis was performed in [19].

3 Method

The methods in this paper use deep learning. We use OpenCV, Keras, and TensorFlow to detect masks, with MobileNetV2 as the basis of the classifier. A CNN architecture with weights trained beforehand is defined through YOLO object detector files; OpenCV's DNN module is compatible with this YOLO model. Object recognition identifies every person (only people) in the stream, and the Euclidean distance between all recognized people is calculated. These distances (Fig. 1) are used to check whether the gap between two people is less than N pixels and to display safe or unsafe. If a person wears a mask and follows the social distance protocol, they are in a safe zone; if a person does not wear a mask, they are shown in a red rectangular box together with an alert message.
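The pixel-threshold check described above can be sketched in pure Python. This is a minimal illustration, not the authors' code: the threshold of N = 50 pixels and the centroid coordinates are assumed values.

```python
import math

def label_pairs(centroids, n_pixels=50):
    """Label every detected pair 'unsafe' if their centroid distance
    falls below the pixel threshold, otherwise 'safe'."""
    labels = {}
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = math.dist(centroids[i], centroids[j])
            labels[(i, j)] = "unsafe" if d < n_pixels else "safe"
    return labels

# Three people detected in the frame (illustrative pixel coordinates)
people = [(100, 200), (120, 210), (400, 220)]
print(label_pairs(people))
```

The first two people are about 22 pixels apart and would be flagged, while the third is far from both.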

3.1 Dataset

We train our dataset for the face mask detection model in this module. The dataset contains two sub-folders, with_mask (WM) and without_mask (WOM), each holding 2,000 sample images. The face images, with and without masks, have an average width of 278.77 pixels and an average height of 283.68 pixels.
464 K. Yagna Sai Surya et al.

Fig. 1 Workflow of the social distance between the people in the frame

3.2 Training

Using TensorFlow/Keras, the model is trained and the face mask detector (MD) is serialized to disk. Once MD is trained on the images from the dataset, we load it, start detecting faces, and classify each face as WM (a mask is detected) or WOM (no facial mask is detected). We cover each stage and its sub-divisions in detail in the rest of this guide, but first let us look at the dataset used to train the COVID-19 facial mask detector. The process comprises three steps, as detailed below.
(1) Load MobileNetV2 with pre-trained ImageNet weights, leaving off the head of the network.
(2) Construct a new fully connected (FC) head and append it to the base in place of the old head.
(3) Freeze the base layers of the network. Table 1 shows the benchmarks of this model.

Table 1 Evaluating the network benchmarks (h = 0.99)

Heading level   Precision   Recall   F1-score   Support
WM              H           H        H          433
WOM             H           H        H          386
Accuracy                             H          819
Macro-avg       H           H        H          819
Weighted avg    H           H        H          819
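The three training steps can be sketched with Keras as below. This is a sketch, not the authors' exact code: the 224×224 input size, head layer sizes, and learning rate are assumptions, and `weights=None` is passed here only so the sketch runs offline (the paper loads pre-trained ImageNet weights, i.e. `weights="imagenet"`).

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Step 1: load the MobileNetV2 base, leaving off the classification head
# (the paper loads pre-trained ImageNet weights here).
base = MobileNetV2(weights=None, include_top=False,
                   input_tensor=Input(shape=(224, 224, 3)))

# Step 2: build a new FC head and place it on top of the base.
x = AveragePooling2D(pool_size=(7, 7))(base.output)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)
x = Dropout(0.5)(x)
out = Dense(2, activation="softmax")(x)   # WM vs. WOM
model = Model(inputs=base.input, outputs=out)

# Step 3: freeze the base so only the new head trains.
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```

With the base frozen, back-propagation updates only the new head, which matches the training description in Sect. 3.3.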

3.3 Face Mask Detection

To train a customized mask detection mechanism, we divided our design into two independent stages, each further divided into sub-stages (Fig. 2).
The base layer weights are frozen and therefore not updated during back-propagation of errors; only the new head layers are adjusted. Finally, the model is compiled with the Adam optimizer, binary cross-entropy loss, and a learning-rate reducer. If more than two classes are observed, categorical cross-entropy must be used instead.
After training, the resulting model is evaluated on the test set. The next step is to make predictions on the test set to obtain the most likely category label index, then print the classification report to the terminal and serialize the mask classification model to disk. Finally, the accuracy and loss curves of the model are drawn; once the plot is ready, it is automatically saved to the plot file path on disk.

Fig. 2 Phases and steps to train a custom mask detector



Fig. 3 Plot diagram of accuracy and curves of the face detection model

Algorithm For Mask Detection

Step 1: Start video streaming to detect people
cap = cv2.VideoCapture(camera_no)
Step 2: Store the results of the detected people and label them
Outcome = detect_people(image, net, ln,
    personIndex=LABELS.index("PERSON"))
Step 3: Mask detection for an image (extract the face ROI)
for (a, b, wid, hei) in face:
    face_img = img[b:b + hei, a:a + wid]
Step 4: Predict the MASK or NO MASK case
prediction = mymodel.predict(test_image)[0][0]
if prediction == 1:
    label = "NO MASK"
else:
    label = "MASK"
Step 5: Test on the video stream
(mask, with_out_mask) = maskNet.predict(face)[0]
Step 6: Colour and label as mask or no mask based on the case
if mask > with_out_mask:
    label = "MASK"
    colour = (0, 255, 0)
else:
    label = "NO MASK"
    colour = (0, 0, 255)

3.4 Social Distance Monitoring

Before calculating Euclidean distances to determine the spacing between people, we must count the number of people in the frame. To do this, import the packages NMS_THRES, MIN_CONF, People_Counter, NumPy, and the computer vision library. A blob is created from the frame and passed through the YOLO object detector, which outputs the bounding boxes and associated probabilities. The bounding boxes, centroids, and confidences are then initialized accordingly.
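The blob step can be illustrated without OpenCV: `cv2.dnn.blobFromImage` essentially rescales pixel values, swaps BGR to RGB, and reorders the frame from HWC into NCHW layout. Below is a minimal NumPy sketch of that preprocessing; the 416×416 YOLO input size is an assumption, and resizing is omitted for brevity.

```python
import numpy as np

def frame_to_blob(frame, scale=1 / 255.0, swap_rb=True):
    """Mimic the core of cv2.dnn.blobFromImage: scale pixel values,
    swap BGR -> RGB, and reorder HxWxC into a 1xCxHxW blob.
    (Resizing to the network input size is omitted here.)"""
    img = frame.astype(np.float32) * scale
    if swap_rb:
        img = img[:, :, ::-1]              # BGR -> RGB
    blob = img.transpose(2, 0, 1)[None]    # HWC -> 1xCxHxW
    return blob

frame = np.zeros((416, 416, 3), dtype=np.uint8)  # a dummy 416x416 frame
print(frame_to_blob(frame).shape)  # (1, 3, 416, 416)
```

The resulting 4-D blob is what gets passed to the network's forward pass.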
Algorithm For Social Distance Detection

Step 1: Start video streaming to detect people
cap = cv2.VideoCapture(camera_no)
Step 2: Store the results of the detected people and label them
Outcome = detect_people(image, net, ln,
    personIndex=LABELS.index("PERSON"))
Step 3: Calculate the distance between two people
dist = math.sqrt(a_dist * a_dist + b_dist * b_dist)
distance.append(dist)
Step 4: Extract and compute the Euclidean distances between all pairs of
centroids from the results
Centroids_result = np.array([r[2] for r in Outcome])
d = dist.cdist(Centroids_result, Centroids_result, metric="euclidean")
Step 5: Check whether the distance between any two centroid pairs is less
than the given number of pixels
if d[i, j] < 50:
    serious.add((i, j))   # update with the serious centroid pair indexes
elif d[i, j] < 80:
    abnormal.add((i, j))  # below the max distance limit but not serious
Step 6: Loop over the results and initialize the colour of each observation
for (i, (prob, bbox, centroid)) in enumerate(Outcome):
    if i in serious:
        color = (255, 0, 0)
    elif i in abnormal:
        color = (0, 255, 255)
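Steps 4 and 5 can be made concrete with NumPy alone: build the pairwise Euclidean distance matrix between all centroids and collect the index pairs falling under the two thresholds. The 50- and 80-pixel limits follow the listing; the centroid coordinates are illustrative assumptions.

```python
import numpy as np

def find_violations(centroids, serious_px=50, abnormal_px=80):
    """Return the index pairs whose centroid distance is below the
    serious threshold, and the pairs below the larger abnormal one."""
    c = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distances via broadcasting (equivalent to cdist)
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    serious, abnormal = set(), set()
    for i in range(len(c)):
        for j in range(i + 1, len(c)):
            if d[i, j] < serious_px:
                serious.add((i, j))
            elif d[i, j] < abnormal_px:
                abnormal.add((i, j))
    return serious, abnormal

s, a = find_violations([(0, 0), (30, 0), (95, 0), (300, 0)])
print(s, a)  # {(0, 1)} {(1, 2)}
```

Pair (0, 1) is 30 px apart (serious), pair (1, 2) is 65 px apart (abnormal), and the last person is far from everyone.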

4 Procedure

With an eye on the property of YOLO, which returns the center coordinates first and then the height and width, the bounding box coordinates are scaled back; from the center of the bounding box, its left and top corners can be generated.
To calculate the Euclidean distance, the TensorFlow, Keras, NumPy, and SciPy libraries are imported; SciPy is used to compute the Euclidean distance. A file named coco.names (from the pre-trained model) (Fig. 4) contains the list of 80 object classes the model can recognize; the model is trained only for these 80 object classes.
The network object is ready for execution, throws an exception in failure cases, and filters detections against the confidence threshold of 0.4. The webcam sets the value in VideoCapture() to 1.2. Video streaming is then started to capture real-time images for face detection and verification and to maintain social distance monitoring. The a and b lists are initialized, and all points such as (x, y) and (w, h) are appended to x and y. After calculating the distance between two points using the formula √(x² + y²), the result is converted from a dictionary into a list.

Fig. 4 Flowchart of social distance monitoring
The bounding box and centroid coordinates are extracted, and the colour of the annotation is then initialized. If a recorded pair lies within the abnormal/violation sets, the colour is updated. A bounding box is drawn around each individual, and the direction in which the individual's centroid moves is indicated.
As per Fig. 5, the distance between two or more persons is evaluated with the Euclidean distance method, and the centroid of each detected person is found to check the safe distance as well. Finally, the results are displayed and clean-up is performed.

Fig. 5 Flowchart to find the number of people in proximity



5 Experiment

The model is now tested with input images of people in groups (Fig. 6). Figure 8 shows people wearing masks, while Fig. 10 shows people not wearing masks. In Figs. 7 and 9, many violations are identified and an alert message is displayed in red, indicating that people are not following the measures, whereas in Fig. 11 an alert message is displayed asking people to wear a mask.

Fig. 6 People in crowded areas

Fig. 7 Monitoring and displaying alerts



Fig. 8 Group of people with face mask

Fig. 9 Displaying alert to maintain distance

5.1 Detecting Social Distance from an Input Image

See Figs. 6, 7, 8 and 9.



Fig. 10 People with and without mask

Fig. 11 Displaying “mask,” “no mask”

5.2 Detecting Facial Mask from an Input Image

See Figs. 10 and 11.

6 Results

People wearing masks are in a safe area; this is displayed as a green rectangle with a safety notice. If a person is not wearing a mask (Fig. 12), they are displayed as a red rectangle with a warning.

Fig. 12 Displaying “no mask”, case 1: without_mask

Similarly, if they maintain social distance, Fig. 13 shows a green rectangular box with a safe alert message. If they do not follow social distancing and do not wear a mask (Fig. 14), the system shows an alert message with a red rectangular box, as shown in Fig. 15.
Here, we considered the different cases shown in Figs. 12, 13, 14 and 15.

7 Conclusion

We obtained high accuracy using reasonable machine learning tools and basic techniques, and several different applications can use this method. A red frame and a red line are used to mark the distance between pedestrians walking on the street, and the approach is confirmed on video. The visual results show that the proposed method can identify social distancing violations, and it can be developed for other environments, such as offices and restaurants. Public health systems can be greatly strengthened by deploying this model.

Fig. 13 Displaying “safe”, case 2: with_mask

Fig. 14 Evaluating social distance and face mask, case 3: with more people in the frame

Fig. 15 Displaying alerts considering both social distance and face mask, case 4: with more people
in the frame

8 Future Scope

Detection of proper or improper use of masks can be an extension of this work.


Further, it can be extended to find the type and category of masks used by a person.
The system will operate effectively once the current lockdown is lifted and will help monitor public places. The risk of COVID-19 in offices will not disappear anytime soon; therefore, reopened offices can use this system to ensure that social distancing continues until the risk of COVID-19 disappears.

References

1. M. Sohan, So you need datasets for your COVID-19 detection research using machine learning.
arXiv:2008.05906
2. M. Loey, G. Manogaran, M.H.N. Taha, N.E.M. Khalifad, A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic (2021). PMID: 32834324, PMCID: PMC7386450. https://doi.org/10.1016/j.measurement.2020.108288
3. M. Rezaei, M. Azarmi, DeepSOCIAL: social distancing monitoring and infection risk assessment in COVID-19 pandemic. MDPI (2020). https://doi.org/10.1101/2020.08.27.20183277
4. J. Zhang, F. Han, Y. Chun, W. Chen, A novel detection framework about conditions of wearing face mask for helping control the spread of COVID-19. IEEE (2021). https://doi.org/10.1109/ACCESS.2021.3066538
5. U. Singhania, B.K. Tripathy, Text-based image retrieval using deep learning, in Encyclopedia
of Information Science and Technology, 5th edn. (IGI Global, USA, 2020), pp. 87–97

6. V. Prakash, B.K. Tripathy, Recent advancements in automatic sign language recognition (SLR),
in Computational Intelligence for Human Action Recognition (CRC Press, 2020), pp. 1–24
7. M. Jiang, X. Fan, H. Yan, RetinaMask: a face mask detector (2020) (this version, v2). arXiv:
2005.03950
8. S. Saponara, A. Elhanashi, A. Gagliardi, Implementing a real-time, AI-based, people detection and social distancing measuring system for Covid-19. J. Real-Time Image Proc. (2021). https://doi.org/10.1007/s11554-021-01070-6
9. Vinitha, Velantina, Social distancing detection system with artificial intelligence using
computer vision and deep learning. Int. Res. J. Eng. Technol. (IRJET) (2020). e-ISSN:
2395-0056
10. R. Debgupta, B.B. Chaudhuri, B.K. Tripathy, A wide ResNet-based approach for age and
gender estimation in face images, in Proceedings of International Conference on Innovative
Computing and Communications (Springer, Singapore, 2020), pp. 517–530
11. M. Cristani, A.D. Bue, V. Murino, F. Setti, A. Vinciarelli, The visual social distancing problem.
IEEE Access 8, 126876–126886 (2020). https://doi.org/10.1109/ACCESS.2020.3008370
12. A. Adate, B.K. Tripathy, Understanding single image super-resolution techniques with genera-
tive adversarial networks, in Advances in Intelligent Systems and Computing, vol. 816 (Springer,
Singapore, 2019), pp. 833–840
13. A. Adate, B.K. Tripathy, Deep learning techniques for image processing, in Machine Learning
for Big Data Analysis. (Boston, De Gruyter, Berlin, 2018), pp. 69–90
14. P. Nagrath, R. Jain, A. Madana, R. Arora, P. Kataria, J. Hemanth, SSDMNV2: a real-time DNN-
based face mask detection system using single shot multibox detector and MobileNetV2. 66,
102692 (2021)
15. S. Yadav, Deep learning-based safe social distancing and face mask detection in public areas for COVID19 safety guidelines adherence. IJRASET 8(VII), 4 (2020). https://doi.org/10.22214/ijraset.2020.30560
16. S. Srivastava, I. Gupta, G. Upadhyay, U. Goradiya, Social distance detector using YOLO v3. Int. Res. J. Eng. Technol. (IRJET) (2021). e-ISSN: 2395-0056
17. A.H. Ahamad, N. Zaini, M.F.A. Latip, Person detection for social distancing and safety violation alert based on segmented ROI, in 10th IEEE International Conference on Control System, Computing and Engineering (ICCSCE) (Penang, Malaysia, 2020), pp. 113–118. https://doi.org/10.1109/ICCSCE50387.2020.9204934
18. I. Ahmed, M. Ahmad, J.J.P.C. Rodrigues, G. Jeon, S. Din, A deep learning-based social distance
monitoring framework for COVID-19 (2020). https://doi.org/10.1016/j.scs.2020.102571
19. L. Thunström, S.C. Newbold, D. Finnoff, M. Ashworth, J.F. Shogren, The benefits and costs of
using social distancing to flatten the curve for COVID-19, Cambridge University Press (2020)
An Effective VM Consolidation
Mechanism by Using the Hybridization
of PSO and Cuckoo Search Algorithms

Sudheer Mangalampalli , Pokkuluri Kiran Sree, S. S. S. N. Usha Devi N,


and Ramesh Babu Mallela

Abstract VM consolidation is one of the prodigious challenges in cloud computing, as VMs have to be placed automatically into physical machines based on the load running on each machine, i.e., whether the host is in an overloaded or underloaded condition. VM consolidation is enacted based on this condition. Energy consumption in data centers is another huge challenge: consolidating VMs onto fewer physical machines based on these conditions reduces energy consumption in the data centers, which is a major advantage for the cloud provider. Many authors have proposed VM consolidation algorithms addressing energy consumption as a parameter, but those algorithms do not meet the required standards in terms of energy consumption. In this paper, we propose a new hybridized meta-heuristic approach combining particle swarm optimization (PSO) and Cuckoo Search (CS) for consolidating VMs based on a VM status index, thereby addressing energy consumption as a parameter. The work is simulated on Cloudsim, with the workload generated randomly and given as input to the algorithm. To evaluate the algorithm's efficiency in terms of energy consumption, we compared the proposed approach against the existing PSO and CS algorithms. Simulation results reveal that our proposed approach improves significantly over the compared algorithms on the mentioned parameters.

Keywords Cloud computing · Particle swarm optimization (PSO) · Cuckoo search (CS) · VM consolidation · Energy consumption · VM—virtual machines

S. Mangalampalli (B) · P. K. Sree · R. B. Mallela


Department of CSE, Shri Vishnu Engineering College for Women, Bhimavaram, AP, India
R. B. Mallela
e-mail: rameshbabucse@svecw.edu.in
S. S. S. N. Usha Devi N
Department of CSE, University College of Engineering, JNTUK, Kakinada, AP, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 477
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_37
478 S. Mangalampalli et al.

1 Introduction

Cloud computing is a model combining the grid, cluster, and utility computing models, through which cloud users are provided computing, storage, and network services flexibly and on demand using the pay-per-use model. NIST [1] has defined cloud computing clearly and concisely. Cloud environments can be deployed in different ways, as public, private, and hybrid models.
The architectures of cloud computing are categorized in two ways, i.e., generic architecture and market-oriented architecture [2]. Every cloud application deployed in a cloud environment needs a front-end application through which the cloud console is connected via a network, while at the back end data is stored in virtualized storage across different data centers. This is the typical requirement for any cloud application, and it is represented as an architecture [2]. Commercial cloud vendors customize this architecture according to their needs. In these architectures, cloud users initially submit requests to the cloud interface, but the requests are handled by a broker policy maintained by the cloud provider. The requests are carried to the task manager on behalf of the customer, and the task manager passes them to the scheduler, which maps tasks onto VMs. VM consolidation is a prodigious challenge because the load on the physical machines varies, i.e., they are either overloaded or underloaded. If we can consolidate VMs effectively onto physical hosts, we can reduce energy consumption in the data centers. Many authors have used metaheuristic algorithms for mapping VMs to physical machines, but the existing algorithms have not shown a significant impact on energy consumption in cloud computing. In this paper, we propose an algorithm hybridizing the PSO and CS algorithms, which effectively consolidates VMs based on a VM status index and thereby minimizes energy consumption in data centers.
The highlights of this paper are presented below.
1. A VM Consolidation mechanism is proposed based on the hybridization of PSO
and CS algorithms.
2. We have used a VM status index based on utilization of CPU which is directly
related to energy consumption.
3. Simulation is carried out by using Cloudsim Simulator [3] and workload is
generated randomly in Cloudsim.
4. Makespan and energy consumption are considered as parameters, and our proposed approach is compared with the existing PSO and CS algorithms on these parameters.
The rest of the manuscript is organized as follows. Section 2 describes related
works, Sect. 3 describes problem formulation and proposed system architecture,
Sect. 4 describes proposed algorithm, Sect. 5 describes Simulation and Results and
finally, Sect. 6 describes Conclusion and Future Work.

2 Related Works

In [4], VMs are mapped to appropriate physical machines by classifying them according to their processing capacity. Simulation was carried out on Cloudsim and compared against the Round Robin and FCFS algorithms; results showed that it outperforms the compared algorithms in terms of the mentioned metrics.
In [5], VMs are mapped efficiently onto physical machines in different data centers, modeled using ILWOA (“Improved Levy based Whale Optimization Algorithm”). It was implemented in Cloudsim, with standard datasets used to evaluate its efficacy. It was evaluated against PSO, GA (“Genetic Algorithm”), WOA (“Whale Optimization Algorithm”), MVO (“Multiverse Optimizer”), and ITHS (“Intelligent Tuned Harmony Search”), improving over the compared algorithms on the mentioned metrics.
In [6], VMs are mapped optimally onto physical hosts, addressing energy consumption and SLA violations as metrics. A stochastic model is used for reserving VMs on physical machines, predicting underload, and making placement decisions accordingly. It was simulated against THR, MAD, IQR, LR, and LRR and obtained better results on the mentioned parameters.
In [7], consolidation of VMs onto physical machines in the data center was proposed using the dragonfly algorithm. It was compared with variations of the dragonfly algorithm and outperformed the existing algorithms by minimizing resource wastage.
In [8], minimization of energy consumption and resource wastage was addressed. A binary PSO algorithm was used to solve VM consolidation in cloud computing; compared against PSO, it showed great improvement on the above-mentioned parameters.
In [9], power consumption and resource wastage were addressed as parameters, using a hybrid multi-objective whale optimization methodology. It was compared with the existing First Fit, VMPACS (“Virtual Machine Placement Ant Colony System”), and MBFD (“Modified Best Fit Decreased”) algorithms and outperformed them on the specified parameters.
In [10], power consumption and resource wastage were also addressed as parameters. The DCMSCAALO (“Discrete Chaotic Multiobjective Sine Cosine and Ant Lion Optimizer”) algorithm was used to solve VM consolidation in cloud computing. The workload was generated randomly in Cloudsim and evaluated against the First Fit, VMPACS (“Virtual Machine Placement Ant Colony System”), and MGGA algorithms, outperforming the compared approaches on the specified parameters.
In [11], energy consumption and SLA violations were considered as parameters. An enhanced CS algorithm was used to solve VM consolidation in cloud computing; it was compared with the existing OFS (“Online Feature Selection”), ACO (“Ant Colony Optimization”), and GA (“Genetic Algorithm”) algorithms and showed a strong improvement over the existing approaches on the mentioned parameters.

As the above-mentioned work shows, earlier authors have proposed several VM and server consolidation mechanisms using different strategies, most of them adopting nature-inspired algorithms as the design methodology, and many trying to address energy consumption and resource wastage in data centers. In this paper, we also address energy consumption, using the hybridization of the PSO and CS algorithms as the methodology. The section below formally describes the problem formulation and the proposed system architecture.

3 Problem Formulation and Proposed System Architecture

In this section, we formulate the problem systematically. Assume we have k tasks {ta1, ta2, ta3, ta4, ta5, …, tak}, n VMs represented as {Vm1, Vm2, Vm3, Vm4, …, Vmn}, i physical machines indicated as {Ph1, Ph2, Ph3, Ph4, …, Phi}, and j data centers represented as {d1, d2, d3, d4, …, dj}. These n virtual machines are to be intelligently mapped onto the i physical machines based on the status given by the resource manager in the cloud paradigm, thereby addressing the energy efficiency parameter.
For effective allocation of VMs onto physical hosts, we need to identify the status of the virtual machines whenever tasks arrive on them, and we keep a threshold value, i.e., a VM status index based on CPU utilization.
We assume the VM status index takes three values, i.e., under-utilized, normally utilized, and over-utilized VMs. The rules for the VM status index are:
• If CPU utilization is less than 25%, the VM is under-utilized and the VM status index is set to 1.
• If CPU utilization is between 25 and 59%, the VM is in normal condition and the VM status index is set to 2.
• If CPU utilization is more than 60%, the VM is over-utilized and the VM status index is set to 3.
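These rules can be written directly as a small helper function. This is a sketch; the narrow gap the text leaves between 59% and 60% is closed at the 60% boundary here.

```python
def vm_status_index(cpu_utilization_pct):
    """Map CPU utilization (%) to the VM status index used by the
    resource manager: 1 = under-utilized, 2 = normal, 3 = over-utilized."""
    if cpu_utilization_pct < 25:
        return 1   # under-utilized
    elif cpu_utilization_pct < 60:
        return 2   # normal condition
    else:
        return 3   # over-utilized

print([vm_status_index(u) for u in (10, 40, 75)])  # [1, 2, 3]
```

The load balancer can then use this index to decide which host a VM should be consolidated onto.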
In the proposed system architecture, users first submit tasks to the cloud console; these tasks go to the task manager and, based on task size and processing capacity, are given as input to the scheduler, which maps tasks to appropriate VMs. Every new VM is then allocated onto a physical machine residing in the data centers based on the VM status index given by the resource manager.
In the cloud computing paradigm, a load balancer connected to the resource manager verifies whether the VMs allocated onto the physical machines are appropriate and whether any physical machine is overloaded, underloaded, or balanced.
In this paper, we use the VM status index as the parameter for knowing the status of VMs; based on that index, the load balancer balances and allocates the corresponding VM onto a physical host, thereby addressing the parameters of energy consumption and makespan in cloud computing. The proposed architecture is shown in Fig. 1.

Fig. 1 Proposed system architecture
The following are the metrics we need to address in this algorithm.

3.1 Makespan

The primary objective of any cloud computing paradigm is the minimization of makespan, so we address this objective to prove the efficiency of the algorithm. Makespan is defined as the “total time taken for a task to complete its execution” [12] and is indicated as shown in Eq. 1:

m_t = avail_k + exec_tn   (1)

where avail_k indicates the availability of a VM and exec_tn indicates the execution time of a task on the corresponding virtual machine.

3.2 Energy Consumption

Our next objective is to minimize energy consumption in data centers by using an appropriate VM consolidation mechanism. Energy consumption in cloud computing depends solely on the energy consumed during the CPU's computation time, incurred in both active time and idle time. It is represented as

e_cs(vm_n) = ∫₀^{m_t} ( e_cs^comp(vm_n, t) + e_cs^idle(vm_n, t) ) dt   (2)

Overall energy consumption can then be represented as

e_cs = Σ_n e_cs(vm_n)   (3)

From Eqs. 2 and 3, we calculate the total energy consumption in cloud computing.
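In a discrete-time simulation such as Cloudsim, the integral in Eq. 2 is approximated by summing per-interval power draw over the run, and Eq. 3 then sums across VMs. A minimal sketch follows; the power samples are illustrative assumptions, not values from the paper.

```python
def vm_energy(comp_power, idle_power, dt=1.0):
    """Discrete approximation of Eq. 2: sum the computation-time and
    idle-time power samples of one VM over intervals of length dt."""
    return sum((c + i) * dt for c, i in zip(comp_power, idle_power))

def total_energy(per_vm_samples, dt=1.0):
    """Eq. 3: total data-center energy is the sum over all VMs."""
    return sum(vm_energy(c, i, dt) for c, i in per_vm_samples)

# Two VMs, three 1-second samples each (watts; illustrative values)
vms = [([30.0, 32.0, 31.0], [5.0, 5.0, 5.0]),
       ([20.0, 22.0, 21.0], [4.0, 4.0, 4.0])]
print(total_energy(vms))  # 183.0
```

Shrinking dt refines the approximation of the continuous integral in Eq. 2.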

3.3 Fitness Function

We use a nature-inspired algorithm hybridizing the PSO and CS algorithms to evaluate the efficiency of the proposed approach, minimizing energy consumption and makespan by intelligently mapping VMs onto physical machines based on the VM status index. The following equation represents the fitness function through which energy consumption and makespan are minimized:

f(x) = min_x ( m_t(x), e_cs(x) )   (4)

3.4 Hybridization of Cuckoo Search and Particle Swarm Optimization Algorithm (Hybrid CSPSO)

Hybrid CSPSO was proposed in [13], where CS is used as the local search process and PSO as the global search process, because the search space in cloud computing is dynamic and a single algorithm, either PSO or CS, cannot optimize the solution alone. Each algorithm has its limitations: PSO can become trapped in local minima, while CS suits local search and, in the classical cuckoo search, the step size is limited to 1. The equations below describe the hybridization of the CS and PSO algorithms:

Y_i(k + 1) = Y_i(k) + b ⊕ Levy(s, λ)   (5)

Levy(s, λ) ~ s^(−λ), (1 < λ ≤ 3)   (6)

V_ij(k + 1) = V_ij(k) + C1 R1 (localbest_ij(k) − x_ij(k)) + C2 R2 (globalbest_ij(k) − x_ij(k))   (7)

x_ij(k + 1) = x_ij(k) + V_ij(k + 1)   (8)

lbest(k + 1) = x_i(k + 1) if f(x_i(k + 1)) < f(lbest(k)), otherwise lbest(k)   (9)

gbest(k + 1) = arg min f(lbest(k + 1))   (10)

Section 4 below discusses the proposed VM consolidation algorithm obtained by hybridizing the CS and PSO algorithms.

4 Proposed VM Consolidation Algorithm by Using Hybridization of CS and PSO Algorithms
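The algorithm listing itself is reproduced in the paper as a figure. A minimal, self-contained sketch of the hybrid loop of Eqs. 5–10 on a toy objective is given below; the swarm size, iteration count, Lévy-step approximation, and coefficients C1 = C2 = 2 are illustrative assumptions, not the authors' settings.

```python
import random

def levy_step(lam=1.5):
    """Crude Lévy-distributed step (Eq. 6): heavy-tailed s ~ s^(-lam),
    drawn by inverse-transform sampling, with a random sign."""
    u = random.random() or 1e-12
    return random.choice((-1, 1)) * u ** (-1.0 / lam)

def hybrid_cspso(f, dim=2, n=10, iters=50, b=0.01, c1=2.0, c2=2.0):
    random.seed(7)
    x = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    v = [[0.0] * dim for _ in range(n)]
    lbest = [p[:] for p in x]            # per-particle (local) best
    gbest = min(lbest, key=f)[:]         # Eq. 10
    for _ in range(iters):
        for i in range(n):
            # CS local search: Lévy-flight move (Eq. 5), keep if better
            y = [x[i][d] + b * levy_step() for d in range(dim)]
            if f(y) < f(x[i]):
                x[i] = y
            # PSO global search: velocity and position updates (Eqs. 7-8)
            for d in range(dim):
                v[i][d] += (c1 * random.random() * (lbest[i][d] - x[i][d])
                            + c2 * random.random() * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            if f(x[i]) < f(lbest[i]):    # Eq. 9
                lbest[i] = x[i][:]
        gbest = min(lbest, key=f)[:]     # Eq. 10
    return gbest

sphere = lambda p: sum(t * t for t in p)
best = hybrid_cspso(sphere)
print(sphere(best) < sphere([5.0, 5.0]))  # True: the swarm improved
```

In the paper's setting, a candidate position would encode a VM-to-host assignment and f would combine makespan and energy consumption (Eq. 4) rather than this toy sphere function.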

5 Simulation and Results

We simulated using Cloudsim [3]. For this simulation, we considered 10 data centers, 500 physical hosts, and 1000 VMs, each VM with a capacity of 2048 MB, with Xen as the hypervisor. We used 1000 cloudlets, generated randomly in Cloudsim, and evaluated the approach against the existing PSO and CS algorithms.

5.1 Evaluation of Makespan

For 100 tasks, the PSO, CS, and CSPSO algorithms have makespans of 1358.7, 1311.5, and 1285.46 respectively. For 500 tasks, they have makespans of 1845.8, 1756.8, and 1723.78 respectively. For 1000 tasks, they have makespans of 2520.5, 2146.9, and 2056.78 respectively. From Table 1 and Fig. 2, we can see that our proposed approach outperforms the existing CS and PSO algorithms in terms of makespan.

Table 1 Evaluation of makespan

Tasks   PSO      CS       Proposed CSPSO
100     1358.7   1311.5   1285.46
500     1845.8   1756.8   1723.78
1000    2520.5   2146.9   2056.78

Fig. 2 Evaluation of makespan

Table 2 Evaluation of energy consumption

Tasks   PSO    CS       Proposed CSPSO
100     1145   947.8    925
500     2435   2097.8   1989.9
1000    3256   2476     2378

Fig. 3 Evaluation of energy consumption

5.2 Evaluation of Energy Consumption

For 100 tasks, the PSO, CS, and CSPSO algorithms consumed 1145, 947.8, and 925 W, respectively. For 500 tasks, they consumed 2435, 2097.8, and 1989.9 W, respectively. For 1000 tasks, they consumed 3256, 2476, and 2378 W, respectively. From Table 2 and Fig. 3, we can see that our proposed algorithm outperforms the existing PSO and CS algorithms by minimizing energy consumption.
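Host energy in CloudSim-style studies is commonly derived from a linear utilization-to-power model accumulated over scheduling intervals. The paper does not state its exact model, so the formula, wattage figures, and interval length below are illustrative assumptions:

```python
def host_power(utilization, p_idle=162.0, p_max=215.0):
    """Linear power model: idle draw plus utilization-proportional dynamic draw (watts).
    p_idle and p_max are illustrative figures for a commodity server."""
    return p_idle + (p_max - p_idle) * utilization

def total_energy(utilizations, interval_s=300.0):
    """Energy over fixed-length scheduling intervals, converted to watt-hours."""
    return sum(host_power(u) * interval_s for u in utilizations) / 3600.0

print(round(total_energy([0.2, 0.5, 0.9]), 2))  # -> 47.57
```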

6 Conclusion and Future Works

VM consolidation is a prodigious challenge in cloud computing, as VMs need to be dynamically consolidated onto physical machines based on each machine's utilization. In this paper, we have used a VM status index that identifies the load on the VMs; based on that load, the algorithm decides how to map the VMs onto the respective physical machines. We have used the hybridized CSPSO algorithm to appropriately allocate VMs onto physical machines, thereby minimizing energy consumption and makespan. In the future, we plan to evaluate the algorithm's efficiency against a grey wolf optimizer.

References

1. F. Liu, J. Tong, J. Mao, R. Bohn, J. Messina, L. Badger, D. Leaf, NIST cloud computing
reference architecture. NIST Spec. Publ. 500, 1–28 (2011)
2. M.S. Sudheer, M. Vamsi Krishna, Dynamic PSO for task scheduling optimization in cloud
computing. Int. J. Recent Technol. Eng. 8(2), 332–338 (2019)
3. R.N. Calheiros et al., CloudSim: a toolkit for modeling and simulation of cloud computing
environments and evaluation of resource provisioning algorithms. Softw. Pract. Exp. 41(1),
23–50 (2011)
4. S.K. Mishra et al., Energy-efficient VM-placement in cloud data center. Sustain. Comput.
Inform. Syst. 20, 48–55 (2018)
5. M. Abdel-Basset, L. Abdle-Fatah, A.K. Sangaiah, An improved Lévy based whale optimization
algorithm for bandwidth-efficient virtual machine placement in cloud computing environment.
Clust. Comput. 22(4), 8319–8334 (2019)
6. E. Barlaskar, N. Ajith Singh, Y. Jayanta, Energy optimization methods for virtual machine
placement in cloud data center. ADBU J. Eng. Technol. 1 (2014)
7. A. Tripathi, I. Pathak, D.P. Vidyarthi, Modified dragonfly algorithm for optimal virtual machine
placement in cloud computing. J. Netw. Syst. Manage. 28, 1316–1342 (2020)
8. A. Tripathi, I. Pathak, D.P. Vidyarthi, Energy efficient VM placement for effective resource
utilization using modified binary PSO. Comput. J. 61(6), 832–846 (2018)
9. S. Gharehpasha, M. Masdari, A. Jafarian, Virtual machine placement in cloud data centers
using a hybrid multi-verse optimization algorithm. Artif. Intell. Rev. 1–37 (2020)
10. S. Gharehpasha, M. Masdari, A discrete chaotic multi-objective SCA-ALO optimization algo-
rithm for an optimal virtual machine placement in cloud data center. J. Ambient Intell. Humaniz.
Comput. 1–17 (2020)
11. E. Barlaskar, Y.J. Singh, B. Issac, Enhanced cuckoo search algorithm for virtual machine
placement in cloud data centres. Int. J. Grid Util. Comput. 9(1), 1–17 (2018)
12. S. Mangalampalli, V.K. Mangalampalli, S.K. Swain, Energy aware task scheduling algorithm
in cloud computing using PSO and cuckoo search hybridization. Solid State Technol. 63(6),
13995–14010 (2020)
13. R. Chi et al., A hybridization of cuckoo search and particle swarm optimization for solving
optimization problems. Neural Comput. Appl. 31(1), 653–670 (2019)
Customer Segmentation via Data Mining
Techniques: State-of-the-Art Review

Saumendra Das and Janmenjoy Nayak

Abstract Customers are more vigilant, intelligent, and dynamic in society. They change their preferences and habits according to their needs. Knowing the needs of customers is an important part of marketing, where a company must discover its loyal customers within this heterogeneity. The concept of dividing a heterogeneous market into homogeneous groups is termed customer segmentation. Customer segmentation is an integral part of marketing through which companies, holding a huge set of customer data in an organized manner, can easily develop relationships with customers. Understanding the customer's hidden knowledge is a resourceful idea of computational analysis, where accurate information can be optimized for the taste and preference of the customer. This type of computational analysis is termed data mining. This paper presents a systematic review of customer segmentation via data mining techniques, covering the supervised, unsupervised, and other data mining techniques used in segmentation.

Keywords Customer segmentation · Data mining · Supervised · Unsupervised

1 Introduction

Understanding consumer behaviour is a resourceful idea in marketing that makes


customers profitable. Manufacturers always provide high-quality goods or services for customers, fulfilling their needs and wants by providing adequate knowledge. Basically, the needs and wants of customers are closely observed through their habits and preferences. So, knowledge is an important asset for companies aiming to make customers loyal. Any marketer should assemble this information seamlessly to satisfy customers, providing customized services at each point of delivery to avoid negative

S. Das
Department of MBA, Aditya Institute of Technology and Management (AITAM), Tekkali 532201,
India
J. Nayak (B)
Department of Computer Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University,
Baripada, Odisha 757003, India
e-mail: jnayak@ieee.org

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 489
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_38

reaction from consumers [1]. Over the years, consumers’ behaviour has changed
continuously. Now, consumers are more volatile than before. Often, they change
their habits and preferences. Therefore, it is impossible for a seller or manufacturer
to identify the consumer’s needs and wants in the mass markets. The idea of dividing
the market into various groups or sub-groups is typically known as segmentation.
The concept of segmentation has been justified and explained by different experts as a way to identify the needs and wants of customers rationally. This strategic application of market targeting helps anticipate consumer reactions, because consumers may have varied preferences for goods or services according to their profiles [2]. Nevertheless, the selection of segmentation techniques consistently depends on the input variables, such as the geographic, demographic, behavioural, or psychological profile of consumers, forecasted with statistical or non-statistical approaches.
According to Smith [3], segmentation is a distinctive marketing strategy closely
associated with product differentiation and homogeneity. The customer may obtain
a variety of alternatives from manufacturers. In this diversified market structure,
manufacturers may get confused about selecting or retaining customers. To attract and retain customers, marketers often adopt selective techniques through advertising or sales promotion rather than understanding the customer's motives. In the generalized mass market, it is difficult to identify the needs and wants of the
customer through all kinds of promotional techniques. Therefore, customer segmen-
tation could be a choice for the marketer to provide preferential goods or services to
the customer. The basic idea of customer segmentation is to cluster/group customers
to identify, understand and target their needs. This concept of customer segmentation
was initially introduced by Smith in 1956 as an unconventional technique for product
differentiation strategy. A segment or group of customers can be depicted as a set of
customers who have similar types of demographic, psychological, and behavioural
profiles [4]. Now the selection of segmentation techniques is a sophisticated area of
research in this information and communication age, particularly in the areas of data
mining (DM) and database management systems (DBMS). With today's huge data sets, traditional market forecasting techniques are becoming ineffective. Several statistical techniques, like multivariate analysis, time series analysis, and so on, also fail to perform satisfactory clustering or segmentation. In this connection, a new form
of knowledge management technologies with soft computing and hard computing
like data mining, machine learning, artificial intelligence, etc. will definitely solve
market-related problems [5].
In this competitive world, today, most sellers want to know the needs and pref-
erences of the customer. Now they profusely maintain good relationships with
customers at every stage of business operations. The concept of maintaining a
good relationship with the customer is known as customer relationship management
(CRM). This theory of customer relationship management is becoming an integral
part of marketing strategy. With the proliferation of the Internet, the idea of relation-
ship management has become popular due to several computational approaches. The
company and customers can easily interact and understand each other by learning
the hidden knowledge from the enormous quantity of data. The concept of under-
standing and analysing the hidden knowledge of the customer is data mining. Data
mining is a computational analysis process that discovers the consumer’s taste and
preferences through customer segmentation, dividing huge sets of data [6]. The data
mining approach is also useful for manufacturers who have lost their quality when
the products decay. In this case, the recency, frequency, and monetary (RFM) form of
segmentation failed to quantify the exact preference rather than other methods like the
Fuzzy Analytic Network Process (FANP) [7]. Sometimes, data mining techniques
are useful for profiling the customer base, targeting, aligning the right channels,
cross-selling products, enhancing customer relationships and providing value to the
customer [8]. However, prioritising the customer within the existing customer base is
also an important technique in data mining. To improve the service quality and effec-
tiveness of the product, importance-performance analysis (IPA) is also a part of data
mining [9]. Customer segments are highly volatile; they may change according to the
preference of the customer, which creates confusion about the re-computation of data.
These uncertainties require streaming of data in a proper form where data mining
helps to cluster the data. As a result, customer segmentation performs continuously
[10]. Data mining techniques are also predicting the future probability and behaviours
that allow businesses to be more practical and knowledge-driven [11]. Data mining
techniques also provide the advantage of customer segmentation functions [12]. Data
mining also classifies blogs into supervised and unsupervised learning models for
extracting knowledge from voice over the Internet protocol [13].
After a meticulous review of 550 pieces of academic literature, 57 research articles and
17 conference papers were considered in this review process. This paper discusses
customer segmentation via data mining techniques from a review perspective. This
paper is a systematic investigation into supervised, unsupervised and other data
mining techniques. The supervised approaches, such as neural networks, naive
Bayes, linear regression, logistic regression, support vector machine (SVM), K-
nearest neighbour, boosting and decision tree (DT), hidden Markov model (HMM),
and random forest have an enormous contribution to object detection and classifica-
tion. In unsupervised approaches, complex classification of data, identification and
processing of variables have more emphasis through K-means clustering, K-nearest
neighbours (KNN), hierarchal clustering, anomaly detection, neural networks, prin-
ciple component analysis, independent component analysis, apriori algorithm, etc.
Some of the research articles on other data mining techniques, such as chi-square
automatic interaction detector (CHAID), RFM, genetic algorithm (GA), and logistic
regression, etc., have revealed classification and relationship management. The paper
is organised into 5 sections. Section 2 presents various issues involved with customer
segmentation. Section 3 explains various segmentation techniques. Section 4 provides a critical investigation. Section 5 presents the discussion and conclusion.

2 Various Issues Involved with Customer Segmentation

Consumers have different needs and expectations as per their characteristics.


In consumer behaviour research literature, we can observe several segmen-
tation variables, such as demographic, geographic, psychographics, decision-
making, behavioural, purchase behaviour, personality, lifestyle, situation factors, etc.
However, from a broader perspective, the researcher classified the customer segmen-
tation into four major areas, such as geographic characteristics, demographic profile,
psychographic profile, and behavioural aspects. On the other hand, some researchers
have classified it into two distinct forms. They are observed and unobserved vari-
ables. The observed variables, in general, are geographic, demographic, and socio-
economic, whereas purchase frequency and customer loyalty are considered as
product-specific or brand-specific variables. Sometimes, variables like lifestyle and
psychographics are unobserved in general and product benefits, intention, preference,
etc. are considered product-specific [5]. So, customer segmentation is an emerging area of research with several issues reflecting consumer behaviour towards a product or brand. In the decision-making process, customer segmentation is an integral part
of the marketing strategy which builds customer relationships, segregates customers
into different groups, and provides different facilities in the niche market. In partic-
ular, for mobile users, it develops VIP customer segmentation which can easily
identify their needs and facilitate the service [14].
The rapid development of computer technologies across the globe has changed
the tastes of telecom subscribers. Now it is high time for a telecom company to under-
stand the characteristics of the consumer to provide distinct services. Segmentation
is the only way to cluster customers into different bases and provide the service to
attract and retain customers [15]. Customer segmentation is also important in the
retail sector today. With the huge quantity of customer data, a retail firm may not be
able to keep the customer informed. So, data mining techniques could help to mine
the data among lost customers and help the retailer to build customer relationships
[16]. In this regard, customer segmentation will provide a wealth of information about
customers. Customer segmentation is the strategic resource for an enterprise to gain
competitive advantages and make customers profitable [17]. Segmentation is important for delivering customer lifetime value (LTV), but intense competition has made this notion less clear-cut. Therefore, customer values like current value,
potential value, and customer loyalty will be an important asset for any marketer to
understand the customer better [18]. Customer segmentation can also classify customer value via the RFM model and rough set theory (RST) to understand the customer and maintain the relationship [19]. According to the previous literature, segmentation has various critical issues like problem recognition, design of the research,
data collection, data analysis, and implementation [5]. Table 1 depicts the major issues related to customer segmentation and the considerations needed to counter them.
Customer segmentation offers a tactic decision for supporting services and prof-
itability for businesses. It supports all kinds of business decisions for financial growth
and development. Therefore, making a good customer segmentation method is a

Table 1 Issues related to customer segmentation

Issues of customer segmentation   Major considerations
Recognize the problem             • Segmentation concept
                                  • Information related to the customer
                                  • Classification of the variables
                                  • Customer segmentation base selection
                                  • Finance and other limitations
Design the research               • Collection of data
                                  • Instrument validity
                                  • Objectives of segmentation
                                  • Stability of variables
Data collection                   • Source of data
Data analysis                     • Data analysis and classification of segmentation
                                  • Clustering data sets
                                  • Reliability and validity of information
Implementation                    • Implement on target customers
                                  • Select segments

systematic way of defining the tools that help the business to grow and develop. Selecting the right tools is therefore a cross-functional effort aligned with the business goal. Customer segmentation has pros and cons while classifying customers into different profiles. At its best, it procures, retains, and attracts customers, clustering them according to market demand. However, it succeeds only when data interpretation, knowledge discovery, and information dissemination are done accurately; with inexact information, it is not effective. The manual process of segmentation is time-consuming, unscalable, and not agile, and therefore cannot support one-to-one marketing. With the help of the latest technologies like data mining, artificial intelligence, machine learning, etc., accurate segmentation is possible and makes the customer profitable.

3 Segmentation Techniques

In general, customer segmentation involves a broad variety of techniques, such as cluster analysis [10], cluster-wise regression, AID/CHAID, multiple regression, discriminant analysis, latent class structures, inductive learning techniques, soft computing techniques [5], and data mining (detailed in the next section), which are used under different market conditions. However, it is difficult to classify groups of customers according to their attributes, so we have to consider the classical method. In the classical theory, some researchers emphasized a data preparation framework and a data analysis framework, which include supervised, unsupervised

Fig. 1 Customer segmentation techniques (segmentation techniques branch into a data preparation framework and a data analysis framework; the data analysis framework comprises supervised, unsupervised, and other data mining approaches)

and other methods of data mining approach (Fig. 1). Most of the techniques related
to artificial neural networks (ANNs), fuzzy logic (FL), machine learning (ML), RST
and evolutionary methods (EM) such as GA are the main data mining tools to analyse
data perfectly. These technologies have been widely used in data preparation and data
analysis. It is a challenging task for modern marketing professionals to consider the
right technique or algorithm. Most of these algorithms have significant advantages
and disadvantages also. To avoid this problem, researchers should consider either
a supervised or unsupervised approach. The supervised approach is a classifica-
tion method where the inputs and outputs are mapped properly. In the supervised
approach, all the common algorithms, i.e. support vector machines, logistic regression, artificial neural networks, naive Bayes, and random forests, work effectively. These approaches follow a hierarchical process to maintain a good relationship between input and output datasets. The unsupervised approaches inherently cluster the data. Some familiar algorithms include k-means clustering, principal
component analysis, and auto encoders. Since no labels are provided, there is no
specific way to compare model performance in most unsupervised approaches. In
this connection, DM techniques using neural networks, decision trees, genetic algorithms, fuzzy logic, and K-nearest neighbour can predict, comprehend, and cluster customers properly [20]. Besides the non-traditional methods, some
traditional techniques like self-organizing maps (SOM) can also be used to make
segmentation. In this approach, a set of initial cluster prototypes are made before
applying the K-means to get the final clusters of data sets through near visualization.
Some researchers note that the U-matrix is also one of the best options for clustering the data and analysing the results via the number of hits.
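The K-means step described above (initial prototypes refined into final clusters) can be sketched as follows; the two-dimensional toy points stand in for customer features and are purely illustrative:

```python
def kmeans(points, k, iters=20):
    """Plain K-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = points[:k]  # simple deterministic initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid by squared Euclidean distance
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious customer groups: low spenders and high spenders.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.1), (8.0, 8.5), (8.3, 8.0), (7.9, 8.8)]
centroids, clusters = kmeans(points, k=2)
```

In SOM-based segmentation, the map's prototype vectors would play the role of `points` here, with K-means producing the final clusters.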

3.1 Data Preparation Framework

Data preparation is a systematic way of transforming raw data into a basic form
of data for predictive analysis to remove errors or mistakes. Data preparation is a
challenging task to acquire proper prediction analysis. It uses automatic search like
grid and random search to find unity in data preparation. Often it is difficult to gather a variety of data. For example, for classification and regression, data stored in a CSV file consist of rows, columns, and values for any data preparation method. However, most authors have articulated that data preparation techniques
are inferred using statistical and non-statistical techniques. Statistical techniques like
exploratory factor analysis and correspondence analysis; and computational tech-
niques such as soft computing tools (e.g. RST or GA) are typically used in data
preparation [5]. Exploratory factor analysis (EFA) is a common statistical method
applicable to multivariate statistics to uncover a relatively large set of data. Most of
the time, researchers use this technique for scaling the data sets through the question-
naire. EFA is accurate as each factor is symbolized by multiple measured variables.
EFA is based on common factors, unique factors, and errors of measurement. With
this EFA model, we can easily identify the common factors and other related manifest
variables. The correspondence analysis (CA) is an expansion of principal compo-
nent analysis appropriate for discovering relationships amongst qualitative variables
(or categorical data). Like principal component analysis, it also offers a solution for
summarizing and visualizing the data in two-dimension plots. Correspondence anal-
ysis is a significant form of geometric approach for visualizing rows and columns
of a two-way contingency table appropriately. The main aim of this tabular form is
to provide a global view of the data for easy interpretation. However, these statis-
tical techniques have been replaced by soft computing to segment or classify the
data and provide accurate results. In particular, soft computing (SC) is an improved
technique over conventional traditional systems, complementing hard computing.
It has many intelligent and user-friendly features. Soft computing consists of FL,
ANNs, RST, and EM. The principal aim of soft computing is to eliminate the uncertainty and vagueness of data through fuzzy tools and EM, which are involved in the optimization and searching process. Furthermore, ANNs and RST will solve the
classification and rule generation problems. Recently, soft computing technologies
have been used for resolving data mining problems. Soft computing is widely used
for the analysis and interpretation of data. RST is mathematical computation and
granular approximation which discovers the hidden pattern in an uncertain environ-
ment widely used in soft computing. Therefore, soft computing is a computational
method that is useful for data preparation.

3.2 Data Analysis Framework

Segmenting the customer into different groups, such as geography, demography,


psychographic, and behavioural, is an easy form of classification of customer data
to analyse the customer’s needs and expectations. There are various approaches
to classifying the market into different groups, popularly known as cluster anal-
ysis. In an article, Calantone and Johar [21] proposed that cluster analysis could
classify customer data explicitly. They proposed that customer benefits should be analysed properly in the tourism industry, where marketing strategy formulation, such as understanding customers, product positioning, advertising copy testing, and new market development, helps to establish the market. However, statistical methods like factor analysis may extend the resultant output. In
this context, computational approaches like supervised, unsupervised and other data
mining approaches are widely used for data analysis.

3.2.1 Supervised Approach

A supervised approach is a systematic application of artificial intelligence (AI) where


a computer algorithm is explicitly trained on input data for an expected output. It creates labelled data according to the specific question asked by the customer. The
supervised approach is also the finest learning approach for machine learning, useful
for forecasting financial results, identifying fraud, recognizing objects in images, and
also evaluating risk. In a supervised approach, the input and output data are known
in advance for better prediction with the appropriate classification. Object detection
is one of the important aspects of the supervised approach to computer vision. The
classical object detection approaches, such as background subtraction and saliency
detection, do not have manual collection and labelling of samples. Generally, they
do not train the samples for the classification of labelled data like the supervised
approach. But they are often affected by noise issues such as changes in luminance and cluttered backgrounds. On the other hand, supervised approaches
like support vector machine, boosting and decision tree have good performance in
object detection, but they need substantial human effort to label the training data. In this connection, Wang et al. [22] developed a model to avoid manual
detection of objects or videos where the extension of the boosting algorithm (soft
label boosting) will help to train the samples with a soft (probabilistic) label in place
of a hard (binary) label. Tracking the emotions in the images or video clips is also
an important feature of the supervised approach.
In their paper, Malandrakis et al. [23] proposed an emotion tracking system for movies, where the valence-arousal scale was detected through a continuously annotated database. A supervised approach using hidden Markov models in each dimension was proposed; the HMMs predict the arousal and valence features in the movie, capturing and detecting emotions at a fine granularity. However, evaluation of the supervised approach is also important

for image segmentation with the use of a proper algorithm [24]. Sentiment anal-
ysis (SA) is a newly emerged research topic which unlocks a new future for busi-
nessmen, writers, and bloggers. It is an emerging form of computational algorithm
to understand the percentage of product acceptance and rejection where the business
acumen builds up their strategy to improve the product performance. In this regard,
opinion mining makes it possible to find the exact intention of the customer through supervised machine learning models [25]. The supervised approach also detects the musical boundaries between verse and chorus segments; here, perceptual aspects such as timbre, harmony, melody, and rhythm are modelled through boosting [26].
Graph-based spectral algorithms are a recent research topic, detecting image objects through clustering within a meaningfully enlarged structure [27]. The
fault diagnosis system (FDS) is also an improved method of supervised learning
using a support vector machine for appropriate decision-making [28]. The decom-
position of nuclear waste objects through robotics is a matter of concern where the
RGBD-based detection and categorization is applied by a deep convolutional neural
network (DCNN) from unlabelled RGBD videos. It helps to make an object detection
benchmark to recognize waste objects perfectly [29]. In this connection, supervised learning provides leading algorithms for identifying, clustering, and recognizing data so as to perceive individual customer expectations. This type of segmentation will help researchers and business leaders improve product quality and meet the needs of customers.
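As a concrete instance of the supervised setting described in this section, where labelled input-output pairs drive prediction, here is a minimal K-nearest-neighbour classifier; the feature names and segment labels are illustrative assumptions:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled points."""
    by_dist = sorted(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Labelled customers: (annual spend, visits per month) -> segment label
train = [((1.0, 2.0), "occasional"), ((1.4, 1.5), "occasional"), ((1.1, 2.2), "occasional"),
         ((8.0, 9.0), "loyal"), ((8.5, 8.2), "loyal"), ((7.8, 8.9), "loyal")]
print(knn_predict(train, (7.5, 8.0)))  # -> loyal
```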

3.2.2 Unsupervised Approach

An unsupervised approach is a form of an algorithm that learns patterns from unla-


belled data. In particular, it captures patterns such as neural prediction or probability density. It develops generative content through the internal representation of data.
Unlike the supervised approach, it has no human interaction, rather segmentation of
data by neural network and probabilistic method. It finds an interesting pattern from
various unlabelled sensor data without prior information. One of the popular tech-
niques of the unsupervised approach is data mining for the activity recognition task.
Though it involves no human interaction, the classification of complex data can be effective in the customer segmentation process through pattern recognition. Data sets with many features and few instances often pose a relatively challenging task for machine learning. However, with these multiple features of data sets, there may
be irrelevant or redundant information that causes damage in terms of correctness
or training time. To deal with these complex situations, the feature selection (FS)
and feature discretization (FD) methods will be helpful to recognize the data sets.
In particular, in the pre-processing stage, some classification algorithms deal with
discrete features where the FD technique finds the representation of each feature. On
the other hand, FS is aiming at dropping features to target the curse of dimensionality
problems, often permitting learning algorithms to be better-performing classifiers.
Therefore, feature discretization-based algorithms could reduce the redundancy and
classify the data set [30]. In an unsupervised approach, fuzzy-based clustering is

evaluated through the fuzzy joint points (FJP) method where the data set is classified
in hierarchical order [30].
DNA array analysis is a functional algorithm to measure the expression of multiple
genes in an unsupervised approach. Just like supervised learning, a two-way clus-
tering framework is also able to identify gene patterns and perform cluster discovery
on samples where connectivity among the groups of genes could be possible [31].
Speech recognition and grouping of voices through co-channel (two-talker) speech
separation is also a part of the unsupervised learning approach. For voice segre-
gation and segmentation of speech, a differential algorithm like tandem will work
to separate the unvoiced speech [32]. This unsupervised approach is also applied
for the summarization of opinions. The state-of-the-art algorithm has been used
in this process where the summarization method is informative and readable [33].
This approach also performs human activity recognition from raw wearable-sensor data to identify expectations [34]. Segmentation and classification of multidimensional time series are possible using the hidden Markov model, which predicts human activity accurately. Automatic document summarization is a recent development where the algorithms
classify the data into words, sentences, and phrases and finally process the docu-
ment. It also observes the relevancy, redundancy, and length of the document while
summarizing it [35]. Most researchers used the unsupervised learning approach for
different perspectives, such as facial landmark detectors, protocol features of word
extraction, product attribute extraction, clusters of pixel images, and so on.
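Frequent-pattern discovery with the apriori algorithm, one of the unsupervised tools listed in the introduction, can be sketched by brute-force itemset counting; a full Apriori implementation adds level-wise candidate pruning, and the toy baskets below are illustrative:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count itemsets of size <= max_size appearing in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(set(t)), size):
                counts[combo] += 1
    return {iset: n for iset, n in counts.items() if n >= min_support}

baskets = [["milk", "bread"], ["milk", "eggs"], ["bread", "eggs", "milk"], ["eggs"]]
freq = frequent_itemsets(baskets, min_support=2)
# e.g. ("bread", "milk") is frequent; ("bread", "eggs") occurs only once and is dropped
```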

3.2.3 Other Data Mining Approaches

In recent years, customer segmentation in direct marketing has become more effective
with the development of database marketing techniques. Such data
mining approaches help direct marketers segment customers more precisely and
apply a differentiated marketing strategy to each segment. Data mining approaches such as
CHAID, RFM, GA, and logistic regression have been used as analytical tools for direct
marketing segmentation with two types of data sets. It was found that, amongst all the
approaches, RFM performed best, although CHAID is also an effective solu-
tion for segmenting the data into sequences. So an empirically based RFM approach
could replace both CHAID and logistic regression in database marketing systems
[36]. Therefore, it can be observed from several studies that RFM technology has
been used widely to segment customers and access their information. The marketing representatives
of commercial banks can use k-means clustering to identify
potential customers. To obtain useful information from the customer, four types of
data mining methods, such as neural network, C5.0, classification and regression tree,
and chi-squared automatic interaction detector, will definitely be helpful to detect
the background information for credit card holders [37, 38]. Market segmentation
has a key role in continuing the relationship with a loyal customer. In this regard,
there must be a correlation between the retailer and the customer. By the use of the
Customer Segmentation via Data Mining … 499

divisive cluster analysis technique of data mining, the retailer can find all kinds of
information from the customer database [39].
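As a concrete illustration of the RFM idea discussed above, the sketch below scores each customer on Recency, Frequency, and Monetary value by quantile binning; the toy transactions, function names, and three-bin scheme are illustrative assumptions, not a prescribed implementation.

```python
from datetime import date

def rfm_scores(transactions, today, n_bins=3):
    """Score each customer 1..n_bins (higher = better) on Recency, Frequency, Monetary.

    transactions: iterable of (customer_id, purchase_date, amount).
    """
    stats = {}
    for cust, day, amount in transactions:
        last, freq, mon = stats.get(cust, (date.min, 0, 0.0))
        stats[cust] = (max(last, day), freq + 1, mon + amount)

    def bin_score(values, reverse=False):
        # rank customers by the measure, then cut the ranking into n_bins groups
        order = sorted(values, key=lambda kv: kv[1], reverse=reverse)
        return {cust: 1 + (i * n_bins) // len(order)
                for i, (cust, _) in enumerate(order)}

    # fewer days since the last purchase is better, hence the descending sort
    r = bin_score([(c, (today - s[0]).days) for c, s in stats.items()], reverse=True)
    f = bin_score([(c, s[1]) for c, s in stats.items()])
    m = bin_score([(c, s[2]) for c, s in stats.items()])
    return {c: (r[c], f[c], m[c]) for c in stats}

txns = [("a", date(2021, 6, 1), 50.0), ("a", date(2021, 6, 20), 80.0),
        ("b", date(2021, 1, 5), 20.0), ("c", date(2021, 5, 1), 500.0)]
print(rfm_scores(txns, today=date(2021, 7, 1)))
# → {'a': (3, 3, 2), 'b': (1, 1, 1), 'c': (2, 2, 3)}
```

Customers with a triple like (3, 3, 2) are recent, frequent, and reasonably high-spending; low triples such as (1, 1, 1) mark lapsed customers, which is the segmentation the studies above exploit.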
The advent of technology for data optimization and screening is an important
technique for data mining that mines vast data sets and classifies the market accord-
ingly. In particular, ANN and particle swarm optimization (PSO) methods are recent
developments for market decision strategy. With the integration between statistical
analysis and particle swarm optimization, we can reduce redundant data and segment
the market properly [40]. Data mining techniques have become an indispensable
method in market segmentation. The classification of larger data sets from databases
is a recent form of market research where some intelligent solutions, such as neural
networks, evolutionary algorithms (EA), fuzzy theory, RFM, hierarchical clustering,
K-means, bagged clustering, kernel methods, Taguchi method, multidimensional
scaling, model-based clustering, rough sets, and others, will be very effective and
time-bound [41]. Clustering is thus an important feature of data mining techniques,
where latent class analysis (LCA), a priori clustering, and various similarity or
distance measures are used to segment large groups of customers according to
individual expectations [42]. From the various research articles, we can confirm
that data mining is widely used for the exploration and prediction of expected
outcomes in the heterogeneous market. Data mining is used for classification,
clustering, association, and sequential analysis. In this regard, certain statistical
applications such as regression, time series, association and sequential analysis will
be beneficial for mining large data sets [43].
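To make the clustering step concrete, here is a minimal plain k-means sketch of the kind referred to above (e.g. in [40, 41]); the toy customer features and parameter choices are assumptions for illustration only.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(pt, centroids[j])))
                  for pt in points]
        # update step: move each centroid to the mean of its members
        new = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            new.append(tuple(sum(dim) / len(members) for dim in zip(*members))
                       if members else centroids[j])
        if new == centroids:  # converged: assignments can no longer change
            break
        centroids = new
    return centroids, labels

# Toy customer features: (annual spend in kUSD, store visits per month)
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
             (8.0, 6.0), (8.2, 5.8), (7.9, 6.2)]
cents, labels = kmeans(customers, k=2)
print(labels)  # the three low-spend and three high-spend customers separate
```

The PSO-assisted variant cited above [40] replaces the random initialization with particle swarm search over the centroid positions; the assignment and update steps stay the same.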

4 Critical Investigation

Customer segmentation is an integral approach to targeting the customer and positioning
the brand in the customer's mind. Several approaches, including supervised,
unsupervised, and other data mining techniques, have been applied by various
researchers since 1990 to classify and cluster large data sets for customer
segmentation. In this paper, we
have extracted articles from various online bibliographies of academic articles on
customer segmentation, such as the ABI/INFORM database, ScienceDirect, Emerald,
IEEE Transactions, JSTOR, Springer, Google Scholar, and the Wiley Online Library. The
academic articles were searched with keywords such as customer segmentation, market
segmentation, and customer segmentation and data mining. Among 550 articles,
the relevant literature on customer segmentation using data mining techniques has
been considered for this state-of-the-art review. In this paper, we considered
some 57 articles and 17 conference papers. After detailed observation of the literature, it
was found that data mining techniques like K-means, RFM, GA, and other algorithms
are used for the classification of large data sets to target customers and create
a meaningful marketing strategy.

4.1 Impact of Segmentation Variables

Consumers exhibit an extensive variety of characteristics. Based on these characteristics,
we can identify several segmentation variables, such as geographic, demographic,
firmographic, decision-making processes, situational factors, personality, profitability,
benefits sought, and so on. According to Kotler and Keller [44], segmentation
variables are classified into four important areas, such as demographic features,
geographic characteristics, and psychographic and behavioural variables. On the other
hand, several authors have articulated levels of variables, e.g. general,
domain-specific, and brand-specific variables, as well as the objective or subjective
nature of the variables. The number of variables developed over different periods
poses a massive challenge: too many have been proposed for a marketer
to empirically compare them all when trying to segment a market.
In this regard, the classification can be broadly divided into general observed vari-
ables (e.g. geographic features, demographic profile, socio-economic variables) and
unobserved variables (e.g. lifestyle and psychographics); product-oriented observed
variables (e.g. usage frequency and loyalty) and product-oriented unobserved vari-
ables (e.g. benefits, preferences, and intentions) [45]. Therefore, the selection of the
proper segmentation variable is a significant point to consider. In his article, Wind [46]
articulated that most of the segmentation studies were on consumer goods. However,
the process of segmentation is also applicable in the industrial market. So, before
selecting an appropriate segmentation method, we must think about the problems
and prospects of segmentation. To select the proper method, the choice between a priori
segmentation designs and cluster-based designs is most essential. In a priori segmentation
designs, the marketer segments by product purchase, loyalty, and type of
customer, whereas cluster-based designs derive segments from the benefits, needs, and
attitudes of customers. Further, the advantages and disadvantages of each segmentation
approach must also be weighed. After reviewing the academic literature, we found that
a variety of segmentation models are of comparable importance; the segmentation
method must nevertheless be selected carefully based on management's specific objectives and
also on current trends in the consumer market (Table 2; Fig. 2).

4.2 Model Reliability in Segmentation

Despite the importance of segmentation analysis on different data sets, little attention
has been paid to checking reliability and validity. Some variables, like
demographics (age, gender, income, religion, etc.), are more reliable than behavioural
or psychological characteristics. In the case of an attitude survey in particular, proper
care should be taken to test the reliability of the data. Statistical measures like
factor analysis, conjoint analysis, correlation, and the component matrix will be
beneficial for this purpose. However, these traditional methods may not provide
accurate results because of several exceptions related to the number of items. In this
connection, perceptual studies, such as a generalization of the data, could provide
better analytical results. Therefore, there is a need for instrument development in
data reliability [46].

Table 2  Impact study of segmentation variables

  Types of segmentation                     Focus area       References
  General observable variables              Demographics     [47–50]
                                            Socio-economic   [51, 52]
                                            Behavioural      [53, 54]
                                            Cultural         [55, 56]
  General unobservable variables            Lifestyle        [50, 57–60]
                                            Psychographic    [61–64]
  Product-specific observable variables     Usage frequency  [65]
                                            Loyalty          [66–68]
  Product-specific unobservable variables   Benefits         [69, 70]
                                            Attitude         [71, 72]

Fig. 2  Types of customer segmentation variables (pie chart; the four variable types account for shares of 36%, 36%, 14%, and 14% of the reviewed studies)

Commonly, there are two potential approaches to
measure the reliability, such as degree of consistency and cross-validation [73]. The
former approach can be executed through clustering or classification of data sets,
which requires verification multiple times. The latter approach can be performed by
dividing the data into two different parts and performing the analysis to check the
reliability of the sample parts. When the clustering process is executed, the latter
method can be modified by obtaining the cluster centroids from the first part and
using them to describe clusters in the second part. Cross validation is a more gener-
alized approach compared to the first. Concerning cross-validation of the data,
discriminant measures such as Wilks' lambda (Λ) and the Kappa index are the
most widely applied methods in marketing research [74]. Before examining the empirical
task, one should first check whether any type of reliability has been
taken into account. The distance between clusters should be measured
through within-cluster and between-cluster sums of squares, a scatter matrix of the data points, and
validity indexes. Further, different indexes could be employed to determine the number of
fuzzy clusters in the datasets. Some of the indexes also compare the clusters. Hence,
inherently, the data sets should be checked and rechecked through the proper method
to test their reliability.
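The split-half procedure described above, clustering each half separately and then checking whether the first half's centroids reproduce the second half's cluster labels, can be sketched as follows; the 1-D two-means routine, the even/odd split, and the income data are simplifying assumptions for illustration.

```python
from itertools import permutations

def two_means_1d(xs, iters=25):
    """Tiny 1-D two-means; returns the two cluster centres."""
    c = [min(xs), max(xs)]
    for _ in range(iters):
        groups = ([], [])
        # assign each value to its nearer centre
        for x in xs:
            groups[0 if abs(x - c[0]) <= abs(x - c[1]) else 1].append(x)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c

def split_half_agreement(xs):
    """Cluster each half separately (even/odd split), label the second half
    with both sets of centres, and return the share of matching labels
    under the best permutation of cluster labels."""
    half_a, half_b = xs[0::2], xs[1::2]
    ca, cb = two_means_1d(half_a), two_means_1d(half_b)
    la = [min((0, 1), key=lambda j: abs(x - ca[j])) for x in half_b]
    lb = [min((0, 1), key=lambda j: abs(x - cb[j])) for x in half_b]
    return max(sum(p[a] == b for a, b in zip(la, lb)) / len(la)
               for p in permutations((0, 1)))

# Incomes (k$) with two well-separated groups: the split halves should agree fully
incomes = [21, 23, 22, 24, 20, 25, 80, 78, 82, 79, 81, 83]
print(split_half_agreement(incomes))  # → 1.0
```

The raw agreement returned here could be replaced by the Kappa index mentioned above to correct for the agreement expected by chance.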

4.3 Selection of Proper Data Mining Model

Data mining is the process of analysing large volumes of data to extract business
insight, which helps companies to resolve problems, mitigate risks,
and grasp new opportunities. This branch of data science takes its name from the
similarity between searching for important information in a large database
and mining a mountain for ore: both processes require sifting through enormous amounts of material
to find hidden value. Data mining can answer many kinds of business questions that
traditionally took much longer to resolve manually. Using a wide range of statistical
techniques to analyse data from different perspectives, users can identify patterns,
trends, and relationships. Customer segmentation is a measure of concern for market
analysis where proper data classification is important. Although several statistical
techniques can be applied to a customer database, data mining techniques
can help predict, analyse, and profile the customer in a significant way. Much of the
academic literature has emphasized various data mining techniques, such as
supervised, unsupervised, and other approaches, but it can be difficult
to identify the most suitable technique for a particular study. Researchers should
therefore have domain knowledge of the business, of the techniques, and of model fitness. Here, we
proposed a data mining model (Fig. 3) based on the suitability of customer needs
and expectations.

5 Discussion and Conclusion

Customer segmentation using data mining is a recent study where most of the
academic literature suggests the classification of data. Some of these studies empha-
sized different clustering methods also. However, the selection of segmentation techniques
is a challenging task for a business. Regarding the selection of segmentation
techniques, two important aspects must be considered, i.e. the objectives of management
and recent trends in the market. Classical methods like factor analysis, regression,
conjoint analysis, or coefficients of determination may not provide accurate predictions.
Therefore, in this review, we observed that computational algorithms can better
support businesses in analysis and prediction. Most businesses are expanding their
products or services into different markets and searching for a better customer
portfolio in which to target customers and position their brands.
In this connection, we highlighted four types of segmentation techniques, such as
general observable variables, unobservable variables, product-specific observable
variables, and product-specific unobservable variables. In the first case, the variables
are geographic, demographic, socio-economic, and cultural; in the second case,
they include lifestyle, psychographics, attitude, and emotions; in the third case, the
variables are frequency of purchase and loyalty; and finally, in the fourth case, the
variables are benefits, preference, and intention. Therefore, segmenting the customer
through data mining techniques like K-means, RFM, GA, ANN, kernel methods, and PSO
could help the marketer segment properly.

Fig. 3  Proposed model for selection of data mining techniques. The model proceeds through three stages: (1) criteria of selection: computational complexity, optimization, flexibility, scalability, interpretability, encoding of the problem, accessibility; (2) candidate data mining techniques: ANN, GA, RFM, SVM, EA, CHAID, K-means, bagged clustering, kernel methods, multidimensional scaling, Taguchi method, model-based clustering, rough sets; (3) data mining tasks: classification, prediction, association, cluster analysis, time series, regression.
In the future, marketing strategies will rely on these customer segmentation
techniques applied to large data banks. For example, a credit card service provider will
collect all kinds of customer information from the bank to offer credit
cards, and an insurance company will collect prospective customer information to sell its
services. Although the data sets are large, some human interaction is still necessary
for prediction. Data mining techniques use algorithms to learn from labelled and
unlabelled training inputs to produce valid outputs, so both supervised and unsupervised
approaches can deliver adequate output. With the use of data mining techniques,
a business can grow through a well-founded marketing strategy to expand and diversify
its products or services.

References

1. G. Lefait, T. Kechadi, Customer segmentation architecture based on clustering techniques,


in 2010 Fourth International Conference on Digital Society (IEEE, 2010). https://doi.org/10.
1109/ICDS.2010.47
2. P.Q. Brito et al., Customer segmentation in a large database of an online customized fashion
business. Robot. Comput.-Integr. Manuf. 36, 93–100 (2015). https://doi.org/10.1016/j.rcim.
2014.12.014
3. W.R. Smith, Product differentiation and market segmentation as alternative marketing
strategies. J. Mark. 21(1), 3–8 (1956). https://doi.org/10.1177/002224295602100102
4. A. Nairn, P. Berthon, Creating the customer: the influence of advertising on consumer market
segments—evidence and ethics. J. Bus. Ethics 42(1), 83–100 (2003). https://doi.org/10.1023/
A:1021620825950
5. A. Hiziroglu, Soft computing applications in customer segmentation: state-of-art review and
critique. Expert Syst. Appl. 40(16), 6491–6507 (2013). https://doi.org/10.1016/j.eswa.2013.
05.052
6. A. Hajiha, R. Radfar, S.S. Malayeri, Data mining application for customer segmentation based
on loyalty: an Iranian food industry case study, in 2011 IEEE International Conference on
Industrial Engineering and Engineering Management (IEEE, 2011). https://doi.org/10.1109/
IEEM.2011.6117968
7. V. Golmah, G. Mirhashemi, Implementing a data mining solution to customer segmentation
for decayable products—a case study for a textile firm. Int. J. Database Theory Appl. 5(3),
73–90 (2012)
8. M.M.T.M. Hassan, M. Tabasum, Customer profiling and segmentation in retail banks using
data mining techniques. Int. J. Adv. Res. Comput. Sci. 9(4), 24–29 (2018)
9. S.Y. Hosseini, A.Z. Bideh, A data mining approach for segmentation-based importance-
performance analysis (SOM–BPNN–IPA): a new framework for developing customer retention
strategies. Serv. Bus. 8(2), 295–312 (2014). https://doi.org/10.1007/s11628-013-0197-7
10. M. Carnein, H. Trautmann, Customer segmentation based on transactional data using stream
clustering, in Pacific-Asia Conference on Knowledge Discovery and Data Mining (Springer,
Cham, 2019). https://doi.org/10.1007/978-3-030-16148-4_22
11. W. Wang, S. Fan, Application of data mining technique in customer segmentation of shipping
enterprises, in 2010 2nd International Workshop on Database Technology and Applications
(IEEE, 2010). https://doi.org/10.1109/DBTA.2010.5659081
12. J. Ranjan, R. Agarwal, Application of segmentation in customer relationship management: a
data mining perspective. Int. J. Electron. Custom. Relat. Manag. 3(4), 402–414 (2009). https://
doi.org/10.1504/IJECRM.2009.029298
13. L.-S. Chen, C.-C. Hsu, M.-C. Chen, Customer segmentation and classification from blogs by
using data mining: an example of VOIP phone. Cybern. Syst. Int. J. 40(7), 608–632 (2009).
https://doi.org/10.1080/01969720903152593
14. Z. Yihua, Vip customer segmentation based on data mining in mobile-communications industry,
in 2010 5th International Conference on Computer Science & Education (IEEE, 2010). https://
doi.org/10.1109/ICCSE.2010.5593669
15. C. Qiuru et al., Telecom customer segmentation based on cluster analysis, in 2012 International
Conference on Computer Science and Information Processing (CSIP) (IEEE, 2012). https://
doi.org/10.1109/CSIP.2012.6309069
16. H. Gong, Q. Xia, Study on application of customer segmentation based on data mining tech-
nology, in 2009 ETP International Conference on Future Computer and Communication (IEEE,
2009). https://doi.org/10.1109/FCC.2009.66
17. X. Lai, Segmentation study on enterprise customers based on data mining technology, in 2009
First International Workshop on Database Technology and Applications (IEEE, 2009). https://
doi.org/10.1109/DBTA.2009.96
18. H. Hwang, T. Jung, E. Suh, An LTV model and customer segmentation based on customer
value: a case study on the wireless telecommunication industry. Expert Syst. Appl. 26(2),
181–188 (2004). https://doi.org/10.1016/S0957-4174(03)00133-7
19. C.-H. Cheng, Y.-S. Chen, Classifying the segmentation of customer value via RFM model and
RS theory. Expert Syst. Appl. 36(3), 4176–4184 (2009). https://doi.org/10.1016/j.eswa.2008.
04.003
20. S. Kelly, Mining data to discover customer segments. Interact. Mark. 4(3), 235–242 (2003).
https://doi.org/10.1057/palgrave.im.4340185
21. R.J. Calantone, J.S. Johar, Seasonal segmentation of the tourism market using a benefit segmen-
tation framework. J. Travel Res. 23(2), 14–24 (1984). https://doi.org/10.1177/004728758402
300203
22. W. Wang et al., A weakly supervised approach for object detection based on soft-label boosting,
in 2013 IEEE Workshop on Applications of Computer Vision (WACV) (IEEE, 2013). https://
doi.org/10.1109/WACV.2013.6475037
23. N. Malandrakis et al., A supervised approach to movie emotion tracking, in 2011 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2011). https://
doi.org/10.1109/ICASSP.2011.5946961
24. L. Yang et al., A supervised approach to the evaluation of image segmentation methods, in
International Conference on Computer Analysis of Images and Patterns (Springer, Berlin,
Heidelberg, 1995). https://doi.org/10.1007/3-540-60268-2_377
25. Md.S. Islam et al., Supervised approach of sentimentality extraction from Bengali Face-
book status, in 2016 19th International Conference on Computer and Information Technology
(ICCIT) (IEEE, 2016). https://doi.org/10.1109/ICCITECHN.2016.7860228
26. D. Turnbull et al., A supervised approach for detecting boundaries in music using difference
features and boosting, in ISMIR (2007)
27. L. Yang et al., A supervised approach to the evaluation of image segmentation methods, in
International Conference on Computer Analysis of Images and Patterns (Springer, Berlin,
Heidelberg, 1995). https://doi.org/10.1016/j.neucom.2011.09.002
28. I. Monroy et al., A semi-supervised approach to fault diagnosis for chemical processes. Comput.
Chem. Eng. 34(5), 631–642 (2010). https://doi.org/10.1016/j.compchemeng.2009.12.008
29. L. Sun et al., A novel weakly-supervised approach for RGB-D-based nuclear waste object
detection. IEEE Sens. J. 19(9), 3487–3500 (2018). https://doi.org/10.1109/JSEN.2018.288
8815
30. A.J. Ferreira, M.A.T. Figueiredo, An unsupervised approach to feature discretization and
selection. Pattern Recogn. 45(9), 3048–3060 (2012). https://doi.org/10.1016/j.patcog.2011.
12.008
31. E.N. Nasibov, G. Ulutagay, A new unsupervised approach for fuzzy clustering. Fuzzy Sets
Syst. 158(19), 2118–2133 (2007). https://doi.org/10.1016/j.fss.2007.02.019
32. Ke. Hu, D.L. Wang, An unsupervised approach to cochannel speech separation. IEEE Trans.
Audio Speech Lang. Process. 21(1), 122–131 (2012). https://doi.org/10.1109/TASL.2012.221
5591
33. K. Ganesan, C.X. Zhai, E. Viegas, Micropinion generation: an unsupervised approach to gener-
ating ultra-concise summaries of opinions, in Proceedings of the 21st International Conference
on World Wide Web (2012)
34. D. Trabelsi et al., An unsupervised approach for automatic activity recognition based on hidden
Markov model regression. IEEE Trans. Autom. Sci. Eng. 10(3), 829–835 (2013). https://doi.
org/10.1109/TASE.2013.2256349
35. R.M. Alguliyev, R.M. Aliguliyev, N.R. Isazade, An unsupervised approach to generating
generic summaries of documents. Appl. Soft Comput. 34, 236–250 (2015). https://doi.org/
10.1016/j.asoc.2015.04.050
36. J.A. McCarty, M. Hastak, Segmentation approaches in data-mining: a comparison of RFM,
CHAID, and logistic regression. J. Bus. Res. 60(6), 656–662 (2007). https://doi.org/10.1016/
j.jbusres.2006.06.015
37. W. Li et al., Credit card customer segmentation and target marketing based on data mining,
in 2010 International Conference on Computational Intelligence and Security (IEEE, 2010).
https://doi.org/10.1109/CIS.2010.23
38. Z. Lu et al., Customer segmentation algorithm based on data mining for electric vehicles, in 2019
IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)
(IEEE, 2019). https://doi.org/10.1109/ICCCBDA.2019.8725737
39. V.L. Miguéis, A.S. Camanho, J. Falcão e Cunha, Customer data mining for lifestyle segmen-
tation. Expert Syst. Appl. 39(10), 9359–9366 (2012). https://doi.org/10.1016/j.eswa.2012.
02.133
40. C.-Y Chiu et al., An intelligent market segmentation system using k-means and particle
swarm optimization. Expert Syst. Appl. 36(3), 4558–4565 (2009). https://doi.org/10.1016/j.
eswa.2008.05.029
41. S. Dutta, S. Bhattacharya, K.K. Guin, Data mining in market segmentation: a literature review
and suggestions, in Proceedings of Fourth International Conference on Soft Computing for
Problem Solving (Springer, New Delhi, 2015). https://doi.org/10.1007/978-81-322-2217-0_8
42. E.R. Swenson, N.D. Bastian, H.B. Nembhard, Healthcare market segmentation and data
mining: a systematic review. Health Mark. Q. 35(3), 186–208 (2018). https://doi.org/10.1080/
07359683.2018.1514734
43. S. Mckechnie, Integrating intelligent systems into marketing to support market segmentation
decisions. Intell. Syst. Account. Finance Manag. Int. J. 14(3), 117–127 (2006). https://doi.org/
10.1002/isaf.280
44. P. Kotler, K.L. Keller, Marketing Management, ed. by W. Lassar, international 11th edn.
(Prentice Hall, New Jersey, 2003)
45. M. Wedel, W.A. Kamakura, Market Segmentation: Conceptual and Methodological Founda-
tions, vol. 8 (Springer Science & Business Media, 2012)
46. Y. Wind, Issues and advances in segmentation research. J. Mark. Res. 15(3), 317–337 (1978).
https://doi.org/10.1177/002224377801500302
47. L. Alfansi, A. Sargeant, Market segmentation in the Indonesian banking sector: the relationship
between demographics and desired customer benefits. Int. J. Bank Mark. (2000). https://doi.
org/10.1108/02652320010322976
48. D.G. Tonks, Validity and the design of market segments. J. Mark. Manag. 25(3–4), 341–356
(2009). https://doi.org/10.1362/026725709X429782
49. M. Taks, J. Scheerder, Youth sports participation styles and market segmentation profiles:
evidence and applications. Eur. Sport Manag. Q. 6(2), 85–121 (2006). https://doi.org/10.1080/
16184740600954080
50. J. Bruwer, E. Li, Wine-related lifestyle (WRL) market segmentation: demographic and
behavioural factors. J. Wine Res. 18(1), 19–34 (2007). https://doi.org/10.1080/095712607015
26865
51. P. Vyncke, Lifestyle segmentation: from attitudes, interests and opinions, to values, aesthetic
styles, life visions and media preferences. Eur. J. Commun. 17(4), 445–463 (2002). https://doi.
org/10.1177/02673231020170040301
52. A. Vellido, P.J.G. Lisboa, K. Meehan, Segmentation of the on-line shopping market using
neural networks. Expert Syst. Appl. 17(4), 303–314 (1999). https://doi.org/10.1016/S0957-
4174(99)00042-1
53. J. Swait, A structural equation model of latent segmentation and product choice for cross-
sectional revealed preference choice data. J. Retail. Consum. Serv. 1(2), 77–89 (1994). https://
doi.org/10.1016/0969-6989(94)90002-7
54. T. Teichert, E. Shehu, I. von Wartburg, Customer segmentation revisited: the case of the airline
industry. Transp. Res. Part A Policy Pract. 42(1), 227–242 (2008). https://doi.org/10.1016/j.
tra.2007.08.003
55. A. Lindridge, S. Dibb, Is ‘culture’ a justifiable variable for market segmentation? A cross-
cultural example. J. Consum. Behav. Int. Res. Rev. 2(3), 269–286 (2003). https://doi.org/10.
1002/cb.106
56. F. Casarin, A. Moretti, An international review of cultural consumption research. SSRN
Electron. J. Department of Management, Università Ca’ Foscari Venezia working paper 12
(2011)
57. A.M. Gonzalez, L. Bello, The construct “lifestyle” in market segmentation: the behaviour of
tourist consumers. Eur. J. Mark. (2002). https://doi.org/10.1108/03090560210412700
58. D.B. Valentine, T.L. Powers, Generation Y values and lifestyle segments. J. Consum. Mark.
(2013). https://doi.org/10.1108/JCM-07-2013-0650
59. U.R. Orth et al., Promoting brand benefits: the role of consumer psychographics and lifestyle.
J. Consum. Mark. (2004). https://doi.org/10.1108/07363760410525669
60. C.-S. Yu, Construction and validation of an e-lifestyle instrument. Internet Res. (2011). https://
doi.org/10.1108/10662241111139282
61. A.M. Thompson, P.F. Kaminski, Psychographic and lifestyle antecedents of service quality
expectations: a segmentation approach. J. Serv. Mark. (1993). https://doi.org/10.1108/088760
49310047742
62. J.L.M. Tam, S.H.C. Tai, Research note: the psychographic segmentation of the female market
in Greater China. Int. Mark. Rev. (1998). https://doi.org/10.1108/02651339810205258
63. T.F. Srihadi, D. Sukandar, A.W. Soehadi, Segmentation of the tourism market for Jakarta:
classification of foreign visitors’ lifestyle typologies. Tour. Manag. Perspect. 19, 32–39 (2016).
https://doi.org/10.1016/j.tmp.2016.03.005
64. B. Oates, L. Shufeldt, B. Vaught, A psychographic study of the elderly and retail store attributes.
J. Consum. Mark. (1996). https://doi.org/10.1108/07363769610152572
65. T.M.M. Verhallen, R.T. Frambach, J. Prabhu, Strategy-based segmentation of industrial
markets. Ind. Mark. Manag. 27(4), 305–313 (1998). https://doi.org/10.1016/S0019-850
1(97)00064-3
66. E.J. Cheron, R. McTavish, J. Perrien, Segmentation of bank commercial markets. Int. J. Bank
Mark. (1989). https://doi.org/10.1108/EUM0000000001458
67. S.W. Clopton, J.E. Stoddard, D. Dave, Event preferences among arts patrons: implications for
market segmentation and arts management. Int. J. Arts Manag. 48–59 (2006)
68. A. Buratto, L. Grosset, B. Viscolani, Advertising a new product in a segmented market. Eur. J.
Oper. Res. 175(2), 1262–1267 (2006)
69. R. Sánchez-Fernández, M. Ángeles Iniesta-Bonillo, A. Cervera-Taulet, Exploring the concept
of perceived sustainability at tourist destinations: a market segmentation approach. J. Travel
Tour. Mark. 36(2), 176–190 (2019)
70. K. Bijak, L.C. Thomas, Does segmentation always improve model performance in credit
scoring? Expert Syst. Appl. 39(3), 2433–2442 (2012). https://doi.org/10.1016/j.eswa.2011.
08.093
71. A. Sell, P. Walden, Segmentation bases in the mobile services market: attitudes in, demographics
out, in 2012 45th Hawaii International Conference on System Sciences (IEEE, 2012)
72. A. Sell, J. Mezei, P. Walden, An attitude-based latent class segmentation analysis of mobile
phone users. Telemat. Inform. 31(2), 209–219 (2014)
73. D.J. Ketchen, C.L. Shook, The application of cluster analysis in strategic management research:
an analysis and critique. Strateg. Manag. J. 17(6), 441–458 (1996)
74. G. Punj, D.W. Stewart, Cluster analysis in marketing research: review and suggestions for
application. J. Mark. Res. 20(2), 134–148 (1983)
Solar Radiation Prediction Using
Artificial Neural Network:
A Comprehensive Review

Bireswar Paul and Hrituparna Paul

Abstract Solar energy has a great potential along with wind energy to fulfill the
energy demands of any nation. In the last few decades, considerable research interest
has been shown to harness solar energy. This paper is prepared to consider the role
of artificial neural network (ANN) in predicting the solar radiation value of any
location. The prime objective of this paper is to review the most recent study on
ANN-based techniques for finding out the best available methods in the literature
for the prediction of solar radiation in comparison with the conventional methods. A
comprehensive discussion has been presented to find out some of the research gaps
in this domain.

Keywords Solar radiation · Prediction · Artificial neural network · Performance


indices

1 Introduction

The integration of renewable energy sources, mainly unpredictable ones such as
wind and solar, into existing or future energy supply networks is going to be one
of the most challenging tasks for the global energy market. Rising energy
demand needs to be fulfilled by new power plants, and because most present-day
power plants run on fossil fuels, any new addition will lead to an increase in
greenhouse gas emissions, which contributes to global warming [1]. Also, it is expected that
fossil fuel reserves will be exhausted in the coming few decades; hence, they can serve as a
temporary alternative only. Permanent sources have to be found, and research
must be done to make those sources more efficient and reliable. Considering India's
geographical conditions, solar and wind energy are two major alternative sources

B. Paul (B)
Department of Mechanical Engineering, Motilal Nehru National Institute of Technology
Allahabad, Prayagraj, UP, India
e-mail: bipaul@mnnit.ac.in
H. Paul
Department of Computer Science and Engineering, LDC Institute of Technical Studies, Soraon,
Prayagraj, UP, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 509
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_39
of energy with the potential to act as a backup to supply the additionally
required energy. However, the biggest drawback of these two sources of energy is the
inconsistency of their energy supply. Solar, being a priceless source, requires a device called
a solar PV cell to convert it to electrical energy directly. The conversion can also take
place using a concentrated solar power plant [2]. By looking at past decades’ records,
a continuous growth can be seen in new plants installation every year. This is due to
the supportive government policies and interest of investment companies. The other
factor which catalyzed this process is the new development in technology which is
making it a cheap and reliable source of energy to compete with conventional sources.
The problem that made solar less popular in past is less technical advancement and
irregular performance due to uncertain output generation which causes problems
for grid connection and high cost of the energy storage system. These problems are
addressed effectively in the past decade or so which can be seen from the continuous
rise in installed solar capacity throughout the world. The output of PV cells is severely
affected by two main factors: one is the day-night cycle and the other is cloud structure.
Hence, before integrating a solar plant with the grid, one needs to perform an accurate
availability evaluation. Effective integration of large-scale solar power plants while
maintaining reliability is the biggest challenge for solar energy supply. This challenge
creates the need for accurate solar radiation forecasting information.
During the last five years, the solar energy capacity has increased from 2.6 GW in
March 2014 to 30 GW in July 2019. Recently, India achieved the fifth global position
in solar power deployment by surpassing Italy [3].
However, the incorporation of these renewable energy technologies into the
conventional electrical grid has been found to have many challenges. Some of the
challenges are related to the consistency and reliability of the generation systems
being used, but the main drawback of these renewable sources of energy is their inter-
mittent nature and dependency on climatic conditions. The stability of the network
is very much important to maintain a constant grid frequency. A load lesser than
the supply makes the frequency increase, and a load greater than the supply makes
the frequency decrease. To mitigate this challenge, a lot of research activities are
taking place worldwide to develop more accurate models for renewable energy supply
prediction.
The use of ANN can be seen in various fields of engineering applications. This
technique has been used in the modeling of intricate thermo-physical phenomena,
as in heating, refrigeration, ventilating and air conditioning systems, power genera-
tion systems, solar thermal energy systems, solar radiation modeling and controls,
load prediction, etc. The ANN technique has become progressively popular in the
last few decades (as shown in Fig. 1) [4]. Most of the works show that ANNs are far
more accurate than conventional methods.
The objective of this paper is to review the most recent study on ANN-based
techniques for finding out the best available methods in the literature for the prediction
of solar radiation in comparison with the conventional methods. A comparative study
of some of the important papers is also presented to choose the best suitable model
based on the input parameters available.
Fig. 1 Increasing trend of research in the field of solar radiation prediction using ANN. Source:
https://www.webofscience.com/wos/woscc/basic-search

2 ANN Models for Prediction in the Field of Solar Radiation

Generally, solar radiation is measured with the help of solar radiation measurement
devices. Nevertheless, due to the higher installation and maintenance cost and the
requirement of periodic calibration, these devices are very sparse at most research
institutes. It is therefore important to forecast the solar radiation of a particular
location from easily measurable climatic variables such as temperature, relative
humidity, and wind speed. Keeping this in mind, several models have been proposed
in the literature to predict solar radiation data. In the following section, some of the
important papers are discussed.
Bilgili and Ozgoren [5] have predicted the daily total global solar radiation using
three different techniques mainly multi-nonlinear regression (MNLR), multi-linear
regression (MLR), and feed-forward ANN methods in Adana city of Turkey. They
have reported that the ANN method performed better than the other two methods in
predicting the daily global solar radiation. They have found that the average mean
absolute percentage errors (MAPE) of testing for the MNLR, MLR, and ANN models
are 18.41%, 32.1%, and 12.89%, respectively. This shows that the ANN model
performs better than the other two models.
Al-Shamisi et al. [6] have studied two different ANN techniques mainly multilayer
perceptron (MLP) and radial basis function (RBF) to predict monthly average global
solar radiation of Al Ain city, UAE. They have trained the neural network model using
the weather data from 1995 to 2004. However, for testing and model validation, the
data between 2005 and 2007 were used. Eleven different models were tested with
MLP and RBF techniques with different input combinations. They have reported
that the RBF technique performs better than the MLP technique in most cases. The
statistical parameters such as MAPE, mean bias error (MBE), root mean square error
(RMSE), and correlation coefficient (R2 ) for the RBF model have been reported to
be 35%, 0.307%, 3.88%, and 92%, respectively.
Here, n refers to the number of data points; o_i and t_i are the ith predicted and
measured values, respectively (notation as in the formulas of Table 1).
Xue [7] has predicted the daily diffuse solar radiation using the ANN technique.
They have used mainly genetic algorithm (GA) and particle swarm optimization
(PSO) techniques to improve the accuracy of the back-propagation neural network
(BPNN) model. A total of seven input parameters, comprising the month of the
year, mean temperature, relative humidity, wind speed, sunshine duration, rainfall,
and daily global solar radiation, have been used for the study. The author has
observed that the BPNN optimized by the PSO model has outperformed the BPNN
optimized by GA models.
Alsina et al. [8] worked on the prediction of monthly average daily global solar
radiation over Italy using ANN models. They have used the data from 45 locations
for the training and testing of a multi-location ANN. At least 13 input parameters
were considered for each location, including the geographical coordinates and the
monthly values of climatological parameters. They have found that by using all the
available inputs, the best suitable ANN leads to a normalized root mean square error
(NRMSE) of 1.65%, a MAPE of about 2.66%, and a mean percentage bias error
(MPBE) of -0.20%.
Antonopoulos et al. [9] have investigated the Hargreaves method, multi-linear
regression methods (MLR), and ANN technology to estimate solar radiation. The
daily meteorological measurements of radiation, air temperature, relative humidity,
and wind velocity were used to develop the solar radiation models. They have
suggested the use of Hargreaves and multi-linear regression over ANN. They have
found in their study that the ANN models cannot be recommended due to the higher
complexity involved, which is not in proportion to the gain in accuracy.
Ağbulut et al. [10] have studied four different machine learning algorithms, namely
support vector machine (SVM), k-nearest neighbor (k-NN), deep learning
(DL), and ANN, to predict daily global solar radiation data of four different locations
in Turkey. The training of these algorithms has been done using daily maximum
and minimum ambient temperature, day length, cloud cover, extraterrestrial solar
radiation, and solar radiation of these locations. They have reported that the RMSE,
MABE, and R2 values of all the models range from 2.273 to 2.820 MJ/m2 ,
from 1.870 to 2.328 MJ/m2 , and from 0.855 to 0.936, respectively. They concluded that
the ANN model is the best among all the models. However, they have also stated
that all the machine learning models can be used to predict solar radiation with high
accuracy.
A similar kind of study has also been reported by many researchers. It can be
observed that in addition to the different pragmatic models available in the literature,
different AI-based techniques such as k-nearest neighbor (k-NN), SVM, deep
learning (DL), genetic algorithms (GA), and ANN have recently gained a lot of attention
from the scientific community in the prediction of solar radiation data [11–25].
Fig. 2 Representative ANN model for solar radiation prediction

Figure 2 shows a representative ANN model for solar radiation forecasting. Almost
in all the studies, it is reported that the ANN methods have offered more accurate
results in comparison with the conventional methods available for the prediction of
solar radiation. The different statistical prediction accuracy indices reported in the
literature are presented in Table 1.
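A representative ANN model of this kind can be sketched concretely. The following is a minimal illustration, trained on synthetic data in which three normalized climatic inputs (temperature, humidity, sunshine hours) map to a radiation value; the architecture, data, and hyperparameters are illustrative assumptions and are not taken from any of the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: 3 climatic inputs -> 1 radiation target (all normalized)
X = rng.uniform(0.0, 1.0, size=(200, 3))                 # temp, humidity, sunshine
y = (0.6 * X[:, 2] + 0.3 * X[:, 0] - 0.2 * X[:, 1]).reshape(-1, 1)

# One hidden layer (8 tanh units) and a linear output unit
W1 = rng.normal(0.0, 0.5, (3, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros((1, 1))

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

lr = 0.3
for epoch in range(2000):
    h, pred = forward(X)
    err = pred - y                                        # gradient of MSE/2
    # Backpropagation through the two layers
    dW2 = h.T @ err / len(X); db2 = err.mean(axis=0, keepdims=True)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    dW1 = X.T @ dh / len(X); db1 = dh.mean(axis=0, keepdims=True)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

_, pred = forward(X)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(f"training RMSE: {rmse:.4f}")
```

With this clean synthetic target, the network converges quickly; on real measured radiation data, the input normalization, hidden layer size, and training algorithm would all need tuning.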

3 Comparative Study of the Different Techniques Used

In this section, a comparative study of the research work done in the field of solar
radiation prediction has been presented. From the literature review, it can be observed
that different conventional, linear, nonlinear, fuzzy logic, and neural network-based
models have been used for solar radiation prediction. Diez et al. [26]
have compared the performance parameters of the ANN model with the classic
models (CENSOLAR typical year, weighted moving mean with two days delay,
linear regression, and Fourier and Markov analysis). They have found that the ANN
model is better and easy to implement as they require fewer inputs in comparison with
the classic models. Citakoglu [27] has compared the performance of ANN, adaptive
network fuzzy inference system (ANFIS), multiple linear regression (MLR) models,
and four empirical equations (Angstrom, Abdalla, Bahel, and Hargreaves–Samani)
used for the estimation of solar radiation. They have reported that the ANN model has
Table 1 Different performance indicators used in ANN [4–25]

Performance indicator | Formula
R (correlation coefficient) | $R = \dfrac{\sum_{i=1}^{n}\bigl(o_i - \frac{1}{n}\sum_{i=1}^{n} o_i\bigr)\bigl(t_i - \frac{1}{n}\sum_{i=1}^{n} t_i\bigr)}{\sqrt{\sum_{i=1}^{n}\bigl(o_i - \frac{1}{n}\sum_{i=1}^{n} o_i\bigr)^{2}}\,\sqrt{\sum_{i=1}^{n}\bigl(t_i - \frac{1}{n}\sum_{i=1}^{n} t_i\bigr)^{2}}}$
R2 (coefficient of determination) | $R^{2} = 1 - \dfrac{\sum_{i=1}^{n}(t_i - o_i)^{2}}{\sum_{i=1}^{n} o_i^{2}}$
RMSE (root mean square error) | $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(o_i - t_i)^{2}}$
MAPE (mean absolute percentage error) | $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{o_i - t_i}{t_i}\right| \times 100$
MAE (mean absolute error) | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|o_i - t_i|$
MBE (mean bias error) | $\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}(o_i - t_i)$
RMBE (relative mean bias error) | $\mathrm{RMBE} = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(o_i - t_i)}{\frac{1}{n}\sum_{i=1}^{n} o_i} \times 100$
MRV (mean relative variance) | $\mathrm{MRV} = \dfrac{\sum_{i=1}^{n}(t_i - o_i)^{2}}{\sum_{i=1}^{n}\bigl(t_i - \frac{1}{n}\sum_{i=1}^{n} t_i\bigr)^{2}}$
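Taking o_i as the predicted and t_i as the measured values, some of these indicators can be computed directly with NumPy; this is a generic sketch for illustration, not code from any of the cited studies:

```python
import numpy as np

def rmse(o, t):
    """Root mean square error."""
    return float(np.sqrt(np.mean((o - t) ** 2)))

def mape(o, t):
    """Mean absolute percentage error (t must be nonzero)."""
    return float(np.mean(np.abs((o - t) / t)) * 100)

def mae(o, t):
    """Mean absolute error."""
    return float(np.mean(np.abs(o - t)))

def mbe(o, t):
    """Mean bias error (positive means overprediction on average)."""
    return float(np.mean(o - t))

def corr(o, t):
    """Correlation coefficient R."""
    od, td = o - o.mean(), t - t.mean()
    return float(np.sum(od * td) / np.sqrt(np.sum(od ** 2) * np.sum(td ** 2)))

o = np.array([4.8, 5.1, 6.2])   # predicted values o_i
t = np.array([5.0, 5.0, 6.0])   # measured values t_i
print(rmse(o, t), mape(o, t), mbe(o, t), corr(o, t))
```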

better performance accuracy among all the techniques compared. Table 2 summarizes
a comparative representation of the research on solar radiation prediction reported
in the literature in the last decade.

4 Discussion and Conclusion

From the existing literature, it can be observed that the ANN-based techniques are
very popular in predicting solar radiation. However, some of the key observations
and research gaps experienced during the study have been summarized below.
• It has been observed that long-duration climatic parameter datasets are required
to train, test, and validate solar radiation prediction models. The higher the
volume of data, the higher the accuracy. However, such a huge volume of
data is not easily available for most locations due to the high cost of the
measuring instruments. Along with that, the difficulty of access to the
measuring locations also puts severe constraints on building accurate and
correct models.
• From the published literature, it can also be observed that the precise choice of
meteorological and geographical input parameters plays a key role in predicting
solar radiation with reliability and better accuracy.
• Sunshine hour and air temperature have been observed to be the most effective
input parameters to predict solar radiation. However, a generalized study can be
done to observe the influence of each parameter on the overall performance of
ANN models.

Table 2 Comparative representation of the research on solar radiation prediction

Component | Authors | Techniques used | Input parameters used | Performance indicators | Remarks
Global | Bilgili and Ozgoren [5] | MLR, MNLR, and ANN | Sunshine duration, air temperature, wind speed, solar radiation | Testing MAPE (%): MLR 16.55–92.41; MNLR 14.58–28.23; ANN 9.23–18.50 | ANN model is the most useful among all the models due to its low MAPE value
Global | Al-Shamisi et al. [6] | ANN (radial basis function, multilayer perceptron) | Maximum temperature, mean wind speed, sunshine, and mean relative humidity | RMSE = 35%; MBE = 0.307%; MEP = 3.88%; R2 = 92% | In most of the cases, the RBF technique performed better than the MLP technique
Global/daily | Xue [7] | BPNN, genetic algorithm (GA), and particle swarm optimization (PSO) | Month of the year, sunshine duration, mean temperature, rainfall, wind speed, relative humidity, and daily global solar radiation | R = 0.934–0.953; RMSE = 0.78–0.932; MAE = 0.685–0.836 | BPNN optimized by the PSO model is better than BPNN and BPNN with GA
Global | Antonopoulos et al. [9] | Hargreaves method, ANN, multi-linear regression methods | Air temperature, radiation, humidity, and wind velocity | Correlation coefficient (r) = 0.891–0.94 | Hargreaves and the multi-linear regression model outperformed the ANN model
Global/daily | Ağbulut et al. [10] | ANN, SVM algorithms, DL, and k-nearest neighbor algorithms | Maximum ambient temperature, minimum ambient temperature, cloud cover, and solar radiation | R2 = 0.855–0.936; MABE = 1.870–2.328 MJ/m2 ; RMSE = 2.273–2.820 MJ/m2 | ANN has evolved to be the best fitting algorithm as the error magnitudes are very low
Global | Álvarez-Alvarado et al. [11] | ANN, SVM algorithms, search optimization algorithms (SOA), genetic algorithms (GA), and the particle swarm optimization algorithm (PSO) | Maximum and minimum air temperature, maximum and minimum relative humidity, wind speed, evaporation, and vapor pressure estimates | RMSE = 1.86%; MEP = 11.51% | SVM algorithm is much faster than ANN in the training and testing phases
Global/daily | Taki et al. [12] | ANN, support vector machine, adaptive network-based fuzzy inference system, and multiple linear regression | Daily extraterrestrial radiation, daily global solar radiation, day length, daily average relative humidity, daily maximum temperature, daily average temperature, and daily total rainfall | Testing R2 = 0.77–0.97; Training R2 = 0.97–0.99 | The radial basis function is the best and most accurate model for estimating solar radiation with less error from a small set of data
Global | Pang et al. [13] | ANN model, recurrent neural network (RNN) model, and deep learning algorithm | Solar radiation, dry bulb temperature, dew point temperature, wind speed, gust speed, and wind direction | ANN R2 = 0.933–0.974; RNN R2 = 0.97–0.983 | RNN model has the highest prediction accuracy in comparison with ANN
Global | Ozgoren et al. [14] | ANN, multi-nonlinear regression model | Latitude, longitude, altitude, month, monthly maximum atmospheric temperature, minimum atmospheric temperature, mean atmospheric temperature, soil temperature, wind speed, relative humidity, atmospheric pressure, rainfall, vapor pressure, cloudiness, and sunshine duration | MAPE = 2.14–8.07%; correlation coefficient (r) = 0.9854–0.9990 | It was found that for the ANN model, the error values were within the acceptable limits
Surface | Li et al. [15] | Principal component analysis, wavelet transform analysis, and ANN | Daily surface solar radiation maps, clear sky index | RMSE = 30.78–30.98 W/m2 | They have proposed a hybrid model for future mapping prediction
Global/monthly/daily | Wang et al. [16] | Ensemble empirical mode decomposition, wavelet analysis, reconfiguration methods, regression model, and ANN | Air temperature, radiation, humidity, and wind velocity | EEMD-RE: RMSE = 1.135; MAPE = 22.11%; R2 = 0.5484 | EEMD-RE model has been recommended for daily long-term prediction
Global/daily | Meenal and Immanuel Selvakumar [17] | SVM model, ANN, and empirical model | Month, latitude, longitude, bright sunshine hours, day length, relative humidity, and maximum and minimum temperature | SVM: R = 0.9756; RMSE = 0.688 | SVM model performs better than ANN and empirical models
Global | Feng et al. [18] | Extreme learning machine (ELM), backpropagation neural networks optimized by genetic algorithm (GANN), random forests (RF), and generalized regression neural networks (GRNN) | Sunshine duration, global solar radiation, and diffuse solar radiation | RRMSE (%) = 13.4–35.8; R2 = 0.890–0.921 | GANN > ELM > RF > GRNN
• It has been experienced that different ANN models need to be developed using
different geographical input parameters, mainly latitude, longitude, altitude, and
extraterrestrial radiation, and checked for accuracy. This kind of study may be useful
to predict the solar radiation of any location without the dependency on the solar
radiation measurement instruments.
• It has also been observed that the performance indicators of various ANN models
do get altered with the impact of geographical and meteorological variables,
training algorithm, and ANN architecture configuration. Hence, a suitable selec-
tion of input parameters is very much significant for predicting solar radiation
with better accuracy.
• Some hybrid ANN models can be studied to check for the improvement in
accuracy.
This paper offered a comprehensive review of the most recent works to predict
solar radiation using ANN. Sustainable utilization of renewable solar energy involves
a precise understanding of the available solar radiation and its variation with climatic
parameters. In this direction, the ANN models are found to be the right choice to accu-
rately predict solar radiation. The ANN models are preferred due to their high poten-
tial to simulate the nonlinear and time-variant input–output systems in comparison
with other classical and empirical models available. This study provides an updated
state-of-the-art review to support further research in this direction. Moreover, apart
from a few studies working on this important topic, there is no existing method-
ological analysis available to perform the selection of the most pertinent input
variables for ANN models. Studies can be done in this direction to choose the right
input parameter combinations for better accuracy of ANN models. A real-time solar
radiation prediction system using ANN incurs a high computational cost. There
may be challenges with real-time training in the case of sudden changes in meteorological
data. Finally, it has also been observed that there is a paucity of research work in the
direction to predict diffuse and beam solar radiation using ANN.

References

1. S. Mekhilefa, R. Saidur, A. Safari, A review on solar energy use in industries. Renew. Sustain.
Energy Rev. 15, 1777–1790 (2011)
2. J.C.R. Kumar, M.A. Majid, Renewable energy for sustainable development in India: current
status, future prospects, challenges, employment, and investment opportunities. Energy Sustain.
Soc. 10, 2 (2020)
3. https://mnre.gov.in/solar/current-status/. Accessed 15 July 2021
4. W. Yaïci, E. Entchev, Performance prediction of a solar thermal energy system using artificial
neural networks. Appl. Therm. Eng. 73(1), 1348–1359 (2014)
5. M. Bilgili, M. Ozgoren, Daily total global solar radiation modeling from several meteorological
data. Meteorol. Atmos. Phys. 112, 125–138 (2011)
6. M.H. Al-Shamisi, A.H. Assi, H.A.N. Hejase, Artificial neural networks for predicting global
solar radiation in Al Ain City—UAE. Int. J. Green Energy 10(5), 443–456 (2013)
7. X. Xue, Prediction of daily diffuse solar radiation using artificial neural networks. Int. J.
Hydrogen Energy 42(47), 28214–28221 (2017)
8. E. Federico Alsina, M. Bortolini, M. Gamberi, A. Regattieri, Artificial neural network opti-
misation for monthly average daily global solar radiation prediction. Energy Convers. Manag.
120, 320–329 (2016)
9. V.Z. Antonopoulos, D.M. Papamichail, V.G. Aschonitis, A.V. Antonopoulos, Solar radiation
estimation methods using ANN and empirical models. Comput. Electron. Agric. 160, 160–167
(2019)
10. Ü. Ağbulut, A.E. Gürel, Y. Biçen, Prediction of daily global solar radiation using different
machine learning algorithms: evaluation and comparison. Renew. Sustain. Energy Rev. 135,
110114 (2021)
11. J.M. Álvarez-Alvarado, J.G. Ríos-Moreno, S.A. Obregón-Biosca, G. Ronquillo-Lomelí, E.
Ventura-Ramos Jr., M. Trejo-Perea, Hybrid techniques to predict solar radiation using support
vector machine and search optimization algorithms: a review. Appl. Sci. 11, 1044 (2021)
12. M. Taki, A. Rohani, H. Yildizhan, Application of machine learning for solar radiation modeling.
Theor. Appl. Climatol. 143, 1599–1613 (2021)
13. Z. Pang, F. Niu, Z. O’Neill, Solar radiation prediction using recurrent neural network and
artificial neural network: a case study with comparisons. Renew. Energy 156, 279–289 (2020)
14. M. Ozgoren, M. Bilgili, B. Sahin, Estimation of global solar radiation using ANN over Turkey.
Expert Syst. Appl. 39(5), 5043–5051 (2012)
15. P. Li, M. Bessafi, B. Morel, J. Chabriat, M. Delsaut, Q. Li, Daily surface solar radiation
prediction mapping using artificial neural network: the case study of Reunion Island. ASME.
J. Sol. Energy Eng. 142(2), 021009 (2020)
16. S.-Y. Wang, J. Qiu, F.-F. Li, Hybrid decomposition-reconfiguration models for long-term solar
radiation prediction only using historical radiation records. Energies 11, 1376 (2018)
17. R. Meenal, A. Immanuel Selvakumar, Assessment of SVM, empirical and ANN based solar
radiation prediction models with most influencing input parameters. Renew. Energy 121, 324–
343 (2018)
18. Y. Feng, N. Cui, Q. Zhang, L. Zhao, D. Gong, Comparison of artificial intelligence and empirical
models for estimation of daily diffuse solar radiation in North China Plain. Int. J. Hydrogen
Energy 42(21), 14418–14428 (2017)
19. M. Laidi, S. Hanini, A. Rezrazi, M.R. Yaiche, A.A. El Hadj, F. Chellai, Supervised artificial
neural network-based method for conversion of solar radiation data (case study: Algeria). Theor.
Appl. Climatol. 128, 439–451 (2017)
20. M. Vakilia, S.-R. Sabbagh-Yazdi, K. Kalhorb, S. Khosrojerdi, Using artificial neural networks
for prediction of global solar radiation in Tehran considering particulate matter air pollution.
Energy Procedia 74, 1205–1212 (2015)
21. B. Ihya, A. Mechaqrane, R. Tadili, M.N. Bargach, Prediction of hourly and daily diffuse solar
fraction in the city of Fez (Morocco). Theor. Appl. Climatol. 120(3), 737–749 (2014)
22. A.K. Yadav, H. Malik, S.S. Chandel, Selection of most relevant input parameters using WEKA
for artificial neural network based solar radiation prediction models. Renew. Sustain. Energy
Rev. 31, 509–519 (2014)
23. Y.W. Kean, V. Karri, Comparative study in predicting the global solar radiation for Darwin,
Australia. ASME. J. Sol. Energy Eng. 134(3), 034501 (2012)
24. A. Mellit, A.M. Pavan, A 24-h forecast of solar irradiance using artificial neural network:
application for performance prediction of a grid-connected PV plant at Trieste, Italy. Sol.
Energy 84(5), 807–821 (2010)
25. M.A. Behrang, E. Assareh, A. Ghanbarzadeh, A.R. Noghrehabadi, The potential of different
artificial neural network (ANN) techniques in daily global solar radiation modeling based on
meteorological data. Sol. Energy 84, 1468–1480 (2010)
26. F.J. Diez, L.M. Navas-Gracia, L. Chico-Santamarta, A. Correa-Guimaraes, A. Martínez-
Rodríguez, Prediction of horizontal daily global solar irradiation using artificial neural networks
(ANNs) in the Castile and León region, Spain. Agronomy 10(96), 2–20 (2020). https://doi.org/
10.3390/agronomy10010096
27. H. Citakoglu, Comparison of artificial intelligence techniques via empirical equations for
prediction of solar radiation. Comput. Electron. Agric. 118, 28–37 (2015)
A Concise Review on Automatic Text
Summarization

Dishank Jani, Nehal Patel, Hemant Yadav, Sanket Suthar, and Sandip Patel

Abstract Today, data is among the most important resources humanity relies on;
understanding the semantics of such large volumes of text is not practically possible
manually, so text summarization has been introduced as a problem in natural
language processing (NLP). Text summarization is the technique of condensing a long
text corpus such that the semantics of the text do not change. This paper provides a
study of different text summarization methods up to Q3 2020. Text summarization
methods are broadly classified as abstractive and extractive. In this paper, more focus
is given to abstractive summarization; a concise review of most of the text summarization
methods to date is presented, along with evaluations and the advantages and
disadvantages of each method. At the end of the paper, the challenges faced by
researchers in this task and the improvements that can be made to each summarization
method are described in a structured way.

Keywords Abstractive summarization · Extractive summarization · Graph-based summarization · Rule-based approach

D. Jani (B) · N. Patel · H. Yadav · S. Suthar · S. Patel


K D Patel Department of Information Technology, Faculty of Technology and Engineering (FTE),
Chandubhai S. Patel Institute of Technology (CSPIT), Charotar University of Science and
Technology (CHARUSAT), Changa, Gujarat 388421, India
N. Patel
e-mail: nehalpatel.it@charusat.ac.in
H. Yadav
e-mail: hemantyadav.it@charusat.ac.in
S. Suthar
e-mail: sanketsuthar.it@charusat.ac.in
S. Patel
e-mail: sandippatel.it@charusat.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 523
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_40
1 Introduction

NLP is among the most important research areas in machine learning. Text summa-
rization is an application of NLP among others. There are two types of text summa-
rization, abstractive and extractive summarization. In extractive summarization, the
sentences of the corpora are given a rank based on some techniques and the sentences
with the highest rank are selected. In abstractive summarization, the model has to
develop semantics of corpora and use its own ability to create its sentences with
help of various algorithms and techniques. The input structure used can be multi-
document or single document. In a single document, the length of the text is small,
while in the multi-document case the length is unpredictable. Automatic text summarization
was first introduced in 1958 by Luhn, who used a term-frequency approach
to generate the model; it is the simplest model and consists of five steps. The
classification of summarization methods is illustrated in Fig. 1. The first full-scale
summarizer was built in 2003 by Alonso i Alemany and Fuentes Fort, which was
based on the lexical chain concept. In 2007, Dipanjan et al. focused on three main
aspects of a fine text summarization model, which are:

[Fig. 1 is a taxonomy diagram. Summarization methods divide into abstractive (NLP-based) and extractive methods. Abstractive methods split into structured-based methods (rule based, tree based, graph based, and ontology based, the last evaluated with metrics such as EvaLaxon, OntoClean, natural language metrics, and OntoMetric), semantic-based methods (multimodal semantic, information item based, and semantic graph based), and hybrid summarization. Extractive methods split into supervised methods and unsupervised methods (fuzzy logic analysis, latent semantic analysis (LSA), frequency based, and rank based).]

Fig. 1 Classification of summarization methods


1. It should be constructed from a proper input structure.
2. The summary must be as short as possible.
3. The summary must be clear and understandable.
The rest of the paper is structured as follows. Section 2 discusses abstractive
text summarization in detail. Section 3 presents extractive text summarization
with its advantages and limitations. Unsupervised text summarization is discussed in
Sect. 4. In Sect. 5, we mention various datasets for text summarization. Evaluation
methods are discussed in Sect. 6, and a conclusion is provided in Sect. 7.

2 Abstractive Text Summarization Techniques

In today’s world, abstractive text summarization has gained a lot of popularity due to
its ability to generate sentences. Abstractive summaries are generally hard to
understand and generate [1]. Most abstractive approaches are supervised.
There are mainly three approaches for abstractive summarization: structured-based,
semantic-based (using NLP), and hybrid methods [2]. Figure 2 shows the
common flow of text summarization.

2.1 NLP Architectures for Text Summarization

There are mainly two NLP architectures for abstractive text summarization, which
hold about 90% of the research area:
• Pointer generator architecture
• Seq2Seq architecture [3]
• A combination of the encoder-decoder with a pointer generator network.
The use of long short-term memory (LSTM) or gated recurrent unit (GRU) layers
in the model gives faster and more efficient results compared to a basic RNN.
LSTM is used when accuracy is more important, while GRU is used when speed
is more important. Today, abstractive text summarization using NLP approaches is
considered one of the most effective methods and is widely adopted by researchers.
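The gating that makes a GRU lighter than an LSTM (fewer parameters and no separate cell state) can be illustrated with a single forward step in NumPy; the weights below are random placeholders rather than a trained summarization model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step: update gate z, reset gate r, candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h_prev @ Uz)          # how much of the state to update
    r = sigmoid(x @ Wr + h_prev @ Ur)          # how much past state to expose
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh)
    return (1 - z) * h_prev + z * h_cand       # interpolate old and new state

rng = np.random.default_rng(1)
d_in, d_hid = 4, 6
# Six weight matrices: input->hidden and hidden->hidden for each of z, r, h~
params = [rng.normal(0, 0.1, s) for s in [(d_in, d_hid), (d_hid, d_hid)] * 3]

h = np.zeros(d_hid)
for t in range(5):                              # run over a 5-step input sequence
    h = gru_step(rng.normal(size=d_in), h, params)
print(h.shape)
```

Because each new state is a convex combination of the previous state and a tanh candidate, the hidden values stay bounded in (-1, 1), which is part of why GRUs train stably.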

[Fig. 2 pipeline: Load dataset → Data cleaning → Lemmatization → Create word vocabulary → Word embedding methods → Text summarization model → Training the model and evaluation]

Fig. 2 Text summarization common flow



2.2 Word Embeddings

From the paper on pre-trained GloVe embeddings by Pennington et al. [4], we get
a clear perspective that GloVe is more accurate than Word2Vec. The Word2Vec
approach was proposed by Mikolov et al. at Google [5]. GloVe is an unsupervised
learning algorithm. It performs with nearly 70% accuracy on the Wiki and Gigaword
datasets. Thus, for a large dataset, GloVe performs significantly better than Word2Vec.
Two models were proposed by the authors for prediction, namely Skip-gram and
continuous bag of words (CBOW) [6]. In CBOW, the order of words does not matter,
while in Skip-gram one-hot vector encoding is established. The CPU utilization
of Skip-gram is lower. The time for training Skip-gram is three times more than
Google’s Word2Vec.
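Whichever embedding is chosen, downstream summarizers typically compare words through the cosine similarity of their vectors. A toy illustration with made-up 4-dimensional vectors (real GloVe or Word2Vec vectors are 50-300 dimensional and would be loaded from pre-trained files):

```python
import numpy as np

# Hypothetical embedding table; real GloVe/Word2Vec vectors are loaded from file
emb = {
    "sun":   np.array([0.9, 0.1, 0.3, 0.0]),
    "solar": np.array([0.8, 0.2, 0.4, 0.1]),
    "tree":  np.array([0.1, 0.9, 0.0, 0.4]),
}

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["sun"], emb["solar"]))  # high: related words
print(cosine(emb["sun"], emb["tree"]))   # low: unrelated words
```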

2.3 Structured-Based Methods

The following are the structured-based methods used.

2.3.1 Rule-Based Method

This method can be viewed in two categories: a generation-based approach [7] and a
revision-based approach. The rule-based approach consists of six steps according to
Pimpalshende and Mahajan [8]:
• Text preprocessing to classify text into different subfields based on its domain.
• Removing unnecessary and null values (decomposition).
• Sentence tokenization and feature vector generation by analysis of generated
subfields.
• Eliminating stop words.
• On the basis of the feature vector, a similarity matrix and phrases are generated.
• Selection of most likely sentences based on the probabilities of context concerning
the target (attention mechanism).
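The six steps above can be sketched as a minimal pipeline; the stop word list and frequency-based sentence scoring below are simplified stand-ins for the domain rules and attention mechanism described in [8]:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "to", "and", "in", "it"}  # tiny placeholder list

def summarize(text, n_sentences=1):
    # Sentence tokenization
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole corpus, with stop words eliminated
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    # Score each sentence by the frequency of its content words (a stand-in for
    # the feature-vector / attention-based selection of the full method)
    def score(sent):
        toks = [w for w in re.findall(r"[a-z]+", sent.lower()) if w not in STOP_WORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:n_sentences])
    # Emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in chosen)

text = ("Solar energy is clean. Solar energy reduces emissions. "
        "It rained yesterday.")
print(summarize(text))
```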
Advantages:
• Powerful at linguistics.
• Most likely, sentences will have appeared at the starting of the summary only.
Limitation:
• Not efficient for multi-document text corpus.
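The six steps above can be sketched as a toy extractive pipeline. The stop-word list and term-frequency scoring below are simplified stand-ins for illustration, not the exact rules of [8]:

```python
# Toy rule-based extractive summarizer: tokenize sentences, drop stop
# words, build term-frequency feature vectors, score each sentence
# against the document, and select the top-ranked sentences.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    token_lists = [[w for w in re.findall(r"[a-z]+", s.lower())
                    if w not in STOP_WORDS] for s in sentences]
    return token_lists, sentences

def summarize(text, k=1):
    token_lists, sentences = preprocess(text)
    doc_tf = Counter(w for toks in token_lists for w in toks)
    # Score = average document-level frequency of a sentence's terms.
    scores = [sum(doc_tf[w] for w in toks) / (len(toks) or 1)
              for toks in token_lists]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return " ".join(sentences[i] for i in sorted(ranked))  # keep original order

doc = ("Text summarization shortens documents. "
       "Rule based methods use handcrafted rules. "
       "Summarization rules select important sentences from documents.")
print(summarize(doc))
```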
A Concise Review on Automatic Text Summarization 527

2.3.2 Graph-Based Method for Text Summarization

The graph-based method can be applied to both single-document (Kumaresh and
Ramakrishnan [9]) and multi-document [10] input corpora. The selection of proper
sentences in a structured manner for extraction plays a very crucial role. Selection
is done on the basis of ranking algorithms. Two types of ranking algorithms are
available [11]: TextRank and the shortest path algorithm. The TextRank algorithm
gives better accuracy than the shortest path algorithm because TextRank does not
depend only on the vertices of a graph; it depends highly on the context, according
to which it makes its predictions. In Yasunaga et al. [10], a GRU for the graph-based
model is used, which is less accurate than a GRU with a similarity graph, while
a GRU with a personalized discourse graph performs with the highest accuracy.
Graph-based text summarization for both single- and multi-document input formats
is proposed by Erkan and Radev [12]. The LexRank algorithm [12] for graph-based
summarization is implemented on the DUC 2003 and 2004 datasets with an average
ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) score of 0.3963
(DUC 2003) and 0.3966 (DUC 2004) for continuous LexRank. Mihalcea and Tarau
[13] have proposed a language-independent algorithm for graph-based text
summarization, but this method is of extractive output type and can be applied
only to single-document inputs.
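The TextRank idea described above can be sketched as PageRank-style power iteration over a sentence-similarity graph. The word-overlap similarity below is a simplified stand-in for the normalized overlap used in the original TextRank formulation:

```python
# Sketch of TextRank: sentences are vertices, edge weights are pairwise
# similarities, and sentence scores come from PageRank-style iteration.
import math

def overlap_similarity(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))

def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(weights[j])  # total outgoing weight of vertex j
                if weights[j][i] > 0 and out > 0:
                    rank += weights[j][i] / out * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new
    return scores

sents = ["the cat sat on the mat",
         "the cat chased the mouse",
         "stock prices fell sharply today"]
scores = textrank(sents)
print(scores.index(max(scores)))  # index of the most central sentence
```

The off-topic third sentence receives the lowest score because it shares no edges with the others.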

2.3.3 Tree-Based Method

Data processing is done by Ajiambo et al. [14]. The whole corpus is divided into
different phrases, and a decision tree is developed based on common words. Selection
of the common phrases is done with searching algorithms like beam search or sometimes
the theme algorithm. In Gupta and Lehal [15], a nested tree implementation of
tree-based summarization is conveyed. Advantage: the summary is smaller with a higher
ROUGE score compared to the traditional method. Limitation: the nested tree-based
method [16] is limited to single-document input only; it can be transformed to
multi-document by the addition of an RST parser.

2.3.4 Ontology-Based Method

The ontology-based approach is used in both extractive (Hennig et al. [17])
and abstractive summarization. Mohan et al. [18] expressed the ontology method
as a subfield of information-based methods. This is a domain-specific approach.
Advantages: it helps identify the structure of data among different groups of people
and allows scientists to reuse domain knowledge, which enables them to apply a
query-based approach. Limitations: adding more constraints makes this model
inefficient and inconsistent for a large text corpus.

Fig. 3 The flow of semantic-based text summarization: Pronominal Resolution → POS Tagging →
Semantic Role Labeling → WordNet

2.4 Hybrid Summarization Method

Sahoo et al. [19] mainly focus on hybrid summarization of single-document input.
It uses the concept of sentence compression. According to them, hybrid summarization
consists of five steps: text preprocessing, ranking sentences based on features,
applying a rule-based approach, compressing the generated sentences, and finally
selecting sentences using inference. ROUGE evaluation on the DUC 2002 dataset
shows a score of 0.5239 (average recall for R-1) and 0.4691 (average precision
for R-1). Advantage: 71% accurate. Disadvantages: the quality of the summary depends
on the quality of the compression method, and this approach is not as powerful as NLP
approaches such as encoder–decoder and pointer-generator architectures.

2.5 Semantic-Based Method

According to Aksoy et al. [20], there are four stages of the semantic-based approach.
Pronominal resolution understands the semantics of a word and categorizes it based on its
features; thus, with pronominal resolution, we can handle null and unknown words.
With the help of part-of-speech (POS) tagging, identification of each word as a
noun, pronoun, or adjective is evaluated. After POS tagging, the text preprocessing
is done. Figure 3 shows the semantic-based text summarization flow.
Semantic role labeling (SRL) [21] is used to obtain the relationship between each
word (target) when the context is known. WordNet is used to select the most likely
sentence from the vocabulary. The semantic-based method is further categorized
into three approaches. The multimodal document approach is made solely to handle
multi-document input corpora precisely [22]; it produces a symmetrical abstract
summary in salient as well as graphical form. The limitation of this model is that
it is evaluated manually. Another semantic-based method is the information-based
approach [23], in which extraction of graph-based vocabulary is performed. The
summary produced is 78% smaller than the input corpus with only a 3% loss.
The disadvantage of this method is that it faces issues in understanding the semantics
of sentences; thus, the quality of linguistics is poor. The semantic graph method is
an extension of the information-based approach but uses the context of the corpus to
predict the output. The graph produced by this approach is called a rich semantic
graph (RSG). The summary predicted is small and productive. The only limitation
of the semantic graph-based approach is that it works only on a single-document
input corpus. In all these techniques, the semantic representation of the corpus is fed
into the model, which then performs the above-mentioned four stages in a structured
and effective manner.

3 Extractive Text Summarization

Extractive text summarization includes the selection of important texts from the input
corpus and concatenating them for the output summary. Selection is done based on
rank, vital keywords, and phrases in the sentences [24]. For the ranking of sentences,
various parameters are considered such as title word, content word features, cue
phrases feature, biased word features, and upper case words [25]. Extractive text
summarization methods are of two types, unsupervised and supervised methods.
Unsupervised learning methods include a graph-based method, latent semantic anal-
ysis (LSA), fuzzy logic approach, and concept-based approach. Supervised learning
methods include summarization using Bayes rule, neural networks approach (using
NLP), and conditional random field (CRF). In this paper, we will mainly focus on
supervised abstractive text summarization approaches and unsupervised extractive
text summarization approaches. Extractive text summarization includes four steps
[26]:
• Text preprocessing
• Removing of stop words
• Stemming
• POS tagging.
The advantage of extractive text summarization is that it is easy to generate and
understand. There are rarely any grammatical errors in extractive text summarization
compared to abstractive methods as it consists of the same phrases and keywords as
that of the original input corpus. Major limitations of these methods include:
• Lack of understanding of the semantics of sentences.
• Inefficient analysis of the input corpus.
• Important keywords and phrases can sometimes get missed.
• Sentences at times become meaningless.
• Selected sentences are generally longer than other sentences; thus, unnecessary
phrases may sometimes get included.
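The four steps listed above can be sketched in a few lines. The stop-word list, suffix-stripping stemmer, and POS lexicon below are tiny illustrative stand-ins for real resources such as NLTK's stop-word corpus, the Porter stemmer, and a trained tagger:

```python
# Illustrative version of the four extractive preprocessing steps:
# tokenization, stop-word removal, stemming, and POS tagging.
import re

STOP_WORDS = {"the", "is", "at", "a", "of"}
POS_LEXICON = {"summarizer": "NOUN", "select": "VERB", "sentence": "NOUN"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())              # 1. text preprocessing
    tokens = [t for t in tokens if t not in STOP_WORDS]       # 2. stop-word removal
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # 3. crude stemming
    tagged = [(t, POS_LEXICON.get(t, "UNK")) for t in stems]  # 4. POS tagging
    return tagged

print(preprocess("The summarizer selects sentences."))
# [('summarizer', 'NOUN'), ('select', 'VERB'), ('sentence', 'NOUN')]
```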

4 Unsupervised Text Summarization

In this method, training data is not necessary. This machine learning technique
is not as accurate as supervised learning techniques. Its main goal is to select
words from the corpus, assemble them in a structured way, and place them in the
summary with the help of clustering. According to Aggarwal and Gupta [21], clustering
analysis is the process of assembling the text in a structured way to predict the
correct summary. There are four basic clustering algorithms for organizing the text:
K-means, fuzzy C-means, hierarchical clustering, and a mixture of Gaussians.
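As a sketch of the clustering step, the toy K-means below groups 2-D points standing in for sentence vectors; a real system would cluster TF-IDF or embedding vectors and then pick one representative sentence per cluster:

```python
# Toy K-means over 2-D "sentence vectors", illustrating the clustering
# step of unsupervised summarization.
def kmeans(points, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to its cluster mean.
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

points = [(0.1, 0.2), (0.0, 0.1), (0.9, 0.8), (1.0, 1.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(clusters)  # two groups of nearby points
```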

5 Datasets for Text Summarization

A dataset is a collection of data, which can be relational or non-relational in form.
There are various datasets available for testing and evaluating text summarization
models. Some of the well-known datasets are Amazon fine food reviews, the CNN/Daily
Mail dataset, and the Gigaword dataset. Datasets can be classified into two types:
public and private. In this paper, approximately 40 papers are reviewed; of them,
nearly 50–60% use public datasets. The majority of available datasets are public
(60–65%). The DUC dataset of the last 10 years is the most commonly used public
dataset [2]. The DUC dataset is widely used by researchers whose work is
information-oriented; for tasks that require processing more than one language,
multilingual/bilingual datasets are preferred. The DUC 2002 and 2004 datasets are
generally used for extractive text summarization purposes, but they fall behind for
real-time (abstractive) summarization. Datasets such as Amazon fine food reviews are
generally used for sentiment-oriented tasks. DUC datasets are preferred for
multi-document text summarization. The use of public datasets has comparatively
increased since 2015 due to their extensive amount of data and availability. The
pie-chart shows the use of datasets for research purposes [27].

6 Evaluation Methods

Evaluation methods are of two types [28]: human evaluation and automatic evaluation.
Automatic evaluation is performed based on algorithms such as the BLEU score [29]
or the ROUGE score. The BLEU score algorithm is generally recommended for machine
translation models, while for text summarization the ROUGE (Recall-Oriented
Understudy for Gisting Evaluation) score is preferable. This method was proposed
by Lin [31]. According to Lin, the ROUGE score is calculated on the basis of the
relationship between the predicted summary and the original summary; the greater
the difference, the lower the score. Different variations of the ROUGE score are
available for evaluation [22]; for instance, N-gram Co-Occurrence Statistics
(ROUGE-N), Longest Common Subsequence (ROUGE-L), Weighted Longest Common
Subsequence (ROUGE-W), and Skip-Bigram Co-Occurrence Statistics (ROUGE-S). The
comparison between different evaluation methods is shown in Fig. 4 [31]. ROUGE-N
(N-gram Co-Occurrence Statistics): this variation is similar to the BLEU score in
machine translation [32]. A gram length greater than 1 indicates how coherent the
summary is. We generally consider values of N between 1 and 9; ROUGE-1 and
ROUGE-2 are the most commonly used variations.
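A minimal ROUGE-N recall computation following the description above (an illustrative sketch, not the official ROUGE package):

```python
# ROUGE-N recall: count n-gram overlap between a candidate summary and
# a reference summary, divided by the number of reference n-grams.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped counting: each reference n-gram can only be matched as
    # often as it appears in the candidate.
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values()) if ref else 0.0

ref = "the cat sat on the mat"
cand = "the cat lay on the mat"
print(rouge_n_recall(cand, ref, n=1))  # 5 of 6 reference unigrams matched
```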
ROUGE-L (Longest Common Subsequence): we know the LCS is the subsequence of
maximum length common to two sentences. The score is predicted by calculating the
LCS of the original and predicted summaries and deriving the similarity index between
them. The similarity is derived from the intuition that the longer the common
subsequence between two sequences of text, the better the similarity. The subsequence
level can be either sentence level or summary level. This is the most effective ROUGE
evaluation technique if the predicted LCS is efficient.

Fig. 4 Rouge score variation comparison on DUC 2001 dataset

ROUGE-W (Weighted Longest Common Subsequence): normal LCS does not work
efficiently, as it is unable to distinguish spatial relationships. To overcome this, a
weighted component is appended to the traditional LCS, which is referred to as WLCS
(weighted LCS). The extra weight 'k' is in the form of a polynomial function. ROUGE-W
outperforms ROUGE-L for such spatial relations. ROUGE-S (Skip-Bigram Co-Occurrence
Statistics): a skip-bigram is a pair of words that may include gaps. ROUGE-S compares
the matching skip-bigrams and evaluates them on the basis of their difference. ROUGE-S
performs better than ROUGE-L. In ROUGE-S, the importance of the candidate score is
not considered; to overcome this, ROUGE-SU was introduced. We can obtain the
ROUGE-SU score by adding an <SOS> tag at the beginning of a sentence. Human
evaluation methods perform similarly to the BLEU score (for machine translation),
but their correlation is lower compared to ROUGE evaluation. We can achieve a 90%
correlation with the assistance of the ROUGE evaluation approach. On evaluation of
the DUC 2001 dataset, ROUGE-W performs significantly better than ROUGE-L and
ROUGE-SU. Table 1 presents a summarized view of all the methods along with their
summary type, reference papers, datasets, and efficiency using the ROUGE score.
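The LCS-based scoring described above can be sketched as follows; this computes a ROUGE-L F-measure from the longest common subsequence of the candidate and reference token streams (an illustrative sketch, not the official implementation):

```python
# ROUGE-L: F-measure derived from the longest common subsequence (LCS)
# of the candidate and reference token sequences.
def lcs_length(a, b):
    # Classic dynamic-programming LCS table.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(candidate, reference, beta=1.0):
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```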
Table 1 Comparison of various summarization techniques

Approach | Single document | Multi-document | Output summary type | Available languages | Dataset used for evaluation | ROUGE evaluation efficiency
Ontology-based [17] | Yes (domain specific) | No | Extractive | English | DUC 2002 | 0.5058 (R-1 F2)
Ontology-based [18] | Yes | No | Abstractive | English | DUC 2002 | 0.4636 (R-1)
Rule-based [8] | Yes | No | Abstractive | English, Hindi | DUC 2006 | 0.44151 (R-1)
Graph-based [11] | No | Yes | Abstractive and extractive | English | DUC 2004 | 0.393 (R-1)
Semantic-based [20] | Yes | No | Abstractive and extractive | English, Hindi | DUC 2006 | 0.39017
SumBasic [33] | No | Yes | Extractive | English | DUC 2005 | 0.26054
LSA [34] | No | Yes | Extractive | English, Hindi | DUC 2004 | 0.289 (ROUGE-L)
Fuzzy logic implementation [35] | Yes | Yes | Extractive | English, Hindi | DUC 2003 | 0.6257
Tree-based [36] | Yes | Yes | Abstractive | English | DUC 2004 | 0.354 (R-1)
Term-frequency [37] | Yes | No | Extractive | English | DUC 2005 | 0.672
LexRank [12] | Yes | Yes | Extractive | English | DUC 2003 and DUC 2004 | 0.3963 (DUC 2003) and 0.3966 (DUC 2004) for R-1
Lead and body [38] | Yes | No | Extractive | English | DUC 2005 | 0.22176 (ROUGE-2) for FreqDist-Term_Dice
Hybrid [39] | Yes | No | Abstractive | English | DUC 2002 | 0.464 (precision), 0.508 (recall), 0.485 (F-score) w.r.t. OPIOSIS reference 10%
Feature-based summarization [40] | Yes | No | Extractive | English | DUC 2004 | 0.6629 (R-1)

7 Conclusion

In this paper, we have studied and analyzed various research papers on different
extractive and abstractive text summarization methods. In abstractive text
summarization, we studied three main methods: structured, semantic, and hybrid.
In structured abstractive summarization, there are four variations: ontology-based,
graph-based, rule-based, and tree-based. Apart from that, we examined different
evaluation methods such as human evaluation, BLEU score, and ROUGE score evaluation.
The different types of datasets available for English and Hindi text, along with their
classification, are also described above in a concise manner. For all the above text
summarization methods, some problems are not solved yet. Rule-based text
summarization performs better than graph-based, but it performs poorly on
multi-document input corpora, while graph-based methods efficiently summarize
multi-document input but the generated summary can go out of order. Similarly, LSA
performs well but is unable to handle word embeddings in a precise way. Overall,
summaries generated using NLP ideas handle words better, even with the abstractive
approach. The seq2seq approach using encoder–decoder architecture produces a less
concise summary than pointer-generator architecture, while the summary produced by
pointer-generator architecture is more extractive-oriented. Encoder–decoder
architecture with a pointer-generator switch is a better alternative.

References

1. T. Shi, Y. Keneshloo, N. Ramakrishnan, C.K. Reddy, Neural abstractive text summarization
with sequence-to-sequence models. arXiv preprint arXiv:1812.02303 (2018)
2. D.K. Gaikwad, C. Namrata Mahender, A review paper on text summarization. Int. J. Adv. Res.
Comput. Commun. Eng. 5(3), 154–160 (2016)
3. M.-T. Luong, Q.V. Le, I. Sutskever, O. Vinyals, L. Kaiser, Multi-task sequence to sequence
learning. arXiv preprint arXiv:1511.06114 (2015)
4. J. Pennington, R. Socher, C.D. Manning, Glove: global vectors for word representation, in
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (2014), pp. 1532–1543
5. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781 (2013)
6. K. Al-Ansari, Survey on word embedding techniques in natural language processing, 16 Aug
2020, https://www.researchgate.net/publication/343686323
7. P.-E. Genest, G. Lapalme, Fully abstractive approach to guided summarization, in Proceedings
of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers) (2012), pp. 354–358
8. A. Pimpalshende, A.R. Mahajan, Ruled based text summarizer for history documents. Int. J.
Innov. Eng. Technol. (IJIET) 7(4) (2016)
9. N. Kumaresh, B.S. Ramakrishnan, Graph based single document summarization, in Interna-
tional Conference on Data Engineering and Management (Springer, Berlin, Heidelberg, 2010),
pp. 32–35
10. M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan, D. Radev, Graph-based neural
multi-document summarization. arXiv preprint arXiv:1706.06681 (2017)

11. K.S. Thakkar, R.V. Dharaskar, M.B. Chandak, Graph-based algorithms for text summarization,
in 2010 3rd International Conference on Emerging Trends in Engineering and Technology
(IEEE, 2010), pp. 516–519
12. G. Erkan, D.R. Radev, LexRank: graph-based lexical centrality as salience in text summariza-
tion. J. Artif. Intell. Res. 22, 457–479 (2004)
13. R. Mihalcea, P. Tarau, A language independent algorithm for single and multiple docu-
ment summarization, in Companion Volume to the Proceedings of Conference Including
Posters/Demos and Tutorial Abstracts (2005)
14. F. Ajiambo, C. Nzila, S. Namango, B. Deshmukh Ashvini, P. Shelke Pooja, A. Kokare Sayali,
S. Taware Saksha et al., Int. Res. J. Eng. Technol. (IRJET) 4(03) (2017)
15. V. Gupta, G.S. Lehal, A survey of text summarization extractive techniques. J. Emerg. Technol.
Web Intell. 2(3), 258–268 (2010)
16. M.S. Binwahlan, N. Salim, L. Suanmali, Swarm diversity based text summarization, in Inter-
national Conference on Neural Information Processing (Springer, Berlin, Heidelberg, 2009),
pp. 216–225
17. L. Hennig, W. Umbrath, R. Wetzker, An ontology-based approach to text summarization, in
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, vol. 3 (IEEE, 2008), pp. 291–294
18. M.J. Mohan, C. Sunitha, A. Ganesh, A. Jaya, A study on ontology based abstractive
summarization. Procedia Comput. Sci. 87, 32–37 (2016)
19. D. Sahoo, A. Bhoi, R.C. Balabantaray, Hybrid approach to abstractive summarization. Procedia
Comput. Sci. 132, 1228–1237 (2018)
20. C. Aksoy, A. Bugdayci, T. Gur, I. Uysal, F. Can, Semantic argument frequency-based multi-
document summarization, in 2009 24th International Symposium on Computer and Information
Sciences (IEEE, 2009), pp. 460–464
21. R. Aggarwal, L. Gupta, Automatic text summarization. Int. J. Comput. Sci. Mob. Comput.
6(6), 158–167 (2017)
22. C. Greenbacker, Towards a framework for abstractive summarization of multimodal documents,
in Proceedings of the ACL 2011 Student Session (2011), pp. 75–80
23. D. Mallett, J. Elding, M.A. Nascimento, Information-content based sentence extraction for
text summarization, in International Conference on Information Technology: Coding and
Computing, 2004. Proceedings. ITCC 2004, vol. 2 (IEEE, 2004), pp. 214–218
24. H.P. Luhn, The automatic creation of literature abstracts. IBM J. Res. Dev. 2(2), 159–165
(1958)
25. N. Moratanch, S. Chitrakala, A survey on extractive text summarization, in 2017 International
Conference on Computer, Communication and Signal Processing (ICCCSP) (IEEE, 2017),
pp. 1–6
26. A. El-Refaey, A.R. Abas, I. Elhenawy, Review of recent techniques for extractive text
summarization. J. Theor. Appl. Inf. Technol. 96(23), 7739–775 (2018)
27. A.P. Widyassari, S. Rustad, G.F. Shidik, E. Noersasongko, A. Syukur, A. Affandy, Review of
automatic text summarization techniques & methods. Journal of King Saud Univ. Comput. Inf.
Sci. (2020). https://doi.org/10.1016/j.jksuci.2020.05.006
28. M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E.D. Trippe, J.B. Gutierrez, K. Kochut, Text
summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268 (2017)
29. K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation
of machine translation, in Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (2002), pp. 311–318
30. P. Johri, A. Kumar, Review paper on text and audio steganography using GA, in International
Conference on Computing, Communication & Automation (IEEE, 2015), pp. 190–192
31. C.-Y. Lin, Rouge: a package for automatic evaluation of summaries, in Text Summarization
Branches Out (2004), pp. 74–81
32. K.V. Kumar, D. Yadav, An improvised extractive approach to Hindi text summarization, in
Information Systems Design and Intelligent Applications (Springer, New Delhi, 2015), pp. 291–
300

33. L. Vanderwende, H. Suzuki, C. Brockett, A. Nenkova, Beyond SumBasic: task-focused summarization with sentence simplification and lexical expansion. Inf. Process. Manage. 43(6),
1606–1618 (2007)
34. M.G. Ozsoy, F. Nur Alpaslan, I. Cicekli, Text summarization using latent semantic analysis. J.
Inf. Sci. 37(4), 405–417 (2011)
35. F. Kyoomarsi, H. Khosravi, E. Eslami, M. Davoudi, Extraction-based text summarization using
fuzzy analysis. Iran. J. Fuzzy Syst. 7(3), 15–32 (2010)
36. D. Bacciu, A. Bruno, Text summarization as tree transduction by top-down TreeLSTM, in 2018
IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE, 2018), pp. 1411–1418
37. H. Christian, M.P. Agus, D. Suhartono, Single document automatic text summarization using
term frequency-inverse document frequency (TF-IDF). ComTech Comput. Math. Eng. Appl.
7(4), 285–294 (2016)
38. L.H. Reeve, H. Han, S.V. Nagori, J.C. Yang, T.A. Schwimmer, A.D. Brooks, Concept frequency
distribution in biomedical text summarization, in Proceedings of the 15th ACM International
Conference on Information and Knowledge Management (2006), pp. 604–611
39. C.S. Yadav, A. Sharan, Hybrid approach for single text document summarization using
statistical and sentiment features. Int. J. Inf. Retr. Res. (IJIRR) 5(4), 46–70 (2015)
40. K. Bafna, D. Toshniwal, Feature based summarization of customers’ reviews of online products.
Procedia Comput. Sci. 22, 142–151 (2013)
Identification of Heart Failure in Early
Stages Using SMOTE-Integrated
AdaBoost Framework

B. Kameswara Rao, U. D. Prasan, Mokka. Jagannadha Rao,
Rajyalaxmi Pedada, and Pemmada Suresh Kumar

Abstract Heart disease, often known as Cardiovascular disease is one of the most
lethal yet silent killers of humans, resulting in a rise in the mortality rate of sufferers
per year. Every year, it kills nearly 17 million people worldwide in myocardial
infarctions and cardiac attacks. Heart failure (HF) occurs when the heart cannot
pump enough blood to satisfy the body's needs. On the other hand, current risk
prediction techniques are only moderately effective because statistical analytic
approaches fail to capture prognostic information in big data sets with
multi-dimensional interactions. This research investigates the proposed AdaBoost
ensemble technique with the Synthetic Minority Oversampling Technique (SMOTE) on
the medical reports of 299 heart failure patients obtained during their follow-up
period at the Faisalabad Institute of Cardiology (Punjab) and Allied Hospital
Faisalabad (Pakistan) during April–December 2015. The proposed approach builds on
ensemble learning techniques such as adaptive boosting. It provides a decision
support mechanism for medical practitioners to identify and forecast heart disease
in humans based on heart disease risk factors. The efficacy of the proposed method
is validated by comparison with various machine learning algorithms, and it is
evident that the proposed method performs better, with an accuracy of 96.34%.

Keywords Heart failure prediction · SMOTE · Adaptive boosting · Machine
learning · Ensemble learning

B. Kameswara Rao · U. D. Prasan · P. S. Kumar (B)


Department of Computer Science and Engineering, Aditya Institute of Technology and
Management (AITAM), Tekkali, India
Mokka. Jagannadha Rao
Department of Geology, Andhra University, Visakhapatnam, AP, India
R. Pedada
Department of Computer Science and Engineering, Dr Lankapalli Bullayya College of
Engineering, Visakhapatnam 530013, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 537
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_41
538 B. Kameswara Rao et al.

1 Introduction

Heart issues are currently a major source of worry in the medical community. The
heart plays a vital role among the human body's organs. A lack of blood circulation
in the human body leads to heart failure and causes death within minutes. Major risk
factors of heart disease are unhealthy blood cholesterol, tobacco and alcohol use,
diabetes mellitus, obesity, a diet high in saturated fats, age, family history, lack of
physical exercise, and poor diet [1]. As per World Health Organization (WHO)
reports, nearly 17.5 million people lose their lives due to coronary artery disease
(CAD) [2]. Various types of heart disease include arrhythmia (occurs due to heartbeat
abnormality), atherosclerosis (this condition leads to limits of oxygen flow to the
organs, which causes heart stroke), hypertensive heart disease (leads to a thickness
of heart muscles and heart failure), coronary artery disease (also called as ischemic
heart disease), congenital heart defects, pulmonary valve stenosis (this condition
arises before birth), heart infections (bacteria or viruses cause this condition). Some
of the symptoms of heart disease problems are chest pain, breathlessness, fatigue,
stomach pain, sweating, irregular heartbeat, arm or leg pain, depression, and swollen
ankles [3–5].
In recent times, providing the best quality services and effective diagnosis is chal-
lenging in the medical field. The severity of heart disease is the leading cause, which
may lead to sudden death. Heart disease, however, can be efficiently detected in its
early stages and treated, controlled, and managed. An electrocardiogram (ECG) is a
tool that examines the heartbeat rate and shows the possible functioning and irreg-
ularities of the heartbeat. Several clinicians are still unable to address the precise
needs of heart disease patients. However, it is essential to define an accurate diag-
nosis system to avoid problems of heart disease. Therefore, it is necessary to develop
a diagnostic plan based on ECG data and machine learning methods to classify heart
diseases and detect the problems in the cardiovascular system. Machine learning
is mainly used in the medical field to effectively diagnose, detect, and predict
various diseases. Machine learning (ML) is progressively used to predict various
heart diseases, and it is a subset of artificial intelligence (AI). Nowadays, it is crit-
ical to be able to sense the input data and decide the given task in the absence
of human intervention. Machine learning works based on the models by receiving
input data and applying mathematical or statistical models to predict the outputs.
Various ML algorithms are utilized for daily activities in different domains; in the
healthcare domain especially, more research is being conducted to forecast the
severity level of disease [6–9]. Individual machine learning algorithms are combined
to form an ensemble learning model. The ensemble learning model provides higher
accuracy and addresses issues faced by single machine learning algorithms, such as
time-consuming data collection, error-prone methods, and selecting the correct
algorithm.
The major contributions in the article include:

• The ensemble learning method adaptive boosting is proposed for the identification
of heart disease in its early stages.
• The class imbalance issue in the data is addressed by the oversampling
technique SMOTE.
• The performance of the proposed method is compared with different ensemble and
ML models such as bagging, stacking, K-nearest neighbor (KNN), multi-layer
perceptron (MLP), linear discriminant analysis (LDA), quadratic discriminant
analysis (QDA), decision tree (DT), logistic regression (LR), and Gaussian Naive
Bayes (GNB).
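The SMOTE interpolation step can be sketched from scratch as below; in practice one would use imbalanced-learn's `SMOTE` together with scikit-learn's `AdaBoostClassifier` (those library names describe common tooling, not the paper's exact implementation):

```python
# Illustrative SMOTE step: synthesize a minority-class sample by linear
# interpolation between a minority point and one of its nearest minority
# neighbors. The three 2-D points below are toy values for illustration.
import random

def nearest_neighbor(point, others):
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def smote_sample(minority, rng):
    point = rng.choice(minority)
    neighbor = nearest_neighbor(point, [p for p in minority if p != point])
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(point, neighbor))

rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.1)]
synthetic = smote_sample(minority, rng)
print(synthetic)  # lies on a segment between two minority points
```

Repeating this step until the classes are balanced, then fitting the boosted ensemble on the augmented data, mirrors the SMOTE-then-AdaBoost pipeline the paper describes.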
The remainder of the paper is divided into five sections. Section 2 discusses the
literature on heart disease prediction; Sect. 3 describes the proposed approach.
Section 4 depicts the experimental setup for the proposed technique, including the
empirical data, data preprocessing, simulation environment, parameter settings,
compared methodologies, and performance measures used to validate the suggested
method; Sect. 5 analyzes the results; and finally, Sect. 6 concludes the paper.

2 Literature Study

The diagnosis of cardiac disease in a patient is difficult for healthcare professionals
since it necessitates various facts from multiple sources, such as laboratory test
reports and equipment. The literature on heart disease prediction is presented here.
Singh and Kumar [6] investigated several machine learning algorithms to predict
cardiac disorders using a dataset with 14 patient features from the UCI ML
repository. The performance of the K-nearest neighbor algorithm, with an accuracy of
87%, is significantly better than that of decision tree, support vector machine, and
linear regression. Rajdhan et al. [9] analyzed various data mining techniques along
with random forest. Experimentation was done with the Cleveland dataset of 14
attributes, collected from the UCI ML repository. Random forest obtained a better
accuracy of 90.16%, outperforming techniques such as Naive Bayes, logistic
regression, and decision tree. Masetic and Subasi
[10] examined various ML techniques such as decision tree, k-nearest neighbor,
support vector machine (SVM), random forest, and artificial neural networks to
classify whether a patient had congestive heart failure or not. Feature extraction has
been done using the auto-regression Burg method. Performance has been measured
using different statistical metrics such as accuracy, specificity, sensitivity, f -measure,
and ROC curve. Random forest performed well among various comparative methods
with an accuracy of 100%. Literature on heart disease using various machine learning
algorithms has been presented in Table 1.

Table 1 Literature on heart failure disease

1. Dataset: Allied Hospital in Faisalabad (Punjab, Pakistan) and Faisalabad Institute of Cardiology, during April–December 2015. Methods: random forest, decision tree, linear regression, gradient boosting, artificial neural network, one rule, SVM radial, Naïve Bayes, K-nearest neighbor, SVM linear. Performance: random forest—0.740 (accuracy). Evaluation factors: accuracy, f1-score, precision, AUC, TP rate, ROC-AUC, TN rate [11]
2. Dataset: random clinical data. Methods: supervised learning, unsupervised learning, deep learning. Performance: random forest—0.963 (accuracy). Evaluation factors: accuracy, sensitivity, specificity, ROC-AUC [12]
3. Dataset: clinical records from the UCI repository. Methods: random forest, support vector machine, K-nearest neighbor, decision tree, artificial neural network, Naïve Bayes. Performance: random forest—94.31 (accuracy). Evaluation factors: accuracy, TP rate, PCR area, ROC area [13]
4. Dataset: random clinical data. Methods: stretch-driven growth model, hierarchical modeling, Bayesian inference, Gaussian process regression, logistic regression, support vector machine. Performance: stretch-driven growth model—52.7% (average). Evaluation factor: average [14]
5. Dataset: UCI heart disease dataset. Methods: random forest, decision tree, Naïve Bayes. Performance: decision tree—93.19, Naïve Bayes—87.27, random forest—89.14, support vector machine—92.30, logistic regression—87.36 (accuracy). Evaluation factor: accuracy [15]
6. Dataset: random clinical data. Methods: machine learning assessment of risk and early mortality in heart failure (MARKER-HF) risk model, boosted decision tree algorithm. Performance: MARKER-HF—0.88 (AUC), 95% (CI). Evaluation factors: AUC, TPR, CI, TNR [16]
Identification of Heart Failure in Early Stages … 541

Table 1 (continued)

7. Dataset: enterprise data warehouse, research patient data repository. Methods: logistic regression, gradient boosting, max out networks, deep unified networks, cost-saving evaluation—connected cardiac care program (CCCP). Performance: deep unified networks—76.4% (accuracy). Evaluation factor: accuracy [17]
8. Dataset: records of 1106 heart failure patients (MADIT-CRT). Methods: multiple kernel learning, K-means clustering. Performance: multiple kernel learning—95%, K-means clustering—95% (CI). Evaluation factors: CI, hazard ratio (HR) [18]
9. Dataset: Allied Hospital Faisalabad, Pakistan, and Institute of Cardiology, Apr–Dec 2015 data. Methods: Cox regression, Kaplan–Meier plot, Martingale residuals, bootstrapping. Performance: Cox regression—81% (discrimination ability). Evaluation factors: calibration slope, ROC curve, discrimination ability [19]
10. Dataset: random clinical data. Methods: multistep modeling strategy, EMR-wide predictive model. Performance: EMR-wide predictive model—0.78 (AUC), 83.19% (accuracy). Evaluation factors: AUC, accuracy [20]

3 Proposed Method

Freund et al. [21] presented an ensemble learning technique called adaptive boosting (AdaBoost). Base learner classifiers are built according to a distribution of weights over the dataset, where the predictions of the previous base learners determine the weights of the instances. If a prediction on an instance causes a misclassification, the weight of that instance is increased for the next model; otherwise, the weight remains unchanged. The final decision is made by a weighted vote of the base learners, with the weights based on the models' misclassification rates. Usually, decision trees are used as base learners in AdaBoost; a model with high prediction accuracy receives a high weight, and a model with low prediction accuracy receives a low weight.

AdaBoost Algorithm
1: Initialize the weights w_p^(1) = 1/P, where P is the number of instances in the data
2: While q < Q do, where Q is the number of models that need to be grown:
2.1 A model is constructed on the weighted data, giving the hypothesis H_q(x_p), where x_p is the p-th instance of the dataset and y_p its corresponding label
2.2 Compute the error e_q for the training set, summing over all P data points, using Eq. 1:

e_q = [Σ_{p=1}^{P} w_p^(q) · G(y_p ≠ H_q(x_p))] / [Σ_{p=1}^{P} w_p^(q)]  (1)

where G(condition) returns 1 if the condition holds and 0 otherwise
2.3 Compute ψ_q as shown in Eq. 2:

ψ_q = log((1 − e_q)/e_q)  (2)

2.4 Update the weights of the training instances for the (q + 1)-th model as shown in Eq. 3:

w_p^(q+1) = w_p^(q) · exp(ψ_q · G(y_p ≠ H_q(x_p)))  (3)

3: Continue for Q iterations and compute the final functional output using Eq. 4:

f(x) = sign(Σ_{q=1}^{Q} ψ_q · H_q(x))  (4)
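As an illustration of the steps above, the following minimal Python sketch implements the same loop with one-dimensional decision stumps as base learners. It is an illustrative sketch, not the authors' implementation; the function names and toy data are hypothetical.

```python
import math

def stump_predict(threshold, polarity, x):
    # Decision stump on one feature: +polarity for x >= threshold, else -polarity.
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, w):
    # Exhaustively pick the (threshold, polarity) pair with the lowest weighted error.
    best = None
    for t in xs:
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(t, pol, xi) != yi)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                          # step 1: uniform instance weights
    ensemble = []                              # list of (psi_q, threshold, polarity)
    for _ in range(rounds):
        err, t, pol = fit_stump(xs, ys, w)     # step 2.1: base learner on weighted data
        err = max(err, 1e-10)                  # avoid log(1/0) for a perfect stump
        psi = math.log((1.0 - err) / err)      # step 2.3: model weight (Eq. 2)
        # Step 2.4: up-weight only the misclassified instances, then renormalize.
        w = [wi * (math.exp(psi) if stump_predict(t, pol, xi) != yi else 1.0)
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((psi, t, pol))
    return ensemble

def predict(ensemble, x):
    # Step 3: sign of the psi-weighted vote of the base learners (Eq. 4).
    score = sum(psi * stump_predict(t, pol, x) for psi, t, pol in ensemble)
    return 1 if score >= 0 else -1
```

On a small one-dimensional toy set such as xs = [1, 2, 3, 4, 5] with labels [-1, 1, 1, 1, -1], a few boosting rounds drive the training error to zero even though no single stump separates the classes.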

4 Experimental Setup

This section describes the dataset, the data preprocessing followed in the experiment, the simulation environment, the parameter settings of the proposed method and the various comparative classifiers, and the performance measures used to compare the proposed method with the comparable techniques.

4.1 Empirical Data

The clinical heart dataset considered for experimentation comprises the clinical medical histories of 299 heart patients collected from Allied Hospital in Faisalabad (Punjab, Pakistan) and the Faisalabad Institute of Cardiology during April–December 2015 [11]. Of the 299 patients, 105 are women and 194 are men, with ages ranging from 40 to 95 years. The dataset has 13 attributes covering essential, clinical, body, and lifestyle features of each patient; the name, type, and description of each feature are presented in Table 2. The dataset includes Boolean features such as high blood pressure, anemia, diabetes, smoking, and sex. 'Creatinine phosphokinase' (CPK) reflects the level of the CPK enzyme in the blood: when muscle tissue is damaged, CPK is released into the bloodstream. Abnormally high CPK levels in a patient's blood can indicate heart failure. The 'ejection fraction' indicates how much blood the left ventricle pumps out as a proportion of each contraction. 'Platelets' are the count of platelets in the

Table 2 Type and meanings of each feature of the dataset


Attribute Type Description
Age Numeric Age of a patient
Anemia Boolean Red blood cell or hemoglobin deficiency
Creatinine phosphokinase Numeric CPK enzyme levels in the blood
Diabetes Boolean Whether or not the patient is diabetic
Platelets Numeric Platelets that are in the blood
High blood pressure Boolean If hypertension is found in a patient
Sex Boolean 1—man, 0—woman
Serum creatinine Numeric In the blood, the creatinine level
Serum sodium Numeric In the blood, the sodium level
Ejection fraction Numeric The percentage of blood leaving the heart at each
contraction
Time Numeric Follow-up period
Smoking Boolean Patient smokes or not
DEATH_EVENT Boolean 1—dead, 0—alive

blood. 'Serum creatinine' is a waste product produced by creatine when a muscle breaks down; doctors use the serum creatinine level in the blood to monitor the functioning of the kidneys. Sodium is a mineral that helps nerves and muscles function correctly. The 'serum sodium' test is a common blood test that determines whether a patient's blood sodium level is normal. The target attribute in the proposed work is 'DEATH_EVENT', which indicates whether the patient died or survived before the conclusion of the follow-up period, which is 130 days on average.

4.2 Data Preprocessing

Data preprocessing is a critical activity that increases the quality of raw experimental data. It is a preliminary stage in which all the data are collected, sorted, organized, and merged. Data preprocessing can also significantly impact the generalization ability of a supervised machine learning algorithm. The dataset was checked for null values, and there are no missing or null values. The dependent variable 'death_event' is highly imbalanced, with '0': 203 and '1': 96, as presented in Fig. 1. The synthetic minority oversampling technique (SMOTE) is used in the experimentation to resolve the class imbalance. Class imbalance can be addressed by over-sampling the minority class or under-sampling the majority class; in this article, the minority class is over-sampled. Rather than simply duplicating minority-class instances, SMOTE generates new synthetic samples by interpolating between a minority instance and its nearest minority-class neighbors. SMOTE was applied before feeding the data to the classification models to obtain better accuracy. After applying SMOTE, the 'death_event' class consists of '0': 203 and '1': 203, as presented in Fig. 2.

Fig. 1 Death_Event class before SMOTE

Fig. 2 Death_Event class after SMOTE
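To make the oversampling step concrete, the sketch below shows the core SMOTE idea in plain Python: each synthetic sample is an interpolation between a randomly chosen minority instance and one of its k nearest minority neighbors. The function name and parameters are illustrative; this is not the imblearn implementation used in the experiments.

```python
import random

def smote_oversample(minority, k=2, n_new=4, seed=0):
    # minority: list of feature vectors (lists of floats) from the minority class.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` (itself excluded), by squared distance.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, nb)])
    return synthetic
```

New samples therefore lie on line segments between existing minority points rather than duplicating them, which is what distinguishes SMOTE from plain random oversampling.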

4.3 Simulation Environment and Parameter Setting

The experimentation was performed on Windows 10 Pro with an Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz (~2.5 GHz), a 64-bit operating system, and 8 GB RAM.

Table 3 Various classifiers parameter setting


Technique Parameter setting
AdaBoost base_estimator = DecisionTreeClassifier(criterion = ‘gini’,
splitter = ‘best’, random_state = 4, max_depth = 5)
n_estimators = 50; algorithm = ‘SAMME.R’;
learning_rate = 1.0
Stacking Estimators = [RandomForestClassifier(random_state = 2,
criterion = ‘gini’), DecisionTreeClassifier(criterion =
‘gini’, splitter = ‘best’, random_state = 3)]; meta_classifier
= LogisticRegression()
Bagging base_estimator = DecisionTreeClassifier(); bootstrap =
true; n_estimators = 100; random_state = 1
KNN algorithm = ‘kd_tree’; weights = ‘distance’; n_neighbors = 3
MLP hidden_layer_sizes = 15; activation = ‘relu’; batch_size =
10; random_state = 2; max_iter = 1300
LDA solver = ‘svd’; tol: 0.0001
QDA tol = 0.0002
LR random_state = 1; solver = ‘newton-cg’
GNB var_smoothing = 1e−09
Random forest random_state = 2; criterion = ‘gini’
Decision tree criterion = ‘gini’; splitter = ‘best’; random_state = 3
Stochastic gradient descent (SGD) random_state = 100; penalty = ‘l1’

Python frameworks such as sklearn (scikit-learn) are used to perform the data preprocessing tasks and to implement the various classification techniques, including the machine learning and ensemble learning algorithms. Data analysis tasks were carried out with the NumPy and Pandas frameworks. Data visualization was done using the Matplotlib and seaborn frameworks. The pycm module is used to compute the performance measures for multiclass classification. The class imbalance problem is solved by oversampling using the imblearn package. The different methods and their parameter settings are presented in Table 3.
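For reference, the AdaBoost row of Table 3 could be instantiated in scikit-learn roughly as follows. This is a hedged sketch on synthetic stand-in data, not the authors' script; note that recent scikit-learn releases renamed the base_estimator argument to estimator and deprecated algorithm = 'SAMME.R', so the base learner is passed positionally and the algorithm argument is left at its default here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Base learner as configured in Table 3.
base = DecisionTreeClassifier(criterion="gini", splitter="best",
                              random_state=4, max_depth=5)
# Ensemble: 50 boosting rounds with learning rate 1.0.
clf = AdaBoostClassifier(base, n_estimators=50, learning_rate=1.0)

# Synthetic stand-in for the clinical dataset (12 numeric features).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

The comparative classifiers in Table 3 can be built analogously from their listed parameter settings.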

4.4 Performance Measures

The proposed adaptive boosting classifier predicts the survival of heart failure patients from the clinical records of several patients. To validate the performance, the proposed method is compared with various machine learning and ensemble learning algorithms using different performance metrics such as the confusion matrix, true-positive rate, false-positive rate, precision, f1-score, accuracy, and ROC-AUC (area under the receiver operating characteristic curve) [22].
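For a binary task, all of these scalar measures derive from the four confusion matrix counts. The sketch below is an illustrative helper, not the pycm implementation used in the experiments; for a hard classifier evaluated at a single operating point, the ROC-AUC reduces to the mean of TPR and TNR.

```python
def classification_metrics(tp, fp, tn, fn):
    # Derive the scalar measures of a binary confusion matrix.
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn)                      # recall / sensitivity
    tnr = tn / (tn + fp)                      # specificity
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)
          if (precision + tpr) else 0.0)
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "recall": tpr, "fpr": fpr, "tnr": tnr,
        "precision": precision, "f1": f1,
        "roc_auc": (tpr + tnr) / 2.0,         # single-operating-point AUC
    }
```

Plugging in the AdaBoost counts reported later in Table 4 (TP = 36, FP = 0, TN = 43, FN = 3) reproduces the tabulated values: accuracy 96.34, recall 0.92, precision 1.00, F1 0.96, and ROC-AUC 0.96.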

5 Result Analysis

This section describes the prediction of heart failure survival using the AdaBoost ensemble technique and the validation of the proposed method against various ensemble learning and machine learning techniques. The experimental results for the measures mentioned above, i.e., true positive (TP), true negative (TN), false positive (FP), false negative (FN), precision, true-positive rate (TPR), F1, ROC-AUC, and false-positive rate (FPR), are presented in Table 4. Compared to other techniques such as stacking, bagging, logistic regression, decision tree, linear discriminant analysis, quadratic discriminant analysis, multi-layer perceptron, K-nearest neighbor, and Gaussian Naive Bayes, the results obtained by the proposed AdaBoost technique are clearly better. Among all methods, K-nearest neighbor performs the worst in relative efficiency, while AdaBoost performs the best.
From the assessment of the findings, the ensemble learning methods obtained accuracies between 90 and 96%, while the machine learning algorithms achieved between 63 and 89%. In detail, the AdaBoost classifier achieved the best accuracy of 96.34%, followed by stacking and bagging with 93.90% and 90.24%, respectively. The decision tree reached 89.02%; logistic regression and linear discriminant analysis obtained the same accuracy of 86.59%; quadratic discriminant analysis, Gaussian Naïve Bayes, and multi-layer perceptron produced accuracies of 82.93%, 81.71%, and 70.73%, respectively; and finally, K-nearest neighbor obtained 63.41%. True positives and true negatives signify correctly classified instances, while false positives and false negatives represent incorrectly classified instances. For the proposed AdaBoost classifier, TP and TN are 36 and 43, indicating that 36 patients are healthy and predicted as healthy, and 43 patients are unwell and predicted as unwell. FP and FN are 0 and 3, signifying that no unwell patient is predicted as healthy, while 3 healthy patients are predicted as unwell. In the case of recall, stacking and the decision tree produced the highest value of 0.95, followed by AdaBoost and the multi-layer perceptron with 0.92; bagging achieved 0.87; LDA and LR obtained 0.85; Gaussian Naïve Bayes and QDA reached 0.79; and K-nearest neighbor attained 0.67. For the FPR, AdaBoost obtained 0, followed by bagging, stacking, LDA, LR, QDA, DT, GNB, KNN, and MLP with 0.07, 0.07, 0.12, 0.12, 0.14, 0.16, 0.16, 0.40, and 0.49, respectively. For precision, AdaBoost delivered a value of 1.00, followed by stacking, bagging, LDA, LR, DT, QDA, GNB, MLP, and KNN with 0.93, 0.92, 0.87, 0.87, 0.84, 0.84, 0.82, 0.63, and 0.60, respectively. AdaBoost obtained the highest ROC-AUC value of 0.96; stacking, bagging, and the decision tree produced 0.94, 0.90, and 0.89; and LR, LDA, QDA, GNB, MLP, and KNN obtained 0.86, 0.86, 0.83, 0.82, 0.72, and 0.64, respectively. Considering all these results, the proposed method outperformed the various comparative ensemble learning and machine learning models.
Figure 3a–j presents the precision-recall curves for the machine learning techniques, the proposed AdaBoost classifier, and the other ensemble learning techniques. AdaBoost and stacking obtained the highest average precision value of 0.99, followed by bagging with 0.98; LDA, LR, GNB, MLP, DT, and KNN obtained average precisions of 0.93, 0.92,
Table 4 Performance evaluation of proposed and comparative methods
Intelligent technique Accuracy TP FP TN FN TPR (recall) FPR Precision TNR F1 ROC-AUC
AdaBoost 96.34 36.00 0.00 43.00 3.00 0.92 0.00 1.00 1.00 0.96 0.96
Stacking 93.90 37.00 3.00 40.00 2.00 0.95 0.07 0.93 0.93 0.94 0.94
Bagging 90.24 34.00 3.00 40.00 5.00 0.87 0.07 0.92 0.93 0.89 0.90
Decision tree 89.02 37.00 7.00 36.00 2.00 0.95 0.16 0.84 0.84 0.89 0.89

Logistic regression 86.59 33.00 5.00 38.00 6.00 0.85 0.12 0.87 0.88 0.86 0.86
Linear discriminant analysis 86.59 33.00 5.00 38.00 6.00 0.85 0.12 0.87 0.88 0.86 0.86
Quadratic discriminant analysis 82.93 31.00 6.00 37.00 8.00 0.79 0.14 0.84 0.86 0.82 0.83
Gaussian Naïve Bayes 81.71 31.00 7.00 36.00 8.00 0.79 0.16 0.82 0.84 0.81 0.82
Multi-layer perceptron 70.73 36.00 21.00 22.00 3.00 0.92 0.49 0.63 0.51 0.75 0.72
K-nearest neighbor 63.41 26.00 17.00 26.00 13.00 0.67 0.40 0.60 0.60 0.63 0.64

Fig. 3 Precision-recall curves of a AdaBoost, b stacking, c bagging, d DT, e LR, f LDA, g QDA,
h GNB, i MLP, j KNN

0.92, 0.90, 0.82, and 0.72, respectively. These results show that AdaBoost also performs very well in terms of average precision.
The prediction error is defined as the deviation between the predicted and actual values. For the proposed method, the 3 instances of class '0' predicted as class '1' are the incorrectly classified instances in the test data. The class prediction errors of the proposed technique and the other machine learning and ensemble learning techniques are shown in Fig. 4.
The accuracy of the proposed technique and the various comparative models is presented in Fig. 5. The proposed adaptive boosting method performed better than all the other methods.

6 Conclusion

In this article, we presented intelligent methods to strengthen assistance for patients affected by heart failure. The paper proposed a classification strategy to classify the survival of heart failure patients using the AdaBoost ensemble system. The proposed method is validated by comparison with various techniques such as stacking, bagging, GNB, LR, DT, LDA, QDA, MLP, and KNN. The efficiency of the classification models is improved by solving the class imbalance problem through the application of SMOTE to the data. Before comparing the models, every technique was tested with different parameters to achieve its highest accuracy. The findings show that AdaBoost outperformed the other machine learning algorithms in terms of the different performance metrics in predicting the survival of heart failure patients. The reported classification performance in predicting heart failure comes from AdaBoost's extremely appealing properties. AdaBoost lowers the classification error on the training data toward zero as the number of training steps grows, based on the modest assumption that each weak learner obtains an error rate below chance. AdaBoost also successfully optimizes the margin of the resultant ensemble classifier, in addition to reducing a given cost function. In the future, we will enhance the data with more

Fig. 4 Class prediction error for the various comparative methods and the proposed technique

Fig. 5 Accuracy of the various comparative methods and the proposed method

number of patient records and extend the work using sophisticated deep learning techniques on image data.

References

1. Y. Xing, J. Wang, Z. Zhao, Y. Gao, Combination data mining methods with new medical data
to predicting outcome of coronary heart disease (2008), pp. 868–872. https://doi.org/10.1109/
iccit.2007.204
2. J. Mackay, G. Mensah, The Atlas of Heart Disease and Stroke (World Health Organization, Geneva, 2004), p. 9
3. A.A. Ariyo et al., Depressive symptoms and risks of coronary heart disease and mortality in
elderly Americans. Circulation 102(15), 1773–1779 (2000). https://doi.org/10.1161/01.CIR.
102.15.1773
4. M.A. Whooley et al., Depressive symptoms, health behaviors, and risk of cardiovascular events
in patients with coronary heart disease. JAMA J. Am. Med. Assoc. 300(20), 2379–2388 (2008).
https://doi.org/10.1001/jama.2008.711
5. L.R. Wulsin, J.C. Evans, R.S. Vasan, J.M. Murabito, M. Kelly-Hayes, E.J. Benjamin, Depres-
sive symptoms, coronary heart disease, and overall mortality in the Framingham Heart
Study. Psychosom. Med. 67(5), 697–702 (2005). https://doi.org/10.1097/01.psy.0000181274.
56785.28
6. A. Singh, R. Kumar, Heart disease prediction using machine learning algorithms, in 2020
International Conference on Electrical and Electronics Engineering (ICE3), Feb 2020, pp. 452–
457. https://doi.org/10.1109/ICE348803.2020.9122958
7. D. Shah, S. Patel, S.K. Bharti, Heart disease prediction using machine learning techniques. SN
Comput. Sci. 1(6), 345 (2020). https://doi.org/10.1007/s42979-020-00365-y
8. R. Katarya, P. Srinivas, Predicting heart disease at early stages using machine learning: a survey,
in Proceedings of the International Conference on Electronics and Sustainable Communication
Systems, ICESC 2020 (2020), pp. 302–305. https://doi.org/10.1109/ICESC48915.2020.915
5586
9. D.P.G. Apurb Rajdhan, M. Sai, A. Agarwal, D. Ravi, Heart disease prediction using machine
learning. Lect. Notes Electr. Eng. 9(04) (2020)
10. Z. Masetic, A. Subasi, Congestive heart failure detection using random forest classifier. Comput.
Methods Programs Biomed. 130, 54–64 (2016). https://doi.org/10.1016/j.cmpb.2016.03.020
11. D. Chicco, G. Jurman, Machine learning can predict survival of patients with heart failure
from serum creatinine and ejection fraction alone. BMC Med. Inform. Decis. Mak. 20(1),
1–16 (2020). https://doi.org/10.1186/s12911-020-1023-5
12. C.R. Olsen, R.J. Mentz, K.J. Anstrom, D. Page, P.A. Patel, Clinical applications of machine
learning in the diagnosis, classification, and prediction of heart failure: machine learning in
heart failure. Am. Heart J. 229, 1–17 (2020). https://doi.org/10.1016/j.ahj.2020.07.009
13. S. Rahayu, J. Jaya Purnama, A. Baroqah Pohan, F. Septia Nugraha, S. Nurdiani, S. Hadianti,
Prediction of survival of heart failure patients using random forest. J. Pilar Nusa Mandiri 16(2),
255–260 (2020). [Online]. Available: www.ubsi.ac.id
14. M. Peirlinck et al., Using machine learning to characterize heart failure across the scales.
Biomech. Model. Mechanobiol. 18(6), 1987–2001 (2019). https://doi.org/10.1007/s10237-
019-01190-w
15. F.S. Alotaibi, Implementation of machine learning model to predict heart failure disease. Int.
J. Adv. Comput. Sci. Appl. 10(6), 261–268 (2019). https://doi.org/10.14569/ijacsa.2019.010
0637
16. E.D. Adler et al., Improving risk prediction in heart failure using machine learning. Eur. J.
Heart Fail. 22(1), 139–147 (2020). https://doi.org/10.1002/ejhf.1628
17. S.B. Golas et al., A machine learning model to predict the risk of 30-day readmissions in
patients with heart failure: a retrospective analysis of electronic medical records data. BMC
Med. Inform. Decis. Mak. 18(1), 1–17 (2018). https://doi.org/10.1186/s12911-018-0620-z

18. M. Cikes et al., Machine learning-based phenogrouping in heart failure to identify responders
to cardiac resynchronization therapy. Eur. J. Heart Fail. 21(1), 74–85 (2019). https://doi.org/
10.1002/ejhf.1333
19. T. Ahmad, A. Munir, S.H. Bhatti, M. Aftab, M.A. Raza, Survival analysis of heart failure
patients: a case study. PLoS ONE 12(7), 1–8 (2017). https://doi.org/10.1371/journal.pone.018
1001
20. K. Shameer et al., Predictive modeling of hospital readmission rates using electronic medical
record-wide machine learning: a case-study using Mount Sinai heart failure cohort, in Pacific
Symposium on Biocomputing 2017 (2017), pp. 276–287
21. Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning (1996)
22. B.K. Rao, P.S. Kumar, D.K.K. Reddy, J. Nayak, B. Naik, QCM Sensor-Based Alcohol Classi-
fication by Advance Machine Learning Approach (Springer, Singapore, 2021), pp. 305–320
A Comparative Study of Different
Forecasting Models for Energy Demand
Forecasting

Tanvir Islam, Saber Elsayed, Daryl Essam, and Ruhul Sarker

Abstract In recent years, many machine learning and artificial intelligence-based


models have been used for estimating energy demand across different scenarios.
However, not many papers focused on national energy requirements and especially
in the context of Australia. Australian Energy Market Operator (AEMO) uses a
traditional econometric model such as linear regression for forecasting annual energy
demand for each Australian state. In this paper, to find an appropriate model, we
conducted a comparative study of four different energy demand forecasting models
using the national-level monthly economic data from Australia. The models are
tested across various training lengths and types of input data sets. Results show that
machine learning-based models are more robust and more accurate across the whole
range of scenarios.

Keywords Energy demand forecasting · Machine learning · Linear regression ·


Artificial neural network

1 Introduction

Energy is a well-known driver of economic prosperity in any country. It is widely accepted that there is a link between energy pricing and energy consumption
in a country. Lack of adequate energy sources can lead to high energy prices which
can adversely affect the economy through inflation and low economic growth. For this

T. Islam (B) · S. Elsayed · D. Essam · R. Sarker


School of Engineering and IT, University of New South Wales (UNSW Canberra), Canberra,
Australia
e-mail: Tanvir.Islam@adfa.edu.au
S. Elsayed
e-mail: s.elsayed@adfa.edu.au
D. Essam
e-mail: d.essam@adfa.edu.au
R. Sarker
e-mail: r.sarker@adfa.edu.au

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 553
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_42
554 T. Islam et al.

reason, proper energy forecasting and planning are essential for continued prosperity
and economic growth.
Energy consumption forecasting and planning are traditionally done through econometric modeling in Australia and many other countries. The Australian Energy Market Operator (AEMO) lays down the methodology followed for various energy consumption estimates (details can be found in [1]). The basic tool they use is linear regression, which is the most common tool for econometric modeling. Linear regression models are usually good enough where the forecasted values are used in long-term planning and the fluctuation of the data is not significant. As this type of model is fitted using a least squares method, the estimated values depend on the average trend of the data used. However, hardly any data collected from a practical system shows a perfectly linear trend. Even if there is a clear upward or downward trend in the data, we cannot expect a perfect fit to a linear model. Practical data usually fluctuate, with or without cyclic or seasonal patterns. For example, the energy consumption pattern in summer is different from that in winter, and it varies with population growth and technology changes. According to the literature, the use of linear regression for such situations, specifically for short-term forecasting, is not a preferred option. However, it is possible to apply nonlinear regression using the linear regression methodologies, where the form of the nonlinear function must be provided by the user. The main drawback of this approach is that it is extremely hard to find the right form of the nonlinear function for a given set of data.
To fit the data into an appropriate nonlinear function, machine learning approaches
such as neural networks and support vector machines usually perform better than
standard linear regression time series models [2]. These approaches are very popular
for both short- and long-term predictions. In recent times, there has been a great
expansion in the usage of machine learning models for various applications related
to prediction problems.
To find an appropriate prediction approach of Australian energy consumption, in
this paper, we conduct a comparative study of four existing methodologies, namely
linear regression, feed-forward neural network, support vector machines, and extreme
gradient boosting method. These methods are implemented using a wide range of
data from the Australian energy sector and their performances are analyzed and
compared. We also consider the appropriate input selection, model selection, and
the robustness of forecast accuracy. Insights into the problems encountered in implementing the regression model are uncovered and reported. The results support the fact that machine learning techniques provide more accurate forecasting.
The paper is organized as follows. After the introduction, Sect. 2 presents a brief
literature review on different popular forecasting techniques. Section 3 describes the
details of the experimental study. The results are presented and analyzed in Sect. 4.
Finally, conclusions are drawn in Sect. 5.
A Comparative Study of Different Forecasting Models … 555

2 Literature Review

There is a wide range of techniques or models that have been used for energy
forecasting. These models can be broadly classified into the following categories:
traditional statistical techniques, artificial neural network techniques, evolutionary
computational techniques, and other techniques. A brief introduction and review of
these techniques are presented below.
First, we discuss the traditional statistical techniques, such as linear regression and a specific moving average technique, that are widely used in forecasting. Linear regression is a method to predict a target variable by fitting the best linear relationship between a dependent variable and the independent variables. The best fit (also known as the goodness of fit) is obtained by making the sum of the distances between the fitted line and the actual observations at each point as small as possible [3]. A straight-line equation is derived with constant values that give the least amount of error between the forecast values and the actual values. A simple linear regression equation is expressed as below:

Y = a + b ∗ X (1)

In this equation, Y is the dependent variable that must be forecasted, X is the independent variable, and a and b are two constants: a is known as the intercept, and b is the slope of the linear equation. When there is one independent variable, the model is known as simple linear regression; on the other hand, it is known as multiple regression when multiple independent variables are used. As indicated earlier, linear regression is suitable for a stable base model or trend model. As pointed out in [1], the Australian Energy Market Operator (AEMO) currently uses a linear regression model for forecasting electricity consumption in each state for small and medium enterprises.
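The constants a and b of Eq. 1 have a closed-form ordinary least squares solution. The following short sketch (illustrative only, not AEMO's tooling) fits them from paired observations:

```python
def fit_linear(xs, ys):
    # Ordinary least squares fit of Y = a + b * X.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x   # intercept
    return a, b
```

For data lying exactly on Y = 1 + 2X, the fit recovers a = 1 and b = 2; on fluctuating practical data, it returns the line minimizing the sum of squared errors.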
Moving average (MA) is a simple but popular forecasting approach for time series data, where the average of a given k values is taken and updated (moving forward from one time window to the next) with the addition of every new data point. Autoregressive integrated moving average (ARIMA) is a special MA model: autoregressive means that the model assumes a relationship between the forecast value and the measured values of earlier periods [4]; integrated means that observations are differenced between the current period and earlier periods to make the time series stationary; and moving average refers to the use of residual errors for estimating forecast values through a moving window over time [5].
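The plain moving average forecaster described above can be sketched in a few lines (illustrative only; ARIMA adds the autoregressive and differencing terms on top of this idea):

```python
def moving_average_forecast(series, k=3):
    # Forecast the next value as the mean of the last k observations.
    window = series[-k:]
    return sum(window) / len(window)

def rolling_forecasts(series, k=3):
    # One-step-ahead MA forecasts as the window moves over the series.
    return [sum(series[i - k:i]) / k for i in range(k, len(series))]
```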
Artificial neural networks (ANNs) are popular methods for forecasting using
complex data patterns [6]. ANNs are programmatic representations of a human or
biological neural network. The basic idea is that information is processed and passed
through different layers of connected neurons. A neuron can be activated based on
the strength of the input signal it receives. Weights are modified at different layers
to make the predictions closer to the actual values. At a minimum, an ANN has one input layer, one output layer, and one or more hidden layers. All layers have multiple connected nodes. The input layer takes in the input signal. Hidden layers act as
distillation layers, and they distill out important patterns from the input and pass
them onto the next layers. Hidden layers also have activation functions that help in
capturing the nonlinear relationships and convert their input to output. Layers and
nodes are connected through weights which change when the ANN is trained to
map an input to output. Later on, the ANN is tested with test data to predict outputs from unseen inputs [7]. The multilayer perceptron model (MLPM), which is also called a feed-forward neural network (FFNN), is a simple ANN that can be used for both classification and numerical prediction problems [8].
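The forward pass described above, in which each layer multiplies its inputs by weights, adds biases, and applies an activation, can be sketched in a few lines of plain Python (an illustrative toy, not a training-capable implementation; the weights used are hypothetical):

```python
def relu(values):
    # Rectified linear activation, applied element-wise.
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    # Fully connected layer: weights[j] holds the incoming weights of output node j.
    return [sum(i * w for i, w in zip(inputs, wj)) + bj
            for wj, bj in zip(weights, biases)]

def forward(x, layers):
    # layers: list of (weights, biases, activation-or-None), input to output.
    for weights, biases, activation in layers:
        x = dense(x, weights, biases)
        if activation is not None:
            x = activation(x)
    return x
```

Training would then adjust the weights and biases (e.g., by backpropagation) so that the forward pass maps each input closer to its target output.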
Machine learning is a way of using computer programs to automate tasks. However, machine learning systems are not explicitly programmed; rather, they are provided with many examples of data and solutions, which is called training the system. After that, machine learning systems can automate the tasks themselves because they have learned the rules [5]. ANNs are often at the heart of machine learning systems. Deep learning is learning through ANNs; here, "deep" refers to many layers of successive representations of information [5]. This is essentially many layers of neural networks structured to learn the input-output rules of a task. Some of the commonly used ANN techniques are discussed below.
Recurrent neural network (RNN) goes through an internal loop for processing
information. It reuses quantities calculated in the previous iteration of the loop. It is
a preferred model to use for sequential data [9]. In convolutional neural networks
(CNNs), there are three sections: convolutional layers, pooling layers, and fully
connected MLP layers [8].
Support vector machine (SVM) sets up a hyperplane and boundary lines in such
a way that the maximum number of data points are captured within the boundaries.
Support vector regressor (SVR) works on the same principles of SVM with minor
differences. It creates a correlation matrix based on training data. Relevant parameters
from the correlation matrix are subsequently extracted and used in estimator functions
that estimate outputs for test data [10].
Evolutionary computation (EC) techniques are widely recognized as stochastic
global search methods for solving complex optimization problems. Among the EC
techniques, a genetic algorithm (GA) is a popular approach. EC techniques are based
on the concept of natural selection, adaptation, and survival of the fittest. The process
uses a population of individuals (solution points) that are evolved through variation
operators such as crossover and mutation over several generations (iterations) until
a stopping criterion is met. As the forecasting methods usually apply optimization
techniques in minimizing the prediction errors, EC techniques can serve the purpose
of optimization. Hyperparameter optimization of machine learning models can be
done through EC techniques.
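As a minimal illustration of the GA loop described above (selection, crossover, mutation over generations), the sketch below minimizes a toy objective that stands in for a model's prediction error; all numbers are arbitrary:

```python
import random

def fitness(x):
    # Toy objective standing in for a model's prediction error.
    return (x - 3.0) ** 2

def evolve(pop_size=20, generations=40, mutation_rate=0.3):
    random.seed(0)
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)           # selection: best individuals first
        parents = population[: pop_size // 2]  # elitism: keep the top half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) / 2.0              # crossover: blend two parents
            if random.random() < mutation_rate:
                child += random.gauss(0, 1.0)  # mutation: random perturbation
            children.append(child)
        population = parents + children
    return min(population, key=fitness)

best = evolve()
print(round(best, 2))
```

The same loop applies to hyperparameter optimization by letting each individual encode a hyperparameter vector and using validation error as the fitness.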
As indicated earlier, we will conduct an experimental study of four forecasting
techniques that include linear regression, feed-forward neural network, support
vector machines, and the extreme gradient boosting method. These methods are
briefly reviewed below.
AEMO uses linear regression for various sectors of the economy [1]. The equation
below describes their consumption forecast for small-medium enterprises (SME)
A Comparative Study of Different Forecasting Models … 557

based on Gross State Product (GSP), electricity price, efficiency, climate change,
and shock factor.

SME_Cons = GSP_impact + electricity_price_impact + energy_efficiency_impact
           + climate_change_impact + shock_factor_impact          (2)

Lu and Zhang [11] proposed a long-term model for forecasting electricity
consumption. They used input variables: GDP, population, import, and export. They
proposed a multi-variable linear regression model that has been optimized using
a GA. The confidence intervals are varied through a GA to find optimum regres-
sion coefficients. Based on the comparison with traditional regression and the SVM
model, the GA-based regression model delivered the best results.
Feed-forward neural network models were used by [12] for electrical load fore-
casting performance analysis for smart grid programs. Feed-forward neural network
/MLP was also used in [13] for long-term load forecasting of Canada’s power grid.
Kumar and Dixit [14] successfully used MLP models for forecasting daily peak loads
for the city of Bangalore in India. The MLP models performed better than conven-
tional curve fitting techniques. ANNs were also successfully applied by Aydinalp
et al. [15] to model the energy requirement for the Canadian household sector. Yuan
et al. [16] presented a hybrid model for effectively modeling the long-term electricity
requirement for a certain region of China. The model was used to forecast annual
electricity consumption based on the input variables regional GDP, industrial-valued
added, and annual social fixed asset investment. For electric load forecasting, Park
et al. [6] built and trained an ANN to forecast the hourly load based on environmental
temperature. They achieved reasonable results based on the temperature data they
used. Kuo et al. [17] used a RNN to predict chaotic time series. Deep belief networks
were used by Ribeiro and Lopes [18] for predicting bankruptcy for French companies
from a database of financial information.
SVM was successfully used by Khan et al. [19] to model an hour ahead short-
term load requirements based on the historical data of their educational institution.
Shukla et al. [20] used an SVR for benchmarking the performance of two other
proposed models for near real-time load forecasting for a regional state of India.
Models were used to predict weekend and weekdays loads separately. SVR was used
in [21] to develop a mid-term load forecast incorporating factors that have come
into play due to electricity market reforms. Thissen et al. [22] investigated SVR
along with ARMA models and Elman networks as prediction approaches in the field
of chemometrics. SVR performed better than the other two models. More recently,
Du et al. [23] used an SVR model enriched with a gray wolf optimizer to predict the
life cycle costs of power transformers for electric power companies.
XGBoost stands for “Extreme Gradient Boosting.” The underlying algorithm is
founded upon an incrementally built decision tree that is gradient boosted. More
elaborately, “Gradient boosting is an approach where new models are created that
predict the residuals or errors of prior models and then added together to make the
final prediction. It is called gradient boosting because it uses a gradient descent
algorithm to minimize the loss when adding new models” [24]. Wang et al. [25] used
an XGBoost-based hybrid model to forecast building energy consumption success-
fully. The data was provided by the US national energy laboratory. XGBoost-based
hybrid models were used by Fan et al. [26] to accurately predict the short-term load
of a distribution network for an electricity company in China, and by Wang et al. [27]
to predict load demand with greater accuracy compared to status quo models and
techniques.
Although many approaches were published in the area of energy forecasting, most
had the following limitations: they used a single set of variables/features for judging
their respective model performances and none of the studies showed how the models
performed on multi-period forecasting.

3 Experimental Study

As previously discussed, various models have been proposed for energy demand
forecasting. Most of these methods can be categorized into two major categories:
(a) causal models and (b) historical data-based methods [28]. In the causal methods,
energy consumption is taken as the output variable, and economic, social, and
climatic variables are taken as the input variables. The focus of this research is on causal
models with the following objectives:
• Ascertain the set of variables to use as input variables,
• Ascertain the set of models that give reasonable accuracy, and
• Assess the robustness of the outputs.
In this study, we considered the following four models: linear regression, ANN
(feed-forward model), SVR, and XGBoost (Extreme Gradient Boost). The features
of these models are presented below.
(a) Linear Regression function was used from sklearn.linear_model python library
with all its default values.
(b) ANN (MLP/Feed-Forward Model)—Structure consisted of 1 input layer, 1
output layer, and 3 hidden layers. The hidden layers had 200, 112, and 50
neurons, respectively. The optimizer type was “adam”.
(c) The SVR function was used from the sklearn.svm module in Python. The kernel
type used was “rbf”. C had the default value of 1 and epsilon was 0.1. C is a
penalty factor for misclassified data points, and epsilon represents a margin of
tolerance within which no penalty is given to errors.
(d) XGBRegressor function was used from the xgboost python module. All the
default parameters were used. Default values can be found in [29].
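Under the settings listed above, the four models could be instantiated as follows (an illustrative sketch; the scikit-learn and xgboost APIs are as documented, and a scikit-learn gradient-boosting regressor stands in if the xgboost package is absent):

```python
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# (a) Linear regression with scikit-learn defaults.
lr = LinearRegression()

# (b) MLP with three hidden layers of 200, 112, and 50 neurons, "adam" optimizer.
mlp = MLPRegressor(hidden_layer_sizes=(200, 112, 50), solver="adam")

# (c) SVR with an RBF kernel, C=1 (misclassification penalty) and
#     epsilon=0.1 (no-penalty margin of tolerance).
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)

# (d) XGBoost regressor with default parameters; fall back to scikit-learn's
#     gradient boosting if xgboost is not installed.
try:
    from xgboost import XGBRegressor
    xgb = XGBRegressor()
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor
    xgb = GradientBoostingRegressor()

models = {"LR": lr, "MLP": mlp, "SVR": svr, "XGBoost": xgb}
```

Each model exposes the same fit/predict interface, so the four can be run over identical training and testing splits.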
In this study, we consider two sets of input variables for judging the performance of
four models under consideration as shown in Table 1. These two sets of features were
chosen based on their importance in affecting the output variable. The output variable
in both cases was the energy consumption (in tons of oil). Features contained in data
Table 1 Input variable sets for our study

  Data set           Alpha                            Beta
  Input variables    GDP                              GDP
                     Electricity market spot prices   Population
                     Gas prices                       Household electricity price index
                     Coal prices                      Consumer price index

set Alpha and Beta were selected based on SelectKBest results, which showed the
most important features; correlations among the features themselves were also
considered in creating data sets Alpha and Beta. Each feature was scaled from 0 to
100 with min–max feature scaling.
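A sketch of this selection-and-scaling step (toy data; the actual features and the value of k are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))           # 8 candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=120)

# Rank features by univariate relevance to the target.
selector = SelectKBest(score_func=f_regression, k=4).fit(X, y)
selected = selector.get_support(indices=True)

# Min-max scale each selected feature to the 0-100 range.
scaler = MinMaxScaler(feature_range=(0, 100))
X_scaled = scaler.fit_transform(X[:, selected])
print(sorted(selected), X_scaled.min(), X_scaled.max())
```

Correlations among the surviving features would then be inspected (e.g., via a correlation matrix) before fixing the Alpha and Beta sets.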
Each set of input variables (Alpha and Beta) was run through each of the four
models (LR, ANN, SVR, and XGBoost) described earlier. The accuracy of prediction
was noted for each case.

3.1 Data Source and Collection

Data was sourced primarily from [30, 31]. As part of preprocessing, annual data
was converted to monthly data and backcast where data was missing. Backcasting
was done mostly using functions available within MS Excel that provide the best fit
for a given set of data. Sample input data are shown in Fig. 1.

Year,Month,Population,GDP,CPI,Electricity_Market_Spot_Prices,Household_Electricity_price_index,Coal_Prices,Gas_Prices,Calculated_Energy_Use
1960,1,0,0,0,0,0,0,0,0
1960,2,0,0,0,0,0,0,0,0
...
1960,12,0,0,0,0,0,0,0,0
1961,1,0.114067099,0.033043091,0.013569047,0.102523687,0.141242938,0.084813454,0.141003855,0.083142697
1961,2,0.228134198,0.066086183,0.027138094,0.205047374,0.282485876,0.169626909,0.282007709,0.166285395
...

Fig. 1 Sample input data
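As an illustrative sketch (not the paper's actual Excel procedure), the annual-to-monthly conversion could be done in pandas by placing each annual value at its year-end month and interpolating linearly; the values below are made up:

```python
import pandas as pd

# Illustrative annual values (e.g., GDP); the figures are made up.
annual = pd.Series([100.0, 112.0, 120.0],
                   index=pd.to_datetime(["1960-12-31", "1961-12-31", "1962-12-31"]))

# Resample to month-end frequency and fill gaps by linear interpolation;
# missing leading values could similarly be backcast with a fitted trend.
monthly = annual.resample("M").interpolate(method="linear")
print(len(monthly))
```

Other interpolation methods (polynomial, spline) could approximate the "best fit" behavior of the Excel functions more closely.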


A sliding window is used for various periods of data. The window consisted of
five different periods. The periods of the training window are 1, 2, 3, 6, and 10 years.
The source data consist of monthly data, and the task of the research method is to
make predictions for multiple months (month-wise for 12 months) in each instance.
The experiments are run for five different training spans, namely 12, 24, 36, 48, and
120 months, and the testing period is one year for all runs. A sliding window was
slid over 30 years, from 2019 backward to 1990.
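The windowing scheme described above can be sketched as an index generator (an illustrative layout; the paper does not give its exact indexing):

```python
def sliding_windows(n_months, train_span, test_span=12, step=12):
    """Yield (train_indices, test_indices) pairs, sliding one year at a time.

    Index 0 is the most recent month; the window slides backward in time,
    mirroring the 2019-to-1990 sweep described above.
    """
    start = 0
    while start + train_span + test_span <= n_months:
        test = list(range(start, start + test_span))
        train = list(range(start + test_span, start + test_span + train_span))
        yield train, test
        start += step

# 30 years of monthly data, 120-month (10-year) training span, 12-month test.
splits = list(sliding_windows(n_months=360, train_span=120))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Repeating this with train_span set to 12, 24, 36, 48, and 120 reproduces the five training-span scenarios.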
Python 3.5 was used for building the models. Relevant functions were imported
from the sklearn, tensorflow, and xgboost modules. A laptop with a 4-core AMD
CPU and 8 GB RAM was used to run the experiments.

4 Results and Analysis

The experiments were run and the output errors as MSEs were collected for each
of five different training spans, over four different models, with two different data
sets in both training and testing phases. These errors were then summarized, and the
summary of average errors is presented in Table 2. From this table, it can be seen that
the three machine learning models (MLP, SVR, and XGBoost) perform better than
the LR model across various training periods and both data set Alpha and Beta. The
LR model did well on certain instances but performed poorly across various training
lengths and data sets. Especially since the scenarios dealt with forecasting for 12
monthly periods, the errors added up quickly to produce very large numbers. In the
table, M is shown where the error is very big. It can be seen that XGBoost is the most
accurate model among the four models. It performed well across different training
periods and data sets (Alpha and Beta). Currently, AEMO uses linear regression
models for the SME sector’s long-term forecasting model. The results reveal that
the XGBoost model is outstanding for this kind of task among the compared models
across various data types and also for multi-period forecasting.
As shown in Table 2, for the testing phase with the Alpha data set, the average
MSEs for SVR and MLP are at least two and six times higher than the same from
XGBoost, respectively. With the Beta data set, they are at least 1.7 times higher.
However, the MSEs vary a lot with the length of training data sets for the three
better-performing methods (MLP, SVR, and XGBoost). To visualize their relative

Table 2 Summary results from the models with five different training periods

            Test mean MSE     Test mean MSE    Train mean MSE    Train mean MSE
            range for Alpha   range for Beta   range for Alpha   range for Beta
  MLP       18.6–38.56        5.10–26.26       1.35–7.57         0.55–4.98
  SVR       6.46–21.80        5.10–13.84       0.37–1.81         0.0–0.74
  XGBoost   3.03–3.22         3.04–3.05        0.0–0.0           0.0–0.0
  LR        13.9–8.34E + 14   6.64–168,885     0.0–0.0           0.0–0.64
Fig. 2 Test MSE for various training lengths (1, 2, 3, 6, and 10 years) with data
set Alpha, comparing MLP, SVR, and XGBoost

Fig. 3 Test MSE for various training lengths (1, 2, 3, 6, and 10 years) with data
set Beta, comparing MLP, SVR, and XGBoost

performances, the training length-wise MSEs are plotted in Fig. 2 for the Alpha data
set and in Fig. 3 for the Beta data set. From these two figures, it is clear that MLP is
worse than SVR for lower and higher training lengths and they have similar perfor-
mances for two training lengths in the middle. However, XGBoost is consistently
better than both MLP and SVR for all training lengths. It is therefore reasonable to
conclude that XGBoost is the best of the four models investigated in this study for
the data sets used.

5 Conclusions

This research aimed to find suitable models for making longer-term and multi-period
forecasting for energy requirements in Australia based on macroeconomic input
variables. This paper presented the considered models along with the results obtained
and the results of a traditional econometric model, with two different feature (input
variable) sets and five different training window scenarios. In each case, a forecast
was made for 12 periods consisting of 12 months of data.
Out of the 4 models (MLP, SVR, XGBoost, and LR) considered in this study,
the 3 AI-based models (MLP, SVR, and XGBoost) produced reasonably accurate
results across all scenarios. LR-based model sometimes did well, but other times
performed poorly when data variability was high. XGBoost model proved to be the
most accurate and robust among these models as it performed best across different
training lengths and for both source data sets (Alpha and Beta). In the testing phase,
the mean MSE ranges obtained from XGBoost for the Alpha and Beta data sets are
3.03–3.22 and 3.04–3.05, respectively. These ranges are a few times higher for MLP
and SVR. In the training phase, XGBoost provided zero mean MSE values for both
data sets, a significant result. Note that although LR shows good mean MSE values
in its training phase, its testing-phase errors are unusually high (e.g., 13.9 to
8.34E + 14 for Alpha) due to data fluctuations.
The task of the models was to produce 12 months of forecast (test) based on
training periods of various lengths. It appeared that shorter training periods improved
performance for the 3 AI-based models. This can be explained by the fact that the
most recent data might be the best predictor for future performance.
Further work can be carried out for investigating the suitability of these models
or additional AI-based models for making longer-term forecasting with a forecast
period of 1–5 years. Whether the hybridization of models can lead to better long-term
forecasting can also be investigated.

References

1. Australian Energy Market Operator, Electricity demand forecasting methodology information


paper (2020). [Online]. https://aemo.com.au/-/media/files/electricity/nem/planning_and_for
ecasting/inputs-assumptions-methodologies/2020/2020-electricity-demand-forecasting-met
hodology-information-paper.pdf?la=en
2. A. Parmezan, V. Souza, G. Batista, Evaluation of statistical and machine learning models for
time series prediction: identifying the state-of-the-art and the best conditions for the use of
each model. Inf. Sci. 484, 302–337 (2019)
3. J. Le, The 10 statistical techniques data scientists need to master. [Online]. https://medium.
com/cracking-the-data-science-interview/the-10-statistical-techniques-data-scientists-need-
to-master-1ef6dbd531f7
4. J. Saboia, Autoregressive integrated moving average (ARIMA) models for birth forecasting.
J. Am. Stat. Assoc. 72(358), 264–270 (1977)
5. J. Lee, Time Series Forecasting with Python (Machine Learning Mastery, 2019)
6. D.C. Park, M.A. El-Sharkawi, R.J. Marks, L.E. Atlas, M.J. Damborg, Electric load forecasting
using an artificial neural network. IEEE Trans. Power Syst. 6(2), 442–449 (1991). https://doi.
org/10.1109/59.76685
7. J. Mahanta, Introduction to neural networks, advantages, and applications. [Online]. https://
towardsdatascience.com/introduction-to-neural-networks-advantages-and-applications-968
51bd1a207
8. J. Brownlee, When to use MLP, CNN and RNN neural networks. [Online]. https://machinele
arningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/
9. G.R. Kanagachidambaresan, A. Ruwali, D. Banerjee, K.B. Prakash, Recurrent neural


network, in Programming with TensorFlow, ed. by K.B. Prakash, G.R. Kanagachidambaresan.
EAI/Springer Innovations in Communication and Computing (Springer, Cham, 2021). https://
doi.org/10.1007/978-3-030-57077-4_7
10. P. Pedamkar, Support vector regression. [Online]. https://www.educba.com/support-vector-reg
ression/
11. Q. Lu, Z. Zhang, A genetic algorithm regression model for the mid-long term of China’s
electricity consumption, in Proceedings of the 31st Chinese Control and Decision Conference,
CCDC 2019, vol. 8833197 (2019), pp. 4776–4782
12. M. Mansoor, F. Grimaccia, S. Leva, M. Mussetta, Comparison of echo state network and feed-
forward neural networks in electrical load forecasting for demand response programs. Math.
Comput. Simul. 184, 282–293 (2021)
13. A. Masoumi, F. Jabari, S. Ghassem Zadeh, B. Mohammadi-Ivatloo, Long-term load forecasting
approach using dynamic feed-forward back-propagation artificial neural network, in Studies in
Systems, Decision, and Control, vol. 262 (2020), pp. 233–257
14. R. Kumar, P. Dixit, Daily peak load forecast using artificial neural network. Int. J. Electr.
Comput. Eng. (IJECE) 9(4), 2256–2263 (2019)
15. M. Aydinalp, V. Ismet Ugursal, A. Fung, Modeling of the appliance, lighting, and space-cooling
energy consumptions in the residential sector using neural networks. Appl. Energy 71, 87–110
(2002)
16. C. Yuan, D. Niu, C. Li, L. Sun, L. Xu, Electricity consumption prediction model based on
Bayesian regularized BP neural network, in Advances in Intelligent Systems and Computing
(2019), pp. 528–535. https://doi.org/10.1007/978-3-030-15235-2_76
17. J. Kuo, J.C. Principle, B. de Vries, Prediction of chaotic time series using recurrent neural
networks, in Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop
(1992), pp. 436–443. https://doi.org/10.1109/NNSP.1992.253669
18. B. Ribeiro, N. Lopes, Deep belief networks for financial prediction, in Neural Information
Processing. ICONIP 2011, ed. by B.L. Lu, L. Zhang, J. Kwok. Lecture Notes in Computer
Science, vol. 7064 (Springer, Berlin, Heidelberg, 2011). https://doi.org/10.1007/978-3-642-
24965-5_86
19. R.A. Khan, R. Dewangan, S.C. Srivastava, Short term load forecasting using SVM models, in
8th IEEE Power India International Conference, PIICON 2018, 8704366
20. D. Shukla, S. Jaiswal, V.P. Babu, S. Singh, Near real time load forecasting in power system,
in 2020 21st National Power Systems Conference (NPSC) (2020), pp. 1–6. https://doi.org/10.
1109/NPSC49263.2020.9331953
21. W. Wang et al., Load forecasting method based on SVR under electricity market reform. IOP
Conf. Ser. Earth Environ. Sci. 467, 012201 (2020)
22. U. Thissen, R. Van Brakel, A.P. Weijer, W.J. Melsen, L.M.C. Buydens, Using support vector
machines for time series prediction. Chemom. Intell. Lab. Syst. 69, 35–49 (2003)
23. M. Du, Y. Zhao, C. Liu, Z. Zhu, Lifecycle cost forecast of 110 kV power transformers based on
support vector regression and gray wolf optimization. Alex. Eng. J. 60(6), 5393–5399 (2021).
ISSN 1110-0168. https://doi.org/10.1016/j.aej.2021.04.019
24. J. Brownlee, A gentle introduction to XGBoost for applied machine learning. [Online]. https://
machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
25. W. Wang et al., IOP Conf. Ser. Earth Environ. Sci. 467, 012201 (2020)
26. M. Fan et al., Short-term load forecasting for distribution network using decomposition with
ensemble prediction, in 2019 Chinese Automation Congress (CAC) (IEEE, 2019), p. 152
27. U. Wang, S. Sun, X. Chen, X. Zeng, Y. Kong, J. Chen, Y. Guo, T. Wang, Short-term load
forecasting of industrial customers based on SVMD and XGBoost. Int. J. Electr. Power Energy
Syst. 129, 106830 (2021). https://doi.org/10.1016/j.ijepes.2021.106830
28. I. Ghalehkhondabi, E. Ardjmand, W. Young II, An overview of energy demand forecasting


methods. Energy Syst. 8, 411–447 (2017). https://doi.org/10.1007/s12667-016-0203-y
29. Python API reference, https://xgboost.readthedocs.io/en/latest/python/python_api.html
30. World development indicators. [Online]. World Development Indicators | DataBank (world-
bank.org)
31. Federal Reserve Economic Data (Federal Reserve Bank, St. Louis). [Online] https://fred.stloui
sfed.org
Sentimental Analysis of Streaming
COVID-19 Twitter Data on Spark-Based
Framework

S. P. Preethi and Radha Senthilkumar

Abstract Social media has become an inevitable part of humans’ daily life, enabling
people to express their opinions, sentiments, and ideologies.
pandemic when the whole world went into a lockdown situation, Twitter served as an
outlet for people to express their emotions. This work proposes streaming the real-
time Twitter data on COVID-19 using Twitter API and handling the streaming big
data using the Apache Spark framework. Here the fake account detection to detect the
non-legit accounts present in the streamed data was accomplished by the proposed
feature-based algorithm which attain overall accuracy of 98.74%. This constructed
fake account detection model filters out the genuine accounts from the API streamed
Twitter data. Sentimental analysis on these genuine Twitter accounts is performed
by modifying the state-of-the-art Natural Language Processing (NLP) algorithm called
Bidirectional Encoder Representations from Transformers (BERT). The proposed
method achieved an 88.30% classification accuracy rate by concatenating the
pooled NN layer with the influential feature.

Keywords COVID-19 · Fake account detection · Natural language processing ·


Sentimental analysis · BERT algorithm

1 Introduction

The outbreak of Coronavirus disease 2019 (COVID-19) caused due to the virus
named SARS-Cov-2 has taken the form of a pandemic causing humongous loss
of life and economy all around the world. To effectively control the spread of the
virus, the government of every country has enforced preventive measures such as
lockdown, social distancing, etc. But this confinement period has resulted in severe
psychological issues due to boredom and loneliness [1]. To fill this void people all

S. P. Preethi (B) · R. Senthilkumar


Department of Information Technology, Madras Institute of Technology, Chennai, Tamil Nadu
600044, India
R. Senthilkumar
e-mail: radhasenthil@annauniv.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 565
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_43
566 S. P. Preethi and R. Senthilkumar

around the world took to social media networks such as Twitter, Facebook, and Instagram as
a platform to express their opinions and sentiments. Therefore public sentiment can
be captured by monitoring the text content generated by these social media platform
users. But before analyzing these tweets for sentiment, the fake or non-legit accounts
have to be removed from the Twitter API streamed data. Fake accounts are created
by malicious social bots and they expand their social communication by creating
multiple social accounts. They have to be differentiated from legit accounts, both
because they spread misinformation among users and for reliable sentimental
analysis of public opinion. Then the semantics of the tweets have to be derived
to arrive at the opinions of the Twitter account users. Sentimental analysis uses
NLP extensively to categorize human emotions. Real-time sentimental analysis
is a sector of NLP that is not addressed enough [2]. It requires a powerful
big-data tool to analyze the incoming streaming data. The main setback in
streaming-data sentimental analysis is that the genuineness of the data that is
streamed is low [3]. The fake data streamed reduces the efficiency of the model
constructed to analyze sentiments [4].

2 Related Work

Fake accounts are created by malicious social bots and they expand their social
communication by creating multiple social accounts or by disguising themselves as
a follower of the account [5, 6]. To identify such malicious accounts from legitimate
account URL features such as URL redirection, frequency of shared URLs, and
spam content in the URL of the tweets can be used [7]. Twitter streams consist of
both high-quality URLs and low-quality URLs as Twitter users tweet about
COVID-19. Lisa Singh et al. in their experiment found that misinformation from
low-quality URLs spread by bot accounts is shared more than high-quality health
information.
Rout R. et al. [8] proposed a learning automata-based malicious social bot detection
(LA-MSBD) algorithm to identify the legitimate user. Here the direct trust is derived
from Bayes’ theorem, and the indirect trust is derived from the Dempster–Shafer
theory (DST) to determine the trustworthiness of each Twitter user accurately.
Bot accounts are dangerous because they try to manipulate the content and spread
misinformation which can greatly affect public opinion and misguide people [9].
Sharma et al. [10] in their work identified unreliable and misleading content based
on fact-checking sources and examined the narratives promoted in misinformation
tweets, along with the distribution of engagements with these tweets spread by bots
[11]. But these bots are not very easy to be detected as they actively try to avoid detec-
tion. Phillip Efthimiou et al. [12] proposed a novel approach to ensemble features
such as followers-to-friends ratio, message variability, length of user names,
reposting rate, temporal patterns, and sentiment expression for bot detection. Kılınç
[13] proposed a method that checked the confidence of the data generated from
Twitter and then analyzed the sentiment of the confident data. Here both fake account
detection and sentimental analysis are done using the Naive Bayes algorithm.
Sentimental Analysis of Streaming COVID-19 Twitter Data … 567

Walt et al. [14] researched fake accounts created by humans. Their work
considered the friend-to-follower ratio engineered from the friend and follower
counts and concluded that features engineered to detect fake bot accounts cannot
be used as efficiently on fake human accounts. Xiao et al. [15] in their
work proposed a methodology to detect a cluster of malicious bot accounts by using
a supervised machine learning pipeline. This method suggests using main features
such as content generated by the users.
Mukherjee et al. [16] proposed fine-grained sentimental classification to clas-
sify multiple human emotions toward pandemics using the RoBERTa classifier. The
proposed methodology was trained and tested on two benchmark datasets, AIT (non-
COVID) and SenWave (COVID) [17]. The RoBERTa method resulted in good
Jaccard index, F1-micro, and F1-macro scores. Chriqui and Yahav [18] in their work
proposed a transformer-based model named HeBERT. Their constructed tool
uses HeBERT to extract emotions from Hebrew UGC and gives a very efficient score
even for emotion detection in the English language.

3 System Architecture

The proposed methodology involves streaming real-time data using the Twitter plat-
form through Twitter API. The Twitter developer account is created from which
the consumer and access tokens are obtained. These tokens are used to stream the
Twitter live data. This live-streamed data is then sentimentally analyzed using the
feature-combined BERT algorithm. But the major concern is to preserve the reli-
ability of the streamed Twitter data. Thus the genuineness of the Twitter accounts is
checked using the feature-based algorithm; only the legit accounts are kept, and
the fake accounts are dropped. The feature-based algorithm is a rule set constructed
using ensemble-engineered features-based conditions derived from the metadata of
the real-time streaming data. The feature is engineered by analyzing the profile-based
metadata such as name, description, screen name, status, and behavior-based features
such as friends count, the following count, statuses count, and listed count. Then the
sentimental analysis is done for the legit accounts using the fine-tuned BERT model. In
this proposed model, the influential feature is concatenated as a layer along with the
pooled neural network of the BERT model. The analyzed sentiments are then visually
represented to get an overall idea of people’s sentiments. The system architecture is
shown in Fig. 1.

3.1 Fine-Tuned BERT Architecture

The Bidirectional Encoder Representations from Transformers (BERT) algorithm is
a stack of 12 encoders. Each classification token [CLS] goes into each layer, which
applies self-attention and passes the result through the feed-forward network. This
Fig. 1 Overall system architecture

work proposes the concatenation of influential features as a layer along with the
BERT model. Combining the pooled NN layer with the influential feature of higher
correlation and adding this layer gives better classification accuracy compared to
the state-of-the-art BERT algorithm used for sentimental analysis. The modified BERT
architecture is given in Fig. 2.

4 Methodology

The proposed methodologies involved in the selection and engineering of features,
along with the technique used to sentimentally analyze the tweets, are explained in
this section.

4.1 Feature Selection and Engineering

Features on Twitter derived from the meta-data of the user accounts can be used
to analyze fake accounts run by bots on the Twitter platform. This work proposes
constructing a feature-based algorithm that ensembles engineered features. A bag-
of-words is constructed containing the words most commonly used by bots. The
constructed bag-of-words is checked against features such as name, screen_name,
description, and statuses of the streamed Twitter accounts. The accounts are checked
for whether they are verified by Twitter. A feature is engineered by analyzing fields
such as friends count, following count, statuses count, and listed count to compute
the friends-to-following ratio, because bots follow a maximum number of people
while their friend count is much lower compared to legitimate accounts. The data is
Fig. 2 Fine-tuned Bert architecture

analyzed for features such as followers_retweet and status frequency, and a
threshold is set for determining bots, which ensures that the account is legit.
Algorithm
Procedure: Rule-set construction - feature engineering
Input: meta-data features of the Twitter account
Output: classified legit and non-legit accounts
Start
Read Source ← training_data_bot.csv
Construct a bag of words and check whether the name,
screen_name, description, and statuses of the account
contain spam words:
name = contains("bag of words") == true
description = contains("bag of words") == true
screen_name = contains("bag of words") == true
statuses = contains("bag of words") == true
Check whether the account is verified, a location is
specified, and the status count and listed count are
reasonable:
condition = (verified_df.verified == TRUE)
condition = (location_df.location == TRUE)
condition = (followers_count <= t1) & (statuses_count > t2)
condition = (listed_count_df.listed_count > t3)
Allocate the remaining accounts to the legit accounts.
Classify the accounts into bots (1) or non_bots (0) and
export the classified data frame as the CSV file
legit_acc.csv.
End

570 S. P. Preethi and R. Senthilkumar
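The rule set can be sketched in Python; the spam bag-of-words and the thresholds t1–t3 below are illustrative assumptions, not the paper's values:

```python
# Illustrative sketch of the rule-set filter. SPAM_WORDS and the thresholds
# T1-T3 are assumptions for demonstration, not the paper's actual values.
SPAM_WORDS = {"free", "follow", "win", "click"}
T1, T2, T3 = 10, 10_000, 1_000  # followers, statuses, listed-count thresholds

def contains_spam(text):
    # True if any bag-of-words term appears in a meta-data text field
    return any(w in SPAM_WORDS for w in str(text).lower().split())

def is_bot(acct):
    # acct: dict of meta-data features for one Twitter account
    spam_hit = any(contains_spam(acct.get(f, ""))
                   for f in ("name", "screen_name", "description", "statuses"))
    # bots tend to follow many accounts while keeping few followers
    ratio_rule = (acct["followers_count"] <= T1
                  and acct["statuses_count"] > T2)
    # signals of a legitimate account: verified, or a reasonable listed count
    legit_signals = acct.get("verified", False) or acct["listed_count"] > T3
    return int((spam_hit or ratio_rule) and not legit_signals)
```

Accounts flagged 0 would then be exported as the legitimate set for the downstream sentiment analysis.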

4.2 Preprocessing-Label Encoding and Tokenization

As the proposed work performs multi-class classification, the best encoding method is one-hot encoding. One-hot encoding creates dummy variables in place of the categorical variables: every unique value in the categorical feature is added as a separate feature. Tokenization breaks the input text into separate tokens from the trained vocabulary of 30,522 words. For classification tasks using the BERT architecture, the whole input sentence is converted into a single vector. The '[CLS]' token is added before the first token, and the '[SEP]' token is added at the end of the sentence to mark its end and the beginning of the next sentence; the tokens are then converted to their unique IDs according to the trained BERT vocabulary.
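The [CLS]/[SEP] wrapping, ID conversion, and one-hot encoding can be sketched with a toy vocabulary (the words and IDs below are made-up stand-ins for the real 30,522-entry BERT vocabulary):

```python
# Toy sketch of BERT-style tokenization; the vocabulary and IDs are
# illustrative stand-ins, not the real BERT WordPiece vocabulary.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
         "covid": 1001, "vaccine": 1002, "is": 1003, "here": 1004}

def encode(sentence, max_len=8):
    # Wrap the sentence in [CLS] ... [SEP], map tokens to vocabulary IDs,
    # and pad to a fixed length as BERT expects.
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))

def one_hot(label, classes):
    # One-hot encode a categorical label (e.g., the sentiment class)
    return [1 if c == label else 0 for c in classes]
```

Unknown words fall back to the [UNK] ID, mirroring how out-of-vocabulary tokens are handled.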

4.3 Concatenation of Influential Feature

BERT can be used for a large variety of natural language processing tasks by fine-tuning the model. Sentiment analysis is a classification task, so it can be handled similarly to next-sentence classification: a classification layer is added on top of the transformer output for the [CLS] token. Along with this fine-tuning, this work proposes concatenating influential features as a layer alongside the BERT model. The input tweets are first embedded into vectors and processed by the neural network. Every output vector of the sequence has the same size H and corresponds to an input token. BERT has two main functionalities, masked language modeling (MLM) and next sentence prediction (NSP). Here a classification layer is added on top of the encoder output. The output is then multiplied by an embedding matrix to transform each word into vocabulary dimensions, and softmax is used to calculate the probability of each word for the mask. The context of the tweet is perceived by segment embedding, where the sentence number is encoded into the vectors. Finally, a position embedding denoting the position of the word within the sentence is added to each token. The segment and position embeddings make up the NSP. Combining the pooled NN layer with the most influential feature, i.e., friend_count, gives better classification accuracy than state-of-the-art algorithms.
Algorithm
Procedure: Fine-tuning the BERT model
Input: three input vectors - token embedding, segment
embedding, and position embedding
Output: the classified sentiment of the tweets
Start
Read Source ← covid19_tweets.csv (Kaggle COVID-19 tweets
dataset)
corr_feature = df.corr(method='spearman')
Assign sampleDf ← read source dataset
Split the dataset into test and train data along with the
most correlated feature (usr_frnd).
encoder ← LabelEncoder()
encoder.fit(target_values)
Save the encoding map named twitter_classes.npy for later
use.
tokenizer ← tokenization.FullTokenizer(vocab_file,
do_lower_case)
Convert the tokens to token IDs using the function
encode_names(n, tokenizer).
Encode the influential feature:
featureEncoder = LabelEncoder()
featureEncoder.fit(usr_frnd)
Save the encoding of the influential feature as
twitter_wkd.npy for future use.
Pre-process the input tweets with a function
bert_encoder().
Build a fine-tuned BERT model using the inputs defined in
the pre-processing and save the model as
'twitter_BERT_usr_frnd'.
Train and test the fine-tuned BERT model.
Using the saved encoder and model, classify the
Twitter live-streamed data.
End
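The concatenation of the pooled output with the influential feature can be sketched in NumPy (sizes, the feature scaling, and the random weights are illustrative assumptions; BERT-base actually uses H = 768):

```python
import numpy as np

rng = np.random.default_rng(0)
H, NUM_CLASSES = 8, 3  # illustrative sizes; BERT-base uses H = 768

def classify(pooled, friend_feature, W, b):
    # Concatenate the pooled [CLS] vector with the influential feature,
    # then apply a dense layer and softmax over the sentiment classes.
    x = np.concatenate([pooled, friend_feature])
    logits = W @ x + b
    exp = np.exp(logits - logits.max())  # stable softmax
    return exp / exp.sum()

pooled = rng.standard_normal(H)                 # stand-in pooled BERT output
friend_feature = np.array([0.7])                # scaled usr_frnd value (assumed)
W = rng.standard_normal((NUM_CLASSES, H + 1))   # untrained example weights
b = np.zeros(NUM_CLASSES)
probs = classify(pooled, friend_feature, W, b)
```

The dense layer sees H + 1 inputs, which is exactly how the extra feature widens the classification head.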

5 Experimental Setup

The environments in which the implementation is done and the polarity-based labeling of the dataset are explained in this section.

5.1 Dataset Description

The social honeypot dataset was collected on the Twitter platform from 30th December 2009 to 2nd August 2010 [19]. The covid19_tweets.csv dataset was collected on Kaggle using a Python script and the Twitter API. These tweets are labeled using a lexicon-based approach with the Valence Aware Dictionary and Sentiment Reasoner (VADER) [20]. Polarity scores above 0.05 are labeled as positive, scores below −0.05 are labeled as negative, and scores between −0.05 and 0.05 are labeled as neutral. The datasets are used to train the constructed model, and real-time test data for validation is streamed through the Twitter developer API using the keyword #COVID.
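A labeling function following the conventional VADER compound-score thresholds of ±0.05 can be sketched as:

```python
def vader_label(compound):
    # Map a VADER compound polarity score to a sentiment label using the
    # conventional +/-0.05 thresholds.
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

The compound score itself would come from a lexicon-based analyzer such as VADER; only the thresholding step is shown here.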

5.2 Implementation Specification Environment

The rule-set construction for feature-based classification of fake Twitter accounts is done using Anaconda. The model is trained in Google Colab since a GPU is required, and Google Colab is also used to implement the sentiment analysis on the legitimate accounts filtered from the streamed data. A processor above 2.3 GHz with 2 cores, a GPU above 1300 CUDA cores with 4 GB memory, and the Windows 8.1 operating system are required for this implementation.

6 Results

A rule-set-based filter of genuine accounts based on the analyzed features is constructed. The feature-based algorithm achieved a higher overall accuracy of 98.74% compared to the decision tree algorithm, binomial Naive Bayes, random forest, single-layer perceptron, and multilayer perceptron, as given in Table 1. The proposed fine-tuned BERT-based system identifies people's emotions from the text of the tweets and attains an overall validation accuracy of about 88.30%. The ROC curve plots the true positive rate against the false positive

Table 1 Comparison of methodologies in fake account detection

S. No  Methodology               Overall accuracy (%)
1      Decision tree algorithm   86.25
2      Binomial Naive Bayes      68.75
3      Random forest algorithm   85
4      Single-layer perceptron   70
5      Multilayer perceptron     76
6      Feature-based algorithm   98.74

Fig. 3 ROC of feature-based algorithm

rate, as shown in Fig. 3. The figure illustrates that the true positive rate equals 0.9874 on a scale of 0–1.

6.1 Overall Accuracy and Loss of Fine-Tuned BERT

The overall accuracy of the model is evaluated using this metric. Accuracy is the fraction of right predictions made by the constructed model. The constructed feature-combined BERT algorithm achieves 88.30% validation accuracy. The loss, also known as softmax loss, is mostly used in multi-class classification; this kind of loss trains a deep NN to output a probability over the classes. Graphs of overall accuracy and loss are given in Figs. 4 and 5, respectively.
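The accuracy and softmax-loss metrics described here can be sketched in NumPy (an illustrative implementation, not the paper's training code):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Softmax loss for multi-class classification: average negative
    # log-probability that the model assigns to the true class.
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def accuracy(logits, labels):
    # Fraction of right predictions made by the model
    return (logits.argmax(axis=1) == labels).mean()
```

For uniform logits over three classes the loss is log 3, the expected value for a three-way guess.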

Fig. 4 Training and validation accuracy

Fig. 5 Training and validation loss

7 Conclusion

The main contribution of this paper is the sentiment analysis of public emotion on Twitter during the COVID-19 pandemic using a fine-tuned BERT algorithm. For reliable results, a model filtering genuine accounts based on the analyzed features is constructed. The feature-based algorithm achieved a higher overall accuracy of 98.74% compared to the decision tree algorithm with 86.25%, Naive Bayes with 68.75%, random forest with 85%, single-layer perceptron with 70%, and multilayer perceptron with 76%. The tweets of the legitimate accounts are then sentimentally analyzed for public emotion using the fine-tuned BERT algorithm; the fine-tuned feature-combined BERT algorithm achieves overall training and validation accuracy of about 90% and 88.30%, respectively. Future work may include automating the fake account detection system by upgrading the rule set to be compatible with any dataset. Human sentiment is complicated and cannot merely be classified into positive, negative, and neutral emotions; this work can therefore be extended to classify fine-grained multi-class emotions rather than the three sentiments expressed.

References

1. A. Badawy, E. Ferrara, K. Lerman, Analyzing the digital traces of political manipulation: the 2016 Russian interference Twitter campaign. arXiv:1802.04291 (2018)
2. R. Liu, Y. Shi, C. Ji, M. Jia, A survey of sentiment analysis based on transfer learning. IEEE
Access 7, 85401–85412 (2019)
3. W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5, 1093–1113 (2014)
4. M. Cinelli, W. Quattrociocchi, A. Galeazzi, C.M. Valensise, E. Brugnoli, A. Lucia Schmidt, P. Zola, F. Zollo, A. Scala, The COVID-19 social media infodemic. arXiv:2003.05004 (2020)

5. G.K. Shahi, A. Dirkson, T.A. Majchrzak, An exploratory study of COVID-19 misinformation on Twitter. arXiv:2005.05710v2 (2020)
6. J. Knauth, Language-agnostic Twitter bot detection, in Proceedings of Recent Advances in Natural Language Processing, Varna, Bulgaria, 2–4 Sep. 2019, pp. 550–558
7. L. Singh, L. Bode, C. Budak, K. Kawintiranon, C. Padden, E. Vraga, Understanding high- and
low-quality URL Sharing on COVID-19 Twitter streams. J. Comput. Soc. Sci. 3, 343–366
(2020)
8. R.R. Rout, G. Lingam, D.V.L.N. Somayajulu, Detection of malicious social bots using learning
automata with URL features in Twitter network. IEEE Trans. Comput. Soc. Syst. 7, 1004–1018
(2020)
9. A. Al-Rawi, V. Shukla, Bots as active news promoters: a digital analysis of COVID-19 tweets. Information 11, 461 (2020)
10. K. Sharma, S. Seo, C. Meng, S. Rambhatla, Y. Liu, COVID-19 on social media: analyzing
misinformation in Twitter conversations, ArXiv (2020). arXiv:2003.12309v4
11. A. Bessi, E. Ferrara, Social bots distort the 2016 us presidential election online discussion.
First Monday 21, 1–13 (2016)
12. P.G. Efthimion, S. Payne, N. Proferes, Supervised machine learning bot detection techniques to identify social Twitter bots. SMU Data Sci. Rev. 1, 1–70 (2018)
13. D. Kılınç, A Spark-based big data analysis framework for real-time sentiment prediction on streaming data. Softw. Pract. Exp. 49, 1352–1364 (2019)
14. E. Van Der Walt, J. Eloff, Using machine learning to detect fake identities: bots vs humans.
IEEE Access 6, 6540–6549 (2018)
15. C. Xiao, D.M. Freeman, T. Hwa, Detecting clusters of fake accounts in online social networks.
ACM (2015). https://doi.org/10.1145/2808769.2808779
16. R. Mukherjee, S. Poddar, A. Naik, S. Dasgupta, How have we reacted to the COVID-19 pandemic? Analyzing changing Indian emotions through the lens of Twitter. arXiv:2008.09035v1 (2020)
17. E. Chen, K. Lerman, E. Ferrara, Tracking social media discourse about the covid-19 pandemic:
development of a public coronavirus Twitter data set. JMIR Publ. Health Surveill. 6, e19273
(2020)
18. A. Chriqui, I. Yahav, HeBERT & HebEMO: a Hebrew BERT model and a tool for polarity
analysis and emotion recognition (2021). arXiv:2102.01909v1
19. K. Lee, B.D. Eoff, J. Caverlee, Seven months with the devils: a long-term study of content polluters on Twitter, in Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), Barcelona (2011)
20. G. Preda, COVID19 tweets (2020). Available at: https://www.kaggle.com/gpreda/covid19-tweets. Accessed 21 May 2021
Efficient Approximate Multipliers
for Neural Network Applications

Zainab Aizaz , Kavita Khare, and Aizaz Tirmizi

Abstract In this era of machine learning, computationally efficient multiplier circuits are essential as most of the calculations on a neural network-based processor
involve the multiplication process. The multiplication operation is a major bottleneck in a neural network accelerator. Approximate computing is a recent paradigm in which exact circuits are deliberately made inaccurate to reduce circuit complexity. The loss in accuracy is tolerated by neural networks because of their inherent error tolerance, allowing reduced area, delay and energy consumption. In this paper, a neural network is designed on an FPGA using an approximate Booth multiplier in place of exact multiplication. The proposed multipliers are analyzed on handwritten digit recognition using neural networks. A novel 4-2 compressor is designed using the concept of approximate computing for use in the Booth multiplier circuit. The k-map of the exact compressor is modified to design the approximate compressor, and a comparatively simpler circuit is obtained. The synthesis result of the proposed approximate multipliers shows a 17.69% energy reduction with a 0.9% loss in accuracy compared to exact multipliers. Upon comparison with state-of-the-art neural networks, the proposed neural network provides a high classification accuracy of 96.23%.

Keywords Approximate computing · Booth multipliers · Compressors · Neural networks

Z. Aizaz (B) · K. Khare


Department of Electronics and Communication Engineering, Maulana Azad National Institute of
Technology, MANIT, Mata Mandir, Link Road 3, Bhopal 462003, M.P., India
e-mail: aizazzainab@gmail.com
K. Khare
e-mail: kavita_khare1@yahoo.co.in
A. Tirmizi
Department of Electronics and Communication Engineering, Rabindranath Tagore University,
Bhojpur, Chiklod Rd, Bhopal 464993, M.P., India
e-mail: aizaztirmizi@gmail.com

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 577
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_44
578 Z. Aizaz et al.

1 Introduction
Neural networks can provide useful conclusions from large and imprecise datasets. Almost all modern computing algorithms lag behind the computing abilities of neural networks. Neural networks, when implemented in hardware, have challenging requirements because of their computationally intensive nature. Whether it
is pattern recognition onboard a satellite, disease prediction using image datasets, or computer vision in a robot, all of them require neural network implementation on dedicated hardware such as Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs). The hardware requirement and
large energy consumption of neural networks can be curtailed by using an emerging paradigm called approximate computing. Approximate computing is a technique for designing compact and low-energy digital logic circuits by introducing some inaccuracy so as to minimize circuit complexity [1]. Adders, multipliers and counters are the most frequently used arithmetic circuits on a processor, and their efficient design is crucial for the hardware efficiency of an application-based circuit. Multipliers are undoubtedly the most important primitives for image processing and artificial neural network (ANN) applications, predominating the area, delay and overall performance of their hardware implementations. The partial products of a multiplier
can be generated in two ways, one using AND gates and the other using Booth encoding. Booth-encoded partial products are generally preferred as the number of rows is reduced to half. For example, if the size of a multiplier is n bits, the number of rows of partial products is also n for a non-Booth multiplier but n/2 + 1 for a Booth multiplier. Booth multiplier circuits [2, 3] have proven to be better than other multipliers in speed and energy/power efficiency. Compressor circuits play a vital role in the partial product accumulation stage of the multiplier circuit using tree-based schemes such as Wallace and Dadda [4]. For the addition of two or three bits of partial products, half and full adders are used, but when the number of partial product bits is more than three, a compressor is used to add them. An n-m compressor has n inputs and m outputs. Thus, a full adder is also a 3-2 compressor, which adds three input bits and provides two output bits. Most efficient state-of-the-art multipliers have
used 5-3 compressors [5, 6]. A simple 5-3 compressor consists of two full adders with three outputs, i.e., Sum, Carry1 and Carry2. The weight of Carry1 and Carry2 is double that of Sum. If the input carry and Carry2 pins are not used, a 4-2 compressor can be designed, which allows a significant saving in area and power consumption. However, this requires the third carry bit Carry2 to be modified from 1 to 0 for the input combination 1111. This modification induces error in the output
of the compressor. Since some outputs of a 4-2 compressor are wrong, its design is approximate rather than accurate. A number of previously proposed designs used different efficient approximate 4-2 compressors [7–11]. Dual-mode 4-2 compressors are presented in [12], which can change their mode of operation between exact and approximate. In [13], power-efficient 4-2 compressors consisting of just one NAND gate and three NOR gates are presented. In [14], a wide variety of approximate 4-2 compressors are extensively reviewed with their hardware and error characteristics. Neural networks are inherently error tolerant, and therefore implementing neural
Efficient Approximate Multipliers for Neural Network Applications 579

applications using an approximate multiplier gives very little loss in accuracy while achieving large gains in energy and timing performance [15]. In this paper, we have designed an FPGA-based neural network using approximate multiplication instead of exact multiplication.

2 Proposed Approximate Neural Network Using Novel
Approximate Compressor-Based Booth Multiplier
On an FPGA, the neural network application is executed in two phases, the training
phase and the testing phase. Since the training process is computationally extensive,
it is performed on software, and an already trained neural network is applied in the
hardware circuit. Therefore, power consumption is not of much importance in the context of the training phase. Even though the testing process involves many times fewer computation cycles than the training phase, for a large number of neurons in the network the testing process also becomes energy consuming. It can be observed in Fig. 1 that, for calculating the output of a neuron, inputs and weights
are multiplied and then an activation function is applied at each neuron. The output of each neuron then becomes the input for the next layer, as shown in Fig. 2. The implementation of neural networks in hardware is challenging largely due to extensive multiplications. To address the problem of large energy consumption by exact multipliers, approximate computing can be used. Approximate computing can mitigate the hardware consumption of neural networks, and the error induced by approximate computing is compensated by the inherent error tolerance of neural networks. Approximate memories for the storage of synaptic weights and approximate multipliers for the computations are commonly used for neural networks on circuits. In the proposed work, an approximate multiplier is used instead of an exact multiplier.

Fig. 1 Structure of neuron showing input and weight multiplication


580 Z. Aizaz et al.

Fig. 2 Structure of neural network (input neurons I0–IN connected through weights to hidden neurons and output neurons)

The proposed approximate multiplier is a signed Booth multiplier in which novel approximate 4-2 compressors are used to add the partial products. The Modified National Institute of Standards and Technology (MNIST) benchmark dataset consists of handwritten numbers as 28×28 pixel images. There are 70,000 images in total, of which 60,000 are used for training and 10,000 for testing. The neural network structure used here is composed of 120 neurons with the sigmoid activation function. Sigmoid is used to produce nonlinearity in the output. Let us consider a neuron x in layer y; then i_{x,y} is the activation function applied to the weighted sum of the inputs of neuron x, as shown in Eq. (1).

i_{x,y} = 1 / (1 + e^{-s_{x,y}})    (1)

s_{x,y} = Σ_{j=1}^{N} i_{j,y-1} × w_{jx,y-1}    (2)

N is the number of neurons in the preceding layer, and w_{jx,y-1} is the weight connecting neuron j of layer y−1 to neuron x. The multiplication operation in Eq. (2) requires parallel multipliers on the circuit. Thus, instead of exact multipliers, approximate multipliers can be used to obtain area, power and energy savings in the circuit that performs the neural application.

2.1 Introduction to Booth Multipliers

A Booth multiplier operates in a series of three stages, i.e., partial product generation, accumulation and the final two-row addition, as shown in Fig. 3. Let A be the multiplicand string and B be the multiplier string. The Booth encoder encodes the


Fig. 3 Block diagram of Booth multiplier

Table 1 Modified Booth encoding (b represents LSBs of the multiplier and a represents LSBs of the multiplicand)

b2i+1 b2i b2i−1   Partial product   Pi,j for a2i a2i−1 = 00  01  10  11
0 0 0             0                 0   0   0   0
0 0 1             A                 0   0   1   1
0 1 0             A                 0   0   1   1
0 1 1             2A                0   1   0   1
1 0 0             −2A               1   0   1   0
1 0 1             −A                1   1   0   0
1 1 0             −A                1   1   0   0
1 1 1             0                 0   0   0   0

partial products by utilizing three consecutive bits of the multiplier, b2i+1, b2i and b2i−1. These three bits determine whether the partial product value equals 0, A, 2A, −2A or −A, as shown in Table 1. The last three bits of B and the last two bits of A are used to encode a partial product; B and A are then shifted right by 2 bits and 1 bit, respectively, and the last three bits of B and last two bits of A are used again. Let the sixteen-bit multiplier string be 0000011101011101. A 0 is first appended in the least significant bit (LSB) position, so the three LSBs are 010; shifting right by two bits at a time, the remaining LSB groups are 110, 011, 010, 110, 011, 000, 000. There are eight such groups, which provide the n/2 = 8 rows of the partial product matrix

Fig. 4 Partial product matrix of 8-bit radix-4 Booth multiplier

and one row for the neg bits [2]. The partial product matrix shown in Fig. 4 is for an 8-bit multiplier; therefore, only four groups of three bits are formed, and hence the matrix has 4 + 1 rows. The partial products are added column-wise, and finally all the columns of the partial product matrix are reduced to two rows, which are added using a fast adder.

2.2 Novel Approximate Compressor Based on k-map Alterations

In this paper, a novel approximate-computing-based 4-2 compressor is designed that has simplified circuitry compared to the exact 5-3 compressor. To design the proposed approximate compressor, the principle of approximate computing is applied to the k-map of the addition of 4 bits. The truth table of the exact 4-2 compressor is shown in Table 2; in addition to x1, x2, x3, x4, the carry in the case of all inputs equal to one requires a third output variable. Therefore, it is never possible to design an exact 4-2 compressor, because the addition of four input bits needs three outputs, and hence exact compressors are 5-3 only. Let us consider the k-map of the exact compressor shown in Fig. 5. In the k-map of Fig. 5b, the terms marked in red indicate a modified bit; i.e., if a 0 is marked red, a 1 has been changed to 0, and if a 1 is marked red, a 0 has been changed to 1. The mapping is shown by dashed rectangles. Using the k-map of Fig. 5b, we obtain the approximate 4-2 compressor of Fig. 6b, which consists of just two OR gates, two AND gates and one XNOR gate. It should be remembered that the exact compressor shown in Fig. 6a uses two full adders, and each full adder contains one XOR gate along with two AND gates and one OR gate.
Thus, it is clear that the proposed 4-2 compressor circuit is compact and, therefore, consumes less power and is faster than the exact compressor. Table 2 shows that the error probability of the proposed approximate compressor for the sum output is 8/16, i.e., 0.5, which is large, but the error probability for the carry output is 2/16, i.e., 0.125, which is very small. This is an improvement over existing approximate compressors, as the carry output has double the weight of the sum output. While most existing designs have equal error probability irrespective of the weight of the outputs, concentrating the error in the lower-weight sum output results in comparatively smaller

Table 2 Truth table of proposed approximate compressor (× marks an output modified with respect to the exact compressor)

x1 x2 x3 x4   S  C1 C2   S'  C'   E_sum E_carry
0  0  0  0    0  0  0    1×  0    1     0
0  0  0  1    1  0  0    0×  0    1     0
0  0  1  0    1  0  0    0×  0    1     0
0  0  1  1    0  1  0    1×  0×   1     1
0  1  0  0    1  0  0    1   0    0     0
0  1  0  1    0  1  0    0   1    0     0
0  1  1  0    0  1  0    0   1    0     0
0  1  1  1    1  1  0    1   1    0     0
1  0  0  0    1  0  0    1   0    0     0
1  0  0  1    0  1  0    0   1    0     0
1  0  1  0    0  1  0    0   1    0     0
1  0  1  1    1  1  0    1   1    0     0
1  1  0  0    0  1  0    1×  1    1     0
1  1  0  1    1  1  0    0×  1    1     0
1  1  1  0    1  1  0    0×  1    1     0
1  1  1  1    0  0  1    1×  1×   1     1

Fig. 5 a K-map of exact compressor circuit, b K-map of proposed approximate compressor

loss in accuracy of the proposed compressor. In the proposed approximate multiplier circuit, the proposed novel 4-2 compressor is used; its design is depicted in Fig. 7, in which the exact 5-3 compressors are replaced by approximate 4-2 compressors for the first 16 columns of the matrix of partial products. At all levels, the half adders and full adders used are exact. A parameter m decides the approximation level of the multiplier: m refers to the number of least significant columns to which the approximate compressor is applied. For m = 16, the approximate Booth multiplier circuit is as shown in Fig. 7. In Fig. 7, for the first 16 columns there is no carry propagated to the compressor of the next column, because we have reduced the number of outputs of the compressor from 3 to 2. This reduction in carries lowers the overall delay of the approximate multiplier. Due to the application of the approximate 4-2 compressor, the circuitry of the multiplier is reduced to a great extent, but error is introduced at the same time.
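The error probabilities stated above can be checked exhaustively. The boolean forms below for S′ and C′ are inferred from the approximate columns of Table 2 (the paper specifies the circuit only through Fig. 6b), so they are an assumption consistent with that truth table:

```python
from itertools import product

def approx_compressor(x1, x2, x3, x4):
    # Boolean forms inferred from Table 2 (an assumption, since the paper
    # gives the gate-level circuit only as Fig. 6b):
    #   S' = XNOR(x3, x4)
    #   C' = x1.x2 + (x1 + x2).(x3 + x4)
    s = 1 - (x3 ^ x4)
    c = (x1 & x2) | ((x1 | x2) & (x3 | x4))
    return s, c

sum_errs = carry_errs = 0
for x in product((0, 1), repeat=4):
    total = sum(x)                        # exact 4-input column sum
    s, c = approx_compressor(*x)
    sum_errs += (s != (total & 1))        # weight-1 (sum) mismatch
    carry_errs += (c != (total >> 1))     # weight-2 (carry) mismatch
```

The exhaustive sweep reproduces the stated 8/16 sum-error and 2/16 carry-error probabilities.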

Fig. 6 a Exact 5-3 compressor, b Proposed approximate 4-2 compressor

Fig. 7 Proposed approximate Booth multiplier

3 Error Analysis of Approximate Multiplier


For approximate multipliers, several metrics that assess the accuracy of designs have been proposed [16, 17]; these are referred to as error metrics. Let P be the product from the exact multiplier and PA the product from the approximate multiplier. Then the error distance D and the relative error distance RD are given by D = |P − PA| and RD = |P − PA| / |P|. If max is the maximum value that the multiplier can output, then the normalized mean error distance (NMED) is the mean value of the error distance divided by max. Another metric is the mean relative error distance (MRED), defined as the mean value of RD (for P not equal to 0). The number of effective bits (NoEB) indicates the number of error-free bits and depends on the root-mean-square error (ERMS) of the multiplier, as shown in Eq. (3).

NoEB = 2n − log2 (1 + ERMS) (3)



Table 3 Accuracy comparison of approximate multipliers

Design          MRED (10−3)  NMED (10−5)  NoEB   PRED (%)
Exact radix-4   –            –            –      –
Proposed        3.85         3.89         16.33  99.71
Mom [7]         6.88         1.31         15.89  99.81
Ven [8]         7.30         1.57         15.72  99.13
Ha [11]         5.43         0.65         16.13  99.82
Akb [12]        8.54         1.69         14.87  98.81
Ahm [13]        6.64         1.32         15.89  99.80

The metric PRED is the probability of RD being larger than a predefined percentage, assumed in this paper to be 2%. The multipliers are programmed in MATLAB, and the metrics are calculated using 10 million input vectors. The results of the error analysis can be observed in Table 3. The lower value of MRED is due to the fact that our proposed 4-2 compressor has a lower error probability for the carry output compared to the other approximate compressors. Similarly, the NoEB and PRED values of our design are higher than those of other existing designs.
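The error metrics above can be sketched in NumPy (an illustrative implementation for n-bit unsigned inputs; the paper computes them in MATLAB over 10 million vectors):

```python
import numpy as np

def error_metrics(p_exact, p_approx, n_bits):
    # NMED, MRED and NoEB for an approximate multiplier, following the
    # definitions in the text (illustrative; assumes unsigned n-bit inputs).
    d = np.abs(p_exact - p_approx).astype(float)     # error distance D
    max_out = (2**n_bits - 1) ** 2                   # max product of two inputs
    nmed = d.mean() / max_out
    nz = p_exact != 0                                # RD defined for P != 0
    mred = (d[nz] / np.abs(p_exact[nz])).mean()
    e_rms = np.sqrt((d**2).mean())
    noeb = 2 * n_bits - np.log2(1 + e_rms)           # Eq. (3)
    return nmed, mred, noeb
```

When the approximate products match the exact ones, ERMS is zero and NoEB reaches the full output width 2n, as Eq. (3) requires.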

4 Hardware Analysis of Approximate Multiplier


The proposed approximate multiplier and the existing multipliers Mom [7], Ven [8], Ha [11], Akb [12] and Ahm [13] are designed in Verilog HDL. Synopsys VCS is used for verification of the waveforms. Synthesis is performed using Synopsys Design Compiler with the Synopsys 90 nm SAED library at the typical corner. The toggle frequency used is 1 GHz with an operating voltage of 1.2 V, and the temperature used is 25 °C. From Table 4, it can be observed that the proposed design outperforms the existing designs in terms of area, delay, power and power-delay product (PDP). The PDP is an important metric in digital design as it indicates the amount of energy used by the design when operated with a set of applied input

Table 4 Hardware comparison of approximate multipliers

Design          Area (µm2)  Delay (ns)  Power (µW)  PDP (µW ns)
Exact radix-4   10190       5.27        5437        28653
Proposed        8990        4.36        4690        20448
Mom [7]         9631        4.32        4981        21518
Ven [8]         10018       4.51        4819        21733
Ha [11]         9616        4.56        4713        21491
Akb [12]        9218        4.48        4903        21965
Ahm [13]        8998        4.27        4887        20867


Fig. 8 PDP comparison for different approximate multipliers


Fig. 9 Area vs MRED for different approximate multipliers

voltages. For the sake of fair comparison, in all the existing multipliers and also in the proposed multiplier, we have used approximate compressors in the 16 least significant columns of the partial product matrix. The PDP values of the proposed design are the minimum, signifying that this design possesses improved energy consumption. The proposed approximate multiplier is an energy-efficient design that reduces the overall power consumption of the processor. Table 4 shows an improvement of 12% in area, 17.3% in delay and 13.7% in power by the proposed Booth multiplier over the exact signed Booth multiplier. Figures 8 and 9 show the PDA (power × delay × area) analysis and area versus MRED of various approximate multipliers, respectively.

5 Hardware and Error Analysis of Proposed Approximate
Neural Network

The FPGA-based proposed approximate neural network is designed using Verilog HDL at the gate level. The neural network is assumed to be trained, and only the testing phase is used with approximate multipliers. This means that we wrote the values of

Table 5 Comparison of approximate neural network implementation on FPGA

Design            LUTs used (slices)  Energy consumption (mJ)  Classification accuracy (%)
Exact multiplier  36313/101400        13.62                    97.15
Proposed          35816/101400        11.21                    96.23
ASM [18]          36002/101400        12.95                    96.05

the trained weights into the on-chip memory. The neural network is built with two hidden
layers, each composed of 120 neurons with the rectified linear unit as the activation function.
Offline training is carried out on 60,000 images. Weights and inputs are converted to 16 bits,
since 16-bit approximate multipliers are used; floating-point notation is used during training.
The accuracy is calculated using 10,000 test images, with 400 epochs and a batch size of 80.
First, the neural network is implemented using exact multipliers for the calculations that take
place inside the neurons, and the classification accuracy is noted as shown in Table 5. Then,
the proposed 16-bit signed approximate multipliers are used in place of the exact multipliers.
Both the exact and approximate multipliers are implemented using look-up tables (LUTs) on a
XEM7350 Kintex-7 FPGA (XC7K160T-1FFG676C) board. Table 5 demonstrates that the FPGA
implementation of the proposed approximate neural network achieves lower energy consumption
with comparable classification accuracy relative to the existing design. It is observed that
8-bit or 16-bit multipliers provide sufficient accuracy, which is why most neural network
hardware implementations use low bit-width multipliers to increase energy efficiency. The
approximate neural network is also implemented, at simulation level, using various
state-of-the-art multipliers in place of the proposed multiplier. The variation in mean square
error with the exact multiplier and the proposed multiplier is shown in Fig. 10; the error
introduced by the proposed design is not much larger than that of the exact multiplier. Table 6
shows the accuracy of the proposed and existing multipliers compared to the exact multiplier.
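The 16-bit conversion of the weights and inputs can be illustrated with a small fixed-point quantization sketch. The exact fixed-point format is not stated in this section, so a signed Q1.15 format (1 sign bit, 15 fractional bits, values in [−1, 1)) is assumed purely for illustration:

```python
# Illustrative 16-bit quantization of floating-point weights.
# Assumption (not from the paper): a signed Q1.15 fixed-point format.
FRAC_BITS = 15
SCALE = 1 << FRAC_BITS              # 32768
INT16_MIN, INT16_MAX = -32768, 32767

def quantize(w):
    """Round a weight to a signed 16-bit fixed-point code, saturating."""
    q = int(round(w * SCALE))
    return max(INT16_MIN, min(INT16_MAX, q))

def dequantize(q):
    """Recover the real value represented by a 16-bit code."""
    return q / SCALE

q = quantize(0.3141)
print(q, dequantize(q))             # code and its reconstructed value
assert abs(dequantize(q) - 0.3141) < 1.0 / SCALE  # error within one LSB
```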

6 Conclusion

Since neural networks are computationally intensive, running a neural network application
on hardware necessitates low-energy and high-speed circuits. Multipliers are essential
circuits in digital design and are widely used in image processing and neural network
applications, yet they require more hardware resources and processing time than addition
and subtraction. In this paper, a novel compressor-based approximate multiplier is presented.
The proposed compressor is used in place of the exact compressor in a 16-bit signed Booth
multiplier, and the multiplier is then used to implement a neural network on FPGA. Due to the approximate mul-


Fig. 10 Error vs epoch to compare error in each epoch due to exact (red) and approximate (blue)
multipliers

Table 6 Comparison of approximate multipliers using neural network application

Design            Classification accuracy (%)
Exact multiplier  97.15
Proposed          96.23
Mom [7]           95.17
Ven [8]           95.63
Ha [11]           96.18
Akb [12]          96.07
Ahm [13]          95.35

tiplier, the neural network itself becomes approximate. A limitation of designing approximate
multipliers with approximate compressors is that it provides limited hardware reduction; it
can, however, be combined with truncation of partial products or with coding of the input
operands to achieve further energy savings. An important requirement for an approximate
multiplier used in neural network applications is scalability. The advantage of the scalable
design is that the proposed compressor circuit can be used in a multiplier of any bit-width:
4-bit, 8-bit, 32-bit and 64-bit multipliers can all use the proposed compressor. This allows
neural network hardware to be implemented on FPGAs or ASICs with variable precisions.

In the future, the proposed multiplier can be used in biomedical circuits for battery-operated
wearable devices, which employ machine learning-based classifiers. Due to the low PDA value of
the proposed multipliers, high-performance compact neural network-based wearable devices can
be designed.

References

1. J. Han, M. Orshansky, Approximate computing: an emerging paradigm for energy-efficient
design, in 18th IEEE European Test Symposium (ETS) (2013), pp. 1–6
2. A.D. Booth, A signed binary multiplication technique. Q. J. Mech. Appl. Math. 4(2), 236–240
(1951)
3. O.L. MacSorley, High-speed arithmetic in binary computers. Proc. IRE 49, 67–91 (1961)
4. L. Dadda, Some schemes for fast serial input multipliers, in IEEE 6th Symposium on Computer
Arithmetic (ARITH), Aarhus, Denmark (1983), pp. 52–59
5. D. Baran, M. Aktan, V.G. Oklobdzija, Energy efficient implementation of parallel CMOS
multipliers with improved compressors, in Proceedings of the 16th ACM/IEEE International
Symposium on Low Power Electronics and Design (ISLPED) (2010), pp. 147–152
6. A.K. Verma, P. Ienne, Automatic synthesis of compressor trees: reevaluating large counters, in
Proceedings of the Conference on Design, Automation and Test in Europe (2007), pp. 443–448
7. A. Momeni, J. Han, P. Montuschi, F. Lombardi, Design and analysis of approximate compres-
sors for multiplication. IEEE Trans. Comput. 64(4), 984–994 (2015)
8. S. Venkatachalam, S.B. Ko, Design of power and area efficient approximate multipliers. IEEE
Trans. Very Large Scale Integr. (VLSI) Syst. 25(5), 1782–1786 (2017)
9. Z. Yang, J. Han, F. Lombardi, Approximate compressors for error resilient multiplier design, in
Proceedings of IEEE International Symposium on Defect Fault Tolerance in VLSI Nanotech-
nology Systems (DTFS), pp. 183–186 (2015)
10. C.H. Lin, I.C. Lin, High accuracy approximate multiplier with error correction, in Proceedings
of IEEE 31st International Conference on Computer Design, ICCD, pp. 33–38 (2013)
11. M. Ha, S. Lee, Multipliers with approximate 4–2 compressors and error recovery modules.
IEEE Embed. Syst. Lett. 10(1), 6–9 (2018)
12. O. Akbari, M. Kamal, A. Afzali-Kusha, M. Pedram, Dual-quality 4:2 compressors for utilizing
in dynamic accuracy configurable multipliers. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 25(4), 1352–1361 (2017)
13. M. Ahmadinejad, M.H. Moaiyeri, F. Sabetzadeh, Energy and area efficient imprecise compres-
sors for approximate multiplication at nanoscale. AEU-Int. J. Electron. Commun. 110, 152859
(2019)
14. A.G.M. Strollo, E. Napoli, D. De Caro, N. Petra, G.D. Meo, Comparison and extension of
approximate 4–2 compressors for low-power approximate multipliers. IEEE Trans. Circuits
Syst. I: Reg. Pap. 67(9), 3021–3034 (2020)
15. S.S. Sarwar, S. Venkataramani, A. Ankit, A. Raghunathan, K. Roy, Energy-efficient neural
computing with approximate multipliers. J. Emerg. Technol. Comput. Syst. 14, 2 (2018)
16. B.J. Phillips, D.R. Kelly, B.W. Ng, Estimating adders for a low density parity check decoder,
in Proceedings of SPIE Optics + Photonics, vol. 6313 (2006), p. 631302
17. Z. Aizaz, K. Khare, Area and Power efficient truncated booth multipliers using approximate
carry based error compensation. IEEE Trans. Circuits Syst. II: Exp. Briefs (2021). https://doi.
org/10.1109/TCSII.2021.3094910
18. S.S. Sarwar, S. Venkataramani, A. Ankit, A. Raghunathan, K. Roy, Energy-efficient neural
computing with approximate multipliers. J. Emerg. Technol. Comput. Syst. 14(2), Article 16
(2018)
Explainable AI (XAI) for Social Good:
Leveraging AutoML to Assess
and Analyze Vital Potable Water Quality
Indicators

Prakriti Dwivedi, Akbar Ali Khan, Sareeta Mudge, and Garima Sharma

Abstract Water is the most fundamental need of mankind, and its demand has been ever
increasing with the growth of the world's population. Regrettably, planet earth is witnessing
a steep decrease in water quality, leading to various diseases and deficiencies in the human
body. The immense pressure to meet demand has led not only to a reduction of important
minerals in water but also to an imbalance in their proportions. This neglect of the
fundamental need for good-quality water has reached a point where global attention is
required; consequently, steps are being taken to enhance awareness, and research is being
conducted to bring healthy potable water within the reach of everyone. This paper attempts
to align with the philosophy of 'AI for Social Good' to address this problem. Experimental
results of this paper include an accuracy of 97.58%, AUC of 0.9939, recall of 0.8521,
precision of 0.9163, F1 score of 0.8831, Kappa of 0.8696 and MCC of 0.8703, with a run time
of 150 s.

Keywords Environmental AI or AI for environment · PyCaret · AutoML · Water ·
Machine learning · Explainable AI · Industry 4.0 · Water quality · AI for social good

1 Introduction

The worldwide criticality of the demand for healthy potable water, as well as the
simultaneous degradation of its quality over time, has always been under-stated. Here,
quality degradation refers to the depleting mineral level that is caused by the high
P. Dwivedi · A. A. Khan (B)
Research & Business Analytics, Prin. L.N. Welingkar Institute of Management Development &
Research, Mumbai 400019, India
S. Mudge
Assistant Professor-Research and Business Analytics, Prin. L.N. Welingkar Institute of
Management Development and Research, Mumbai 400019, India
G. Sharma
Associate Dean-IIC & Research, Prin. L.N. Welingkar Institute of Management Development &
Research, Mumbai 400019, India
e-mail: garima.sharma@welingkar.org

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 591
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_45
592 P. Dwivedi et al.

pressure on potable water production to meet the ever-growing demand. For this purpose, sea
water is converted into potable water by desalination, a process that generally involves
removing mineral components from saline water to make it drinking-ready. This makes water,
the most fundamental human need, just good enough to drink, but lacking the requisite mineral
content; this may lead not only to diseases such as cholera and diarrhoea but also to
multi-mineral deficiency in the human body, as water is the best and most balanced supply
source for various mineral inputs. The threshold levels for some of the minerals in water are
as follows: perchlorate, 56 OD; ammonia, 32.5 OD; nitrates, 10 OD; radium, 5 OD; etc. Here,
OD stands for Optical Density, the unit of measurement of minerals in water per litre. The
intake of such inappropriately mineralized water has become so alarming that it is now a
universal problem requiring immediate attention and action. The UN General Assembly's
Sustainable Development Goal (SDG) aims to provide universal access to safe and affordable
drinking water for all by 2030. This has led to an increased effort at the global level
through a variety of public and private organizations, NGOs, governments etc. Moumen et al.
[1] dwell on the relevance of the SDG goals, focus on Morocco's water management plan, and
highlight the fact that despite the building of many dams, ostensibly for economic growth,
the country still lacks safe drinking water and sanitation. Various technology infusions have
been made to achieve this objective, and the introduction of AI has been a significant
facilitator, showing enormous potential to leverage AI for social good [2] and making the
goal attainable for Society 5.0. Gunning and Aha [3] and Adadi and Berrada [4] conducted a
variety of research in this domain applying machine learning techniques with an Explainable
AI (XAI) approach. But the use of such an approach in the water sector, to benefit from
robust prescriptive analytics with the aid of the Explainable AI concept, is still rare.
Therefore, this paper attempts to use an AutoML approach to provide prescriptive analytics in
this domain, which serves as a catalyst for SDG goal 6 of the UNGA and also gives clear
awareness of the importance of minerals in water content to the largest stakeholder, which is
every human being.

2 Literature Review

Numerous studies have been conducted in this field with the purpose of better prediction of
water quality using various AI algorithms or models. Ahmed et al. [5] implemented three
different models using wavelet de-noising techniques (WDT) to predict the water quality of
the Johor River; the resulting WDT-ANFIS (Adaptive Neuro-Fuzzy Inference System) model
improved the predictive precision for the water quality parameters. Elkiran et al. [6]
applied a series of AI models for water quality parameter modelling, showing that the
neural-based ensemble model improves predictability by up to 14%. Ahmed et al. [7] estimated
the WQI through a supervised ML algorithm that includes gradient boosting and a multi-layered
perceptron (MLP), with an accuracy of 85.07%. Lu
Explainable AI (XAI) for Social Good … 593

et al. [8] studied the same using the two hybrid decision tree-based models and
advanced denoising method. Chen et al. [9] identified that big data would help
in improving the quality of water. Yilma et al. [10], Abbasi et al. [11] and Ehte-
shami et al. [12] have validated their models using ANN and WQI approaches for
modelling, respectively. Barzegar and Moghaddam [13] have discussed salinity in groundwater
as one of the important issues. Considering the Tabriz Plain confined aquifer, the authors
compared the results after investigating the accuracy of three different neural networks,
viz. the multi-layer perceptron (MLP) neural network, the radial basis function neural
network (RBFNN) and the generalized regression neural network (GRNN). According to them, a
committee neural network combining the GRNN, RBFNN and MLP performs better than any of the
individual artificial neural networks. Hussain et al. [14] have calculated the health risk factors which are
associated with the intake of impure water. Their study was confined to the region of
Pakistan and nearby provinces. Several elements of water are absorbed by the body
and can lead to chronic diseases. Gao et al. [15] have traced the concentrations of metals in
water against drinking standards through their study; it was found that arsenic was a
dominating pollutant with carcinogenic effects that cannot be neglected. Chatalova et al.
[16], on the other hand, have highlighted the challenges faced by water-related sectors, viz.
agriculture etc. Havelaar et al. [17] have made a comparative analysis through a case study
on a hypothetical potable water supply from surface water. Their region of study is confined
to the Netherlands, and they used disability-adjusted life-years (DALYs) to calculate the
positive and negative effects of consuming disinfected water; as a result, the risk of
infection by Cryptosporidium parvum was found to be lowered by ozonation of the water.
Gutiérrez et al. [18] have reported the complexities faced and the immediate need for the
incorporation of drinking water legislation. Acharya et al. [19] have pressed upon the fact
that the quality of rivers and lakes is declining in spite of so much advancement in
technology, because of human activities and sudden climatic changes; their study specifically
focusses on rivers and lakes in Nepal. As per Munos et al. [20], the availability of clean
drinking water is clearly a worldwide criticality, and it has been one of the major reasons
for 1.7 million deaths annually due to water-borne diseases like diarrhoea. Various
researches have been done using different approaches to predict the quality of water before
consumption. This paper is a step-forward attempt to explain the amount of each element
required for water to be safe for drinking, using an XAI approach through AutoML, with a
higher accuracy of 97.58%. He et al. [21] have also stated that AutoML provides promising
results without human intervention, and its outcomes are reliable enough to build a deep
learning system.
594 P. Dwivedi et al.

3 Research Design

The research design schema proposed for this paper as shown in Fig. 1 can be divided
into six steps which are:

3.1 Data Gathering

The use case for this research [22] is gathered from Kaggle through secondary research; it is
tailored for education, research practice and the acquisition of adequate knowledge of the
basic mineral composition of water, in order to conclude whether the water is safe to drink
or not. The dataset consists of 8000 data points and 20 features excluding the target
variable, safeness of water, which is categorized into safe and not safe. The independent
variables in this dataset are the minerals and their Optical Density (OD) levels, some of
which are aluminium, perchlorate, ammonia, nitrate etc.

3.2 Data Understanding

Here, a sound and clear understanding of the dataset is required, as it forms the
prerequisite for the next stage, pre-processing of the data. A clear understanding of the
target variable, which is of binary classification type in this case, is attained in this
step. It is a vital stage because it decides the algorithm to be deployed later in the
machine learning pipeline.

Fig. 1 Research design


Explainable AI (XAI) for Social Good … 595

3.3 Data Pre-processing

This step involves the initial cleaning of the dataset, including checks for any unusual data
patterns, outliers or missing values, one-hot encoding, and balancing of the target variable
using the Synthetic Minority Oversampling Technique (SMOTE) statistical method.
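The interpolation idea behind SMOTE can be sketched in a few lines: each synthetic minority sample lies on the segment between an existing minority sample and a nearby minority neighbour. This is a minimal pure-Python illustration; real SMOTE implementations (e.g. in the imbalanced-learn library) use k nearest neighbours rather than the single neighbour used here.

```python
import random

def smote_sketch(minority, n_new, seed=0):
    """Create n_new synthetic samples by interpolating each chosen minority
    sample toward its nearest minority neighbour (k = 1 for brevity)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min((p for p in minority if p is not base),
                        key=lambda p: dist2(base, p))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.3)]  # toy minority class
new_points = smote_sketch(minority, n_new=4)
print(len(new_points))  # 4 synthetic points between neighbouring samples
```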

3.4 Exploratory Data Analysis (EDA)

Each independent variable is examined, and various univariate and bivariate analyses are
performed to better visualize the data points. Subsequently, NaN or empty values are imputed
and any outlier(s) are processed appropriately.

3.5 Model Experimentation

This stage can further be divided into two steps: Model Building and Model
Evaluation.

3.5.1 Model Building

Before starting this step, it is important to split the dataset into train and test data,
where 80% of the data is allocated for training and the remaining 20% for testing. The data
is then fed into the model through an AutoML approach using the PyCaret machine learning
library, which automates the tasks of model selection and hyper-parameter tuning and returns
a list of models ranked, via various metrics, on accuracy and efficiency over time. The model
at the top of the list for the dataset in this paper is the Light Gradient Boosting Machine
(LGBM), showing an accuracy of 96.58%, which was further improved to 97.58% after
optimization. In other words, out of every 100 predictions made, about 97 correctly indicate
whether the minerals in water are present in optimum quantity or not. LGBM is a tree-based ML
algorithm which uses a histogram-based technique and gradient-based one-side sampling to
filter the data. The other ranked models in the list are the Gradient Boosting Method (GBM),
Random Forest classifier, Decision Tree and AdaBoost classifier, respectively, which form the
top 5 suggested models here.
596 P. Dwivedi et al.

3.5.2 Model Evaluation

The working efficiency of the models is decided based on various evaluation metrics. Being a
binary classification problem, the evaluation metrics used in this paper are:
a. Confusion Matrix—It is an n × n square matrix, where n represents the number of
classes present in the target variable; the rows of the matrix represent the actual
class and the columns represent the predicted class. The four entries of the matrix
are True Positives (T.P), False Negatives (F.N), False Positives (F.P) and True
Negatives (T.N). Parameters that can be derived from the matrix are:
a.1. True Positive Rate and True Negative Rate—the proportions of actual
positives and actual negatives, respectively, that are correctly predicted,
computed over the total actual positives for the true positive rate and the
total actual negatives for the true negative rate, as represented in
Eqs. (1) and (2):

T.P Rate = T.P/ (T.P + F.N) (1)

T.N Rate = T.N/ (F.P + T.N) (2)

a.2. False Positive Rate and False Negative Rate—the proportions of actual
negatives predicted as positive and actual positives predicted as negative,
respectively, computed over the total actual negatives for the false
positive rate and the total actual positives for the false negative rate,
as shown in Eqs. (3) and (4), respectively.

F.P Rate = F.P/ (F.P + T.N) (3)

F.N Rate = F.N/ (T.P + F.N) (4)

b. Accuracy Rate—As per Eq. (5), it is the ratio of the number of correct predictions
over the total number of predictions, which makes it one of the most important
evaluation metrics for any classification model.

Accuracy Rate = (T.P + T.N)/ (T.P + T.N + F.P + F.N) (5)

c. Precision (P)—As per Eq. (6), it is the ratio of correctly predicted positives
over the sum of all predicted positives. It is used when the model cannot afford
many false positives; avoiding a false positive is more vital than encountering a
false negative.

Precision = T.P/(T.P + F.P) (6)


Explainable AI (XAI) for Social Good … 597

d. Recall (R)—As per Eq. (7), it is the ratio of correctly predicted positives over
the sum of all actual positives. It is used when the model cannot afford many
false negatives; avoiding a false negative is more crucial than encountering a
false positive.

Recall = T.P/(T.P + F.N) (7)

e. F1 Score—As per Eq. (8), it is the harmonic mean of precision and recall, which
indicates the accuracy of the classifier in terms of the number of instances in
which it classifies the classes of the target variable correctly. It is mostly
used in combination with other evaluation metrics.

F1 = 2/[(1/Precision) + (1/Recall)] (8)

f. AUC—Area Under the Curve measures the whole area falling under the ROC
curve. It ranges between 0 and 1; the higher the AUC of the model,
the better the model is at distinguishing between the positive and negative classes.
g. Cohen's Kappa Co-efficient—As per Eq. (9), it is a measure of the agreement
between predicted and actual classes, corrected for agreement expected by chance;
it represents the degree to which the data correctly represent the measured
value. It ranges between 0 and 1.

Kappa (κ) = (t_o − t_e)/(1 − t_e) (9)

Here,
t_o = the observed agreement among the variables.
t_e = the hypothetical probability of chance agreement.
h. Matthews Correlation Co-efficient (MCC)—As per Eq. (10), it is the measure that
gauges the effectiveness of the entire confusion matrix with a single value.
The higher the value of MCC, the better the model is at making correct predictions.

MCC = (T.P × T.N − F.P × F.N)/√[(T.P + F.P)(T.P + F.N)(T.N + F.P)(T.N + F.N)]
(10)

i. Model Deployment—It is the final stage of the machine learning pipeline where
the trained ML model is taken and made available to users and to other systems.
It is beyond the scope of the study.

4 Result and Interpretation

The results attainment process begins with the Exploratory Data Analysis of the
dataset, which is followed by various other analyses.

Fig. 2 K.D.E Plot and Violin plot for the model

4.1 Exploratory Data Analyses (EDA)

Here, the first step is to analyze all the independent variables with reference to the target
variable, which is 'is_safe' in this case. The visuals for the same, shown in Fig. 2,
comprise violin and Kernel Density Estimate (KDE) plots, which give an overall bird's-eye
view of the variables and their distribution patterns; this in turn helps to check for
skewness of the data and to detect and remove outliers, if any.
The heat map shown in Fig. 3 shows the degree to which two variables are associated with each
other. A negative correlation value indicates an inverse relation between the two variables,
and a positive correlation value means a direct relation. A very high correlation value
between any two independent variables can lead to multi-collinearity, which needs to be
removed. In this dataset, no such case of multi-collinearity was detected.

4.2 Model Comparison

The most vital feature of PyCaret—that of giving a list of models ranked on the basis
of their accuracy along with other evaluation metrics—was utilized to shortlist the
top 5 models for further optimization.
Table 1 shows the list of the top 5 default and optimized models ranked on the basis of their
accuracy. Other evaluation metrics like AUC, recall, precision, F1 score, Kappa and MCC are
also given, which helps in selecting the top model for the further steps of obtaining its
feature importance and other related results. From Table 1, it can be concluded that the top
model suitable for

Fig. 3 Heat map of the model

Table 1 Model comparison before and after optimization

Model name  Accuracy  AUC     Recall  Precision  F1      Kappa   MCC     TT (s)
Evaluation metrics for top-5 default models
Light GBM   0.9658    0.9852  0.7887  0.9036     0.8408  0.8218  0.8250  0.261
GBC         0.9567    0.9782  0.7055  0.8986     0.7862  0.7627  0.7720  1.079
RF          0.9504    0.9713  0.6436  0.8971     0.7467  0.7203  0.7340  0.880
DT          0.9482    0.8755  0.7811  0.7744     0.7763  0.7470  0.7429  0.041
AdaBoost    0.9274    0.9289  0.5331  0.7643     0.6256  0.5872  0.6001  0.317
Evaluation metrics for top-5 optimized models
Light GBM   0.9758    0.9939  0.8521  0.9163     0.8831  0.8696  0.8703  150
GBC         0.9641    0.9836  0.7726  0.9070     0.8336  0.8136  0.8171  900
RF          0.9655    0.9865  0.7603  0.9336     0.8367  0.8177  0.8239  1380
DT          0.9568    0.9523  0.7374  0.8742     0.7990  0.7750  0.7791  15
AdaBoost    0.9341    0.9371  0.5681  0.8157     0.6667  0.6316  0.6460  960

this dataset is the Light Gradient Boosting Machine, which has an accuracy of 96.58% for the
default model and 97.58% after optimization. Another evaluation metric of LGBM, the F1 score,
is observed to be 0.8831 after model optimization, which is a good score to move ahead with;
it also reflects the correctness of the recall and precision values, as the F1 score is the
harmonic mean of both.

Light GBM is a tree-based learning algorithm capable of handling large-scale data with a low
model run time. The basic tree-based structure, as shown in Fig. 4, can be used for
classification, ranking and other tasks.
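Setting aside LightGBM's specific optimisations (histogram binning, leaf-wise tree growth, gradient-based sampling), the gradient-boosted-trees idea underneath can be sketched with depth-1 trees (stumps) fitted to residuals. This is a generic illustration of the boosting scheme, not an implementation of LightGBM itself:

```python
# Toy gradient boosting with decision stumps under squared loss.
def fit_stump(xs, residuals):
    """Pick the threshold split minimising squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Repeatedly fit stumps to the residuals of the running prediction."""
    pred, stumps = [0.0] * len(xs), []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]  # gradient of squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]                  # toy binary target
model = boost(xs, ys)
print([round(model(x)) for x in xs])     # → [0, 0, 0, 1, 1, 1]
```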
The Matthews correlation co-efficient (MCC) is one of the critical metrics, as it gauges the
aptness of the entire confusion matrix with a single value; a value nearer to 1 indicates a
near-perfect model. For the above dataset, the MCC score of LGBM after tuning is 0.8703,
which indicates that the classifier is good enough. The Kappa value for the same is 0.8696,
which indicates sound reliability between the two variables.
Figure 5 shows the confusion matrix for the test data of the top model, where the predicted
class is shown as the columns and the actual class as the rows. The target variable labelled
0 indicates that the water is not safe to drink, and 1 that the water is safe. The true
negative, false positive, false negative and true positive values for the above confusion
matrix are 2122, 20, 38 and 219, respectively; the values 2122 and 219 are the correct
predictions of the model.
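Equations (5)-(10) can be verified directly against these confusion-matrix counts, which reproduce the reported metrics (219 safe samples predicted correctly, 2122 unsafe samples predicted correctly, 20 false positives, 38 false negatives):

```python
import math

TP, TN, FP, FN = 219, 2122, 20, 38        # counts from the test-set matrix
total = TP + TN + FP + FN                 # 2399 test samples

accuracy = (TP + TN) / total              # Eq. (5)
precision = TP / (TP + FP)                # Eq. (6)
recall = TP / (TP + FN)                   # Eq. (7)
f1 = 2 / (1 / precision + 1 / recall)     # Eq. (8), harmonic mean

# Cohen's kappa, Eq. (9): observed vs chance agreement.
t_o = accuracy
t_e = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / total ** 2
kappa = (t_o - t_e) / (1 - t_e)

# Matthews correlation coefficient, Eq. (10).
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
print(round(f1, 4), round(kappa, 4), round(mcc, 4))
# → 0.9758 0.9163 0.8521 then 0.8831 0.8696 0.8703, matching Table 1
```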

Fig. 4 Basic structure of LGBM

Fig. 5 Confusion matrix for the model



Figure 6 shows the discrimination threshold plot for the model, i.e. the probability score
above which the positive class is chosen over the negative one; a balanced default is 0.5.
The threshold value for the above model is 0.47, indicating a near-perfect balance between
the cases.
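A discrimination-threshold analysis of this kind amounts to sweeping candidate probability cutoffs and scoring each one. A toy sketch that picks the cutoff maximising F1 (the scores and labels below are illustrative, not the paper's data):

```python
def best_f1_threshold(scores, labels, steps=100):
    """Sweep probability cutoffs and return the one maximising F1."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, steps):
        t = i / steps
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue
        p, r = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * p * r / (p + r)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores = [0.05, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.95]  # toy probabilities
labels = [0, 0, 0, 1, 0, 1, 1, 1]
t, f1 = best_f1_threshold(scores, labels)
print(t, round(f1, 3))  # → 0.36 0.889
```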
Figure 7 shows the ROC curve along with its AUC score for the top model. The ROC curve plots
the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, and the
AUC score is a measure of separability

Fig. 6 Discrimination threshold plot for the model

Fig. 7 ROC curve for the LGBM classifier



Fig. 8 Precision and recall curve of the model

Fig. 9 Validation curve for LGBM classifier

indicating how well the model is able to differentiate between the classes. In Fig. 7, the
AUC for the ROC of both classes, 0 and 1, is 0.99, which shows that the model is able to
predict almost perfectly whether the water is safe for drinking or not.
Figure 8 reflects the precision-recall curve for LGBM, where the average precision comes out
to be 0.96, which shows that the model detects the positive values well.

Figure 9 shows the learning curve for the classifier, better known as the validation curve of
the model. It is a tool which makes it easy to determine whether a bias error or a variance
error affects the estimator more.

4.3 Explainable AI (XAI)

Also termed interpretable AI, it is an approach in which the outcome of the classifier can
easily be interpreted by humans with a complete understanding of the entire machine learning
path, thus demystifying the black-box machine learning approach. This approach of Explainable
AI is termed a white-box approach, as it gives deeper insight into the model, which can help
the various stakeholders make better business decisions and also forms the base for
prescriptive analytics. It comprehends the model via numerous plots such as the feature
importance plot, SHAP plot etc.
The feature importance plot is a feature-based visual in which every feature is individually
assigned a score, called the feature importance score, obtained using various permutation
techniques. The importance of a feature is judged by the rate of change in its score when any
change or modification is made in the model. A high feature importance score indicates not
only the higher importance of that variable for the dataset but also its high degree of
association with the target variable. Beyond this, it can additionally help in dimensionality
reduction, thus shortening the model run time.
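The permutation idea behind such scores can be sketched directly: permute one feature column at a time and measure the drop in the model's accuracy. (Real implementations shuffle randomly and average over repeats; a deterministic column reversal is used here so the sketch is reproducible.)

```python
def permutation_importance(predict, X, y, n_features):
    """Accuracy drop when each feature column is permuted in turn."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    drops = []
    for j in range(n_features):
        col = [row[j] for row in X][::-1]  # deterministic stand-in for a shuffle
        permuted = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, col)]
        drops.append(base - accuracy(permuted))
    return drops

# Toy data: feature 0 fully determines the label, feature 1 is pure noise.
X = [(0, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 0)]
y = [0, 1, 0, 1, 0, 1]
model = lambda row: row[0]               # stand-in for a trained classifier
print(permutation_importance(model, X, y, n_features=2))
# → [1.0, 0.0]: permuting feature 0 destroys accuracy; feature 1 does not
```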
In Fig. 10, the top five features are perchlorate, ammonia, nitrates, radium and
chloramine which in this case means that the presence of the above mentioned mineral
is of utmost important to declare water as mineral rich and fit for drinking.

Fig. 10 Feature importance plot of the model



Fig. 11 SHAP bar and correlation plot

SHAP (Shapley) values, a widely used approach from game theory, also provide an edge to
Explainable AI: the contributions of the input variables sum to the difference between the
current and expected model output for the prediction. The SHAP bar plot in Fig. 11 indicates
the sum of mean SHAP values over both classes of the target variable, that is, whether the
water is safe for drinking or not. For the above dataset, the sum of mean SHAP values is
highest for aluminium, coming out to be around 2.7, followed by cadmium at 2.4. The SHAP
scatter plot in Fig. 11 shows the correlation of the top 2 features obtained from the SHAP
bar plot. This indicates that the higher the average predicted value, the higher its
importance and effect on the target variable.
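The game-theoretic definition behind SHAP can be made concrete for a tiny two-feature "game": each feature's Shapley value is its average marginal contribution over all orders in which the features can join the coalition. The value function below is illustrative only, not the paper's model:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_p = value(set(coalition) | {p})
                without_p = value(set(coalition))
                total += weight * (with_p - without_p)
        phi[p] = total
    return phi

# Toy "model output" over two features: alone, A adds 10 and B adds 4;
# together they yield 16 (a small interaction effect of 2, split evenly).
v = lambda s: {frozenset(): 0, frozenset({"A"}): 10,
               frozenset({"B"}): 4, frozenset({"A", "B"}): 16}[frozenset(s)]
print(shapley_values(["A", "B"], v))  # → {'A': 11.0, 'B': 5.0}
```

The two values sum to 16, the full-coalition output, which is the efficiency property that SHAP explanations rely on.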
In the individual force plot (Fig. 12), red depicts the features driving the prediction above
the base value, whereas those driving the prediction below the base value are shown in blue.
Here, the base value is the average of the estimator's output distributed over the whole
input space. The model predicted output for the LGBM model, f(x), comes out to be −4.54,
which is much higher than the average predicted value of −5.9. This indicates the high
contribution of particular features to a high output value. As per Fig. 12, the contribution
of the mineral aluminium

Fig. 12 Individual force plot


Explainable AI (XAI) for Social Good … 605

as an independent variable towards the potability of water is 0.07 higher than the
average model-predicted output.

5 Conclusion

This research paper has proposed the Light Gradient Boosting Machine algorithm, as it
was at the top of the leader-board of accuracy-ranked models suggested by PyCaret,
to conclude whether the water is safe for drinking or not. The proposed model,
with an optimized accuracy of 97.58% along with various other evaluation metrics
like AUC, Kappa coefficient, precision, recall, and MCC, shows a very promising and
satisfactory result, good enough to contribute to this domain and help its
various stakeholders. The top three features obtained from the feature importance plot
are perchlorate, aluminium, and ammonia, which shows that these are the most important
minerals to be present in water to make it mineral rich and best fit for drinking.
The unique approach of Explainable AI in this research provides an extra edge
by indicating the features (minerals) which should necessarily be present in water. It
does so with the help of the Shapley bar and correlation plots, which show that aluminium,
cadmium, and silver are the minerals whose proper and adequate combination can
make water mineral rich and good enough for drinking. The conclusions made by
various organizations working in this domain, and by domain experts, regarding the
water quality threshold and the important minerals match the results shown in
this paper. Hence, it can be concluded that machine learning techniques can aptly be
applied to resolve such issues. Therefore, proper attention by various water giants
and municipal corporations to the presence of these minerals should be a special
focus, not only to make water potable but also to make this most fundamental human
need easily accessible to all in a healthier manner.

Prediction of Dynamic Virtual Machine
(VM) Provisioning in Cloud Computing
Using Deep Learning

Biswajit Padhi, Motahar Reza, Indrajeet Gupta, Poorna Sai Nagendra,


and Sarath S. Kumar

Abstract The increasing usage of remote services and high marketplace compe-
tition requires cloud service providers to plan and provision computing resources
efficiently, while providing affordable services and managing their data center expen-
ditures. Generally, IaaS cloud resources are managed by predicting either the long-term
workload or the long-term resource utilization pattern. But this does not give any
genuine information about the necessary memory/CPU before a VM is exposed to the
physical machine. So, the prediction of dynamic virtual machine (VM) provisioning is a
challenging problem in cloud computing. In this paper, we explored the CPU usage details
of VMs in the Azure cloud dataset to predict utilization patterns. The dataset is used to
train several deep learning models. Training with CPU utilization as the target class,
we predict the minimum, maximum, and average CPU utilization values. The results
are then analyzed using multiple evaluation metrics. After evaluating the different
models, we conclude that the GRU performs better in predicting the CPU utilization.

Keywords Virtual machine · Deep learning · Gated recurrent unit · Long


short-term memory · Microsoft Azure cloud

B. Padhi
National Institute of Science and Technology, Berhampur, Odisha 761008, India
M. Reza (B)
Department of Mathematics, GITAM University Hyderabad, Hyderabad 502329, India
e-mail: mreza@ieee.org
I. Gupta
Bennett University, Greater Noida, Uttar Pradesh 201310, India
P. S. Nagendra
RVR and JC College of Engineering, Guntur, Andhra Pradesh 522019, India
S. S. Kumar
Saintgits College of Engineering, Kottayam, Kerala 686532, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 607
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_46

1 Introduction

The world has witnessed that cloud computing has emerged as a well-adopted
computing paradigm that offers computing resources such as CPU, memory, server,
network, and platform beyond the geographic boundaries. Cloud services are acces-
sible over the Internet, in a seamless manner from individual to enterprise-level on
pay-per-use basis. However, the COVID-19 pandemic crisis has been like an
examination that no one was prepared for. All the offices, educational institutes,
businesses, scientific research, entertainment, and even personal work have shifted
from personal interaction to the virtual world. Such a massive shift to the digital
domain would have been impossible a decade ago; it would have been difficult even
to imagine. The crisis has tested the digital efficiency of online applications and
their ability to operate at a reliable, secure, and dramatic scale, a tremendous
achievement that is only possible due to cloud computing. Cloud service providers
(CSP) manage their data centers and provide access to all services for their
end-users based on the service level agreement (SLA).
Large companies already have their own cloud servers to handle their day-to-day
work. But smaller companies and academic institutions are now adapting to this
new normal and shifting their work to online platforms for the safety and security
of their stakeholders. However, these institutions depend on commercially
available cloud servers to fulfill their requirements. Because of this, the
number of online classes and labs, conference calls, Zoom meetings, etc., has
skyrocketed, and millions of new users are now actively seeking a good cloud
provider. Due to this high marketplace competition, providers face many difficulties
in offering attractive features and services while managing their data center
expenditures. Cloud service providers now need to handle both over- and under-
utilization of cloud resources to manage the quality and cost of the service. This
over/under-utilization of the resources is also referred to as the load and demand issue.
A load in a cloud server can comprise CPU utilization, memory capacity, network
traffic, etc., and load balancing is the process of distributing the workload
among various servers to optimize resource utilization and avoid overloading. To
achieve this load balancing, cloud servers implement various load balancing algo-
rithms (LBA). Based on the system state, LBAs can be static (e.g., round robin,
min–min, max–min), where the load balancing is decided at compile time
based on processing power, memory, etc., or dynamic (e.g., ant colony optimization,
biased random sampling), where load balancing is done at run time based
on different policies and the state of the nodes. In [1], the authors have explained the
working, pros, and cons of these algorithms in detail. Although these algorithms
do a decent job of real-time load balancing in servers, they lack the ability to track
trends and patterns in the workload, which could be useful for further optimization
of resource utilization. Therefore, forecasting virtual machine (VM)
provisioning in cloud servers is now a learning problem.
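As a concrete illustration of the static end of that spectrum, a round-robin balancer cycles through servers in a fixed order regardless of their current load, which is exactly the blindness to workload trends noted above. The sketch below is illustrative, not a production implementation:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Static load balancer: assigns requests to servers in fixed rotation,
    ignoring each server's current load."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def assign(self, request):
        # The request itself is never inspected; assignment is purely positional.
        return next(self._rotation)

lb = RoundRobinBalancer(["vm-a", "vm-b", "vm-c"])
targets = [lb.assign(f"req-{i}") for i in range(5)]
# Requests wrap around the server list in order.
assert targets == ["vm-a", "vm-b", "vm-c", "vm-a", "vm-b"]
```

A dynamic LBA would instead consult per-node state (queue length, CPU load) before each assignment, which is where workload prediction becomes useful.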
Cloud resource prediction and VM provisioning have drawn the attention of the
cloud computing research community. Machine learning (ML) and artificial intelligence
(AI) have been applied to predicting cloud resources, but
no existing research work claims definitive cloud resource prediction. The models
developed for this problem can be broadly divided into three groups [2]: analytical,
computational intelligence, and simulation models. The analytical model follows
the greedy approach to reduce the search time. The models such as fuzzy logic,
neural networks, genetic algorithms, and multi-agent systems are used in computa-
tional intelligence models. Overdriver, memory-buddies, and VMCtune are used in
simulation model-based prediction.
Many models have been proposed to deal with the problem of cloud resource
prediction across the data center for fulfilling the SLA parameters. In [2], the authors
proposed a Bayesian model against the workload patterns of Amazon and EC2. The
model can decide the slow and fast VM resources but is only applicable for computing
and memory-optimized resources. However, transaction throughput and latency into
underlying resources, e.g. vCPU cores have not been considered here. In the paper
[3, 4], the authors took a data-centric approach for generating a prediction model for
cloud resources. Here, both regression models and RNNs are used to devise a model.
For regression models, the author chose to predict 95 percentile CPU usage data
separately for delay-insensitive and interactive VMs. The authors concluded that the
delay-insensitive VMs are more stable than interactive VMs. In the RNN model,
the author used LSTM to predict the minimum (min), maximum (max), and average
(avg) CPU usage for a specific VM in the overall data. Much research investigated
and modeled for resource allocation provisioning in different cloud-based models
[4–7]. In [8], authors had reported a cost-saving benefit of dynamic scaling of cloud
resources which does not require any setup time. But the approach was designed
for workload prediction based on past traces and generally, it is applicable for auto-
scaling only. The limitations of the work can be marked as unarranged historical
data for a specific application domain. Chen et al. [9] proposed an EEMD-ARIMA
method for cloud resource prediction which is a short-term method and based on
ensemble empirical mode decomposition. They verified the effectiveness of this
method and compared the experimental results of the EEMD-ARIMA method with
those of the ARIMA model in terms of RMSE, MAE, and MAPE.
Cloud resource waste is the main concern of most IT companies around the world
that are using public cloud services. The public cloud customers will spend more than
an astonishing amount of 50 billion dollars on the IaaS from the providers such as
AWS, Azure, Google, and VMware [10]. This boom has arrived due to the compar-
atively broader adoption of the public cloud services and further within the existing
accounts, the expansion of the infrastructure. Many a time, the growth in spending
surpasses the growth in commerce. The primary reason is that a significant chunk
of what companies spend on cloud resources gets wasted. Cloud resource waste holds
significance not only in terms of resources that are not used but also in the spending
on resources that is largely ignored and goes unchecked. Therefore, keeping a
check on usage, and adequately monitoring resources, is the need of the hour. This applies
to both small-scale enterprises and large organizations carrying out daily tasks
in a proper, cost-efficient manner. The leading causes of this issue are as follows: In
cloud resource prediction, there are two components, i.e., the actual computing load
and the maximum computing load. Generally, the companies opt for the maximum
computing load, in the way of making sure that everything is running smoothly when
the needs come to utilize the resources at their full capacity. However, such situations
do not often come up, where on regular days, the consumption requirement is much
lower. Idle resources are mainly seen in development centers. Here, various stages
like testing, staging, and various other courses of action are seen. Around 80% of the
organizations and data centers are occupying more server capacity than they require.
They are not only increasing their budget, but they are also proving to be problems for
the service providers. People are not utilizing pay-per-use cloud services to improve
efficiency and utilizing the resources, as per the demand. Although cloud resources
are not like fossil fuels, it has it is own kind of limitations too. Therefore, dynamic
cloud resource prediction is a recognized hot topic of research.
Although there are several techniques available to characterize and predict the
workload, IaaS cloud resources are mostly managed and used efficiently by
forecasting the long-term workload or the long-term resource utilization pattern. Cloud
service providers can now collect metrics from their own infrastructures, analyze
them with proper ML techniques, and effectively enhance performance. The type and
nature of the workload at a public cloud are never fixed; therefore, more capable
techniques are required to forecast the workload on cloud servers. In this paper, we have
tackled the above issue. Here, the Microsoft Azure dataset [11, 12] is taken into
consideration and various deep learning techniques are applied to it for predicting
the future workload, then these models are compared to find the best one. After
analyzing the whole Microsoft Azure cloud dataset, we looked deeper into the CPU
usage details of specific VMs, which are grouped on the basis of their timestamps, to
predict future utilization patterns. To select the best suitable model for our project,
several machine learning and deep learning models like GRU, LSTM, and IndRNN
have been tested as these models perform better on time series data. The normalized
dataset is used to train the prediction models. Training with CPU utilization as the
target class is done using these deep learning algorithms. We used Google Colab
notebooks to run the entire analysis. The “min CPU,” “max CPU,” and “avg CPU”
utilization values are predicted using these machine learning techniques, and the most
suitable model is selected. The results are evaluated using mean absolute percentage error
(MAPE), mean absolute error (MAE), and root mean square error (RMSE).

2 Dataset

The Azure dataset shows part of the actual first-party virtual machine (VM)
workload of Microsoft Azure in one region. The first-party workload is comprised of internal
VMs and first-party VM services. This dataset, released in 2017, consists of 2,013,767
VMs [11] acquired over 5958 Azure subscriptions. The time series data were
collected over a period of 30 days and contain a VM information table and VM CPU
utilization readings taken every 5 min. In total, 1,246,539,221 VM CPU utilization
readings are available in the dataset. The whole dataset is divided

into over 128 files. It consists of 1 subscription, 1 deployment, 1 VM table, and 125
vm_cpu_reading files. Characteristics of the dataset include (1) encrypted identifi-
cation number of the VMs and deployment and subscription to which it belongs (2)
the VM category, (3) VM size in terms of max core, memory, disk allocation, (4) the
minimum, average, and maximum VM resource utilization during the 5 min.

3 Methodology

This section describes the data preprocessing, data formatting, data analysis, and
description of the models used in the proposed methodology.

3.1 Data Preprocessing

The Azure dataset consists of 128 files. The first of these is “subscription.csv,” which
contains 3 attributes: subscription id, timestamp of the first virtual machine (VM)
created, and count of VMs created. The next file is “deployment.csv,” whose attributes
are deployment id and deployment size. These two files do not have much to offer
in terms of data analysis, so we have to explore further. The next file we have is
“vmtable.csv” (see Table 1), and it has 11 attributes, which are as follows: VM id,
subscription id, deployment id, timestamp VM created, timestamp VM deleted, max
CPU, avg CPU, p95 max CPU, VM category, VM virtual core count, and VM memory (GB).

Table 1 First 5 entries of vmtable.csv


Attributes VM id Subscription id Deployment id VM VM
created deleted
0 x/XsOfHO4o… VDU4C8cqdr… Pc2VLB8aDx… 0 2,591,700
1 H5CxmMoV… BSXOcywx8p… 3J17LcV4gXj… 0 1,539,300
2 wR/G1YUjp… VDU4C8cqdr… Pc2VLB8aDx… 2,188,800 2,591,700
3 1XiU + KpvIa… 8u + M3WcFp… DHbeI + pYTY… 0 2,591,700
4 z5i2HiSaz6Z… VDU4C8cqdr… Pc2VLB8aDx… 0 2,188,500
Max CPU Avg CPU p95max CPU VM VM core count VM memory
utilization utilization utilization Category
99.36987 3.424094 10.19431 Delay-insensitive 1 1.75
100 6.181784 33.98136 Interactive 1 0.75
99.56903 3.573635 7.92425 Delay-insensitive 1 1.75
99.40509 16.28761 95.69789 Delay-insensitive 8 56
98.96796 3.036038 9.445484 Delay-insensitive 1 1.75

Fig. 1 Correlation matrix for vm_table

Here, out of 2,013,767 VM ids, 780,488 VM ids are delay-insensitive, 60,682
VM ids are interactive, and the rest of the VM ids are of unknown category. Now, we
drop the VMs of unknown category and plot the correlation between the rest of the
VMs' attributes. In Fig. 1, we can see that max CPU, avg CPU, and p95 max CPU
utilization are highly positively correlated, but VM memory and VM core count have
a very weak correlation with the other attributes. According to [3], the
delay-insensitive VMs are much more predictable than the interactive VMs.
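A matrix like Fig. 1 is typically produced with a library call (e.g., pandas' `DataFrame.corr()`); the underlying pairwise Pearson coefficient can be sketched in pure Python on illustrative columns:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Illustrative columns, not real dataset values.
col_a = [1.0, 2.0, 3.0, 4.0, 5.0]
col_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly correlated with col_a
col_c = [5.0, 1.0, 4.0, 2.0, 3.0]    # weakly (negatively) related to col_a

assert abs(pearson(col_a, col_b) - 1.0) < 1e-9
assert abs(pearson(col_a, col_c) + 0.3) < 1e-9
```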
That was all the analysis for vm_table. Now, let us look into the remaining 125
files: the vm_cpu_readings files. They contain the VM CPU readings for every
5 min over a period of 30 days. The total number of entries in these 125 files is
1,246,539,221. The attributes are timestamp, VM id, min CPU, max CPU, and avg
CPU utilization.

3.2 Data Formatting

Let us first have a look at a vm_cpu_readings file, which is shown in Table 2.
Before the analysis, we have to bring the data into a proper format. The analysis of
the CPU utilization is independent of the VM id, so the VM id attribute is removed from
the dataset. The timestamps are converted into YYYY-MM-DD hh:mm:ss format.
Then, all the data in the 125 vm_cpu_readings files are grouped according to their
timestamp by summing up their min, max, and avg CPU utilization. The total
data are thus grouped into 8640 entries (288 five-minute intervals per day over 30 days).
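The grouping step can be sketched in pure Python; the records below are illustrative stand-ins for vm_cpu_readings rows:

```python
from collections import defaultdict

# Illustrative records: (timestamp, vm_id, min_cpu, max_cpu, avg_cpu).
readings = [
    ("2017-01-01 00:00:00", "vm1", 1.6, 8.8, 3.3),
    ("2017-01-01 00:00:00", "vm2", 2.4, 6.9, 4.3),
    ("2017-01-01 00:05:00", "vm1", 0.3, 2.0, 1.0),
]

# Sum min/max/avg CPU per timestamp; the VM id is dropped, as in the paper.
grouped = defaultdict(lambda: [0.0, 0.0, 0.0])
for ts, _vm, mn, mx, avg in readings:
    grouped[ts][0] += mn
    grouped[ts][1] += mx
    grouped[ts][2] += avg

total = grouped["2017-01-01 00:00:00"]
assert all(abs(a - b) < 1e-9 for a, b in zip(total, [4.0, 15.7, 7.6]))
```

On the full dataset, the same operation would be a pandas `groupby("timestamp").sum()` over the 125 files.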

Table 2 First 5 entries of vm_cpu_readings-file-1-of-125.csv


Timestamp VM id Min CPU Max CPU Avg CPU
utilization utilization utilization
0 2zrgeOqUDy + l0… 1.64695 8.794403 3.254472
0 /34Wh1Kq/qkN… 2.440088 6.941048 4.33624
0 2lzdXk1Rqn1ibH… 0.302992 2.046712 0.970692
0 0GrUQuLhCER5b… 1.515922 4.471657 2.438805
0 2I8OpI6bMkdzL… 0.148552 0.315007 0.264341

Table 3 Vm_cpu_readings after formatting

Timestamp            Min CPU utilization   Max CPU utilization   Avg CPU utilization
2017-01-01 0:00:00   715,146.5368          2,223,302.433         1,229,569.371
2017-01-01 0:05:00   700,473.8403          2,212,393.246         1,211,321.709
2017-01-01 0:10:00   705,953.5659          2,213,056.745         1,206,634.914
2017-01-01 0:15:00   688,383.0732          2,187,572.239         1,190,368.507

Fig. 2 Min, max, and avg CPU utilization

Here, Table 3 shows few examples of the records after the data are transformed
and grouped and Fig. 2 shows the min, max, and avg CPU utilization of all the entries.

3.3 Data Analysis

Now, the vm_cpu_readings data have been properly formatted. This is time
series/sequential data. As we know, recurrent neural networks (RNNs) models are
best known for analyzing sequential data. So, we are using 3 RNN variants for the
prediction model. These RNN variants are:
• Long short-term memory (LSTM)
• Gated recurrent unit (GRU)
• Independently recurrent neural network (IndRNN).
Our proposed technique for the analysis is shown in Fig. 3.
Initially, the dataset is divided into a training set and a test set (generally in the
80–20 ratio). Then, the look back value is set; the inputs and outputs to the model
are determined by this value. For example, if look back = i, then the first i entries
of the file form the first input and the entry that follows them is the first output;
subsequent input–output pairs are formed in the same way from the following entries.
After dividing the training and test sets into inputs and outputs, a model is chosen,
and its parameters are initialized. The optimizer and number of epochs are defined, and
the model is trained and validated on the training set. Then, the trained model is
tested on the test set and evaluated using various evaluation metrics. This process is
repeated for all the models.
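The look back windowing can be sketched as follows (a sliding-window reading of the scheme, assumed here rather than taken from the paper's code):

```python
def make_windows(series, look_back):
    """Split a sequence into (input window, next value) training pairs."""
    inputs, outputs = [], []
    for start in range(len(series) - look_back):
        inputs.append(series[start : start + look_back])
        outputs.append(series[start + look_back])
    return inputs, outputs

# Each value stands in for one 5-min aggregated CPU reading.
series = [10, 11, 13, 12, 15, 14]
X, y = make_windows(series, look_back=3)
assert X[0] == [10, 11, 13] and y[0] == 12
assert len(X) == len(series) - 3
```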

Fig. 3 Flowchart of the proposed methodology



Table 4 Model summaries for LSTM, GRU, and IndRNN


Model Layer (type) Output shape Parameters Total parameters
GRU GRU (None, 128) 50,688 51,075
Dense (None, 3) 387
LSTM LSTM (None, 128) 67,584 67,971
Dense (None, 3) 387
Ind-RNN IndRNN (None, 128) 640 1027
Dense (None, 3) 387

3.4 Model Summaries

The deep learning models used in our proposed methodology, i.e., LSTM, GRU, and
IndRNN are imported from the Keras library in Python. The optimizer is the Adam
optimizer, the loss function is the mean squared error for all the models, and they are
trained for 20 epochs each. Table 4 provides the summaries of the models.
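The parameter counts in Table 4 follow from the standard gate formulas; the sketch below reproduces them, assuming Keras' legacy GRU formulation (reset_after=False) and 3 input features (min, max, and avg CPU):

```python
def dense_params(inputs, units):
    return inputs * units + units                    # weights + biases

def lstm_params(inputs, units):
    return 4 * (units * (inputs + units) + units)    # 4 gates

def gru_params(inputs, units):
    return 3 * (units * (inputs + units) + units)    # 3 gates (reset_after=False)

features, units, outputs = 3, 128, 3
assert gru_params(features, units) == 50688    # matches Table 4
assert lstm_params(features, units) == 67584   # matches Table 4
assert dense_params(units, outputs) == 387     # matches Table 4
```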

4 Result

We have used 3 different evaluation metrics for the evaluation of models which are
as follows.

4.1 Root Mean Square Error (RMSE)

RMSE is a commonly used metric for regression models, and it is defined as the
square root of the mean squared difference between the target value and the predicted
value as shown in Eq. (1).

 
RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2}    (1)

4.2 Mean Absolute Error (MAE)

MAE is another popular metric used to evaluate regression models. It is defined as


the mean of the absolute difference between the target value and the predicted value.

The MAE handles outliers more gracefully and does not penalize errors as
severely as the mean squared error. The mathematical formula for MAE is shown in
Eq. (2).

MAE = \frac{1}{n}\sum_{j=1}^{n}\left|y_j - \hat{y}_j\right|    (2)

4.3 Mean Absolute Percentage Error (MAPE)

MAPE is the normalized MAE. It is used because it remains unaffected by the order of
magnitude of the data. The mathematical formula for MAPE is shown in Eq. (3).
MAPE = \frac{1}{n}\sum_{j=1}^{n}\left|\frac{y_j - \hat{y}_j}{y_j}\right|    (3)

Table 5 shows the descriptions of the notations used in Eqs. (1)–(3).
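Equations (1)–(3) translate directly into code; the sample values below are illustrative:

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error, Eq. (1)."""
    n = len(actual)
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error, Eq. (2)."""
    n = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / n

def mape(actual, predicted):
    """Mean absolute percentage error, Eq. (3)."""
    n = len(actual)
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / n

y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
assert abs(mae(y_true, y_pred) - 20 / 3) < 1e-9
assert abs(rmse(y_true, y_pred) - sqrt(200 / 3)) < 1e-9
assert abs(mape(y_true, y_pred) - 0.05) < 1e-9
```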


The following is a comparison of the training and test errors of all 3 models.
From Tables 6 and 7, we can understand that the GRU model performed better
than the other two models in both training and testing. So, we confirm GRU as the
final model. After finalizing it, we plotted the values predicted by the GRU model
for min, max, and avg CPU workload over the given min, max, and avg CPU workload
in the test dataset. In the plot, the Y-axis represents the workload/CPU utilization and
the X-axis represents the timestamp. Here, the X-axis is common to all the graphs, but
the Y-axis is separate for min, max, and avg CPU utilization because the difference
in their magnitudes is too large. As we can see in Fig. 4, the prediction model shows
a good fit over the data.

Table 5 Notations used in the formulae

Notation Description
n Total no. of elements
yj Actual output
ŷ j Predicted output

Table 6 Comparison of error in the training set

Model RMSE MAE MAPE
GRU 27,479.68 16,976.966797 0.010746
LSTM 30,041.72 17,444.191406 0.012133
Ind-RNN 28,190.72 17,636.75 0.012310

Table 7 Comparison of error in the test data

Model RMSE MAE MAPE
GRU 27,882.27 127.436381 0.010876
LSTM 28,992.30 139.526158 0.011556
Ind-RNN 32,850.68 150.247764 0.015510

Fig. 4 Prediction of min, max, and avg. CPU workload using GRU

5 Conclusion

This paper proposed a model for predicting dynamic VM provisioning that is
data-centric rather than mathematical. After testing all the models, it is concluded
that GRU delivered the best results and is also faster than LSTM and IndRNN. This
model can therefore be used by CSPs to optimize their resources for the benefit of
both cloud users and cloud providers.
The limitation of the model is that it performs poorly on small datasets. But service
providers constantly track their resource usage logs, which can be used to generate
large datasets on workload and CPU utilization. So, when bulk data on cloud server
utilization become available, this model can be modified and retrained accordingly to
provide an optimum result.

References

1. T. Deepa, D. Cheelu, A comparative study of static and dynamic load balancing algorithms in
cloud computing, in 2017 International Conference on Energy, Communication, Data Analytics
and Soft Computing (ICECDS) (2017), pp. 3375–3378
2. G.K. Shyam, S.S. Manvi, Virtual resource prediction in cloud environment: a Bayesian
approach. J. Netw. Comput. Appl. 65, 144–154 (2016). ISSN 1084-8045. https://doi.org/10.1016/j.jnca.2016.03.002
3. M. Hariharasubramanian, Improving application infrastructure provisioning using resource
usage predictions from cloud metric data analysis. https://doi.org/10.7282/t3-y8e4-5v69
4. R. Moreno-Vozmediano, R.S. Montero, E. Huedo et al., Efficient resource provisioning for
elastic cloud services based on machine learning techniques. J. Cloud. Comput. 8, 5 (2019).
https://doi.org/10.1186/s13677-019-0128-9
5. B. Sotomayor, R.S. Montero, I.M. Llorente, I. Foster, Resource leasing and the art of
suspending virtual machines, in 2009 11th IEEE International Conference on High
Performance Computing and Communications, 2009, pp. 59–68. https://doi.org/10.1109/HPCC.2009.17
6. B. Sotomayor, K. Keahey, I. Foster, Combining batch execution and leasing using virtual
machines, in Proceedings of the 17th International Symposium on High Performance
Distributed Computing. ACM: USA, 2008, pp. 87–96
7. C. Li, L.Y. Li, Optimal resource provisioning for cloud computing. J. Supercomput. 62(2),
989–1022 (2012)
8. E. Caron, F. Desprez, A. Muresan, Pattern matching based forecast of non-periodic repetitive
behavior for cloud clients. J. Grid Comput. 9(1), 49–64 (2011)
9. J. Chen, Y. Wang, A resource demand prediction method based on EEMD in cloud computing.
Procedia Comput. Sci. 131, 116–123 (2018)
10. https://www.gartner.com/en/documents/3982411. Accessed on 24 April 2021
11. Azure Public Dataset: https://github.com/Azure/AzurePublicDataset. Accessed on 25 April
2021
12. E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, R. Bianchini, Resource central:
understanding and predicting workloads for improved resource management in large cloud
platforms, in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP
'17). Association for Computing Machinery, New York, NY, USA, 2017, pp. 153–167.
https://doi.org/10.1145/3132747.3132772
Explainability of Deep Learning-Based
System in Health Care

Shakti Kinger and Vrushali Kulkarni

Abstract Ocular disease is an eye disease that reduces the eye’s ability to work
normally. Early ocular disease detection is important to avoid blindness caused by
some of the diseases like cataracts, glaucoma, diabetes, age-related macular degen-
eration (AMD), etc. Artificial intelligence (AI) techniques have been used to build
systems for the speedy diagnosis of such diseases. In recent years, the deep neural
network (DNN) has shown remarkable success in this area. But the black box nature
of such systems has created questions on the use of DNN in a high-risk system like
health care. Explainable AI (XAI) is a suite of methods and techniques that provides
explanations of predictions made by AI systems. This helps to achieve accountability,
transparency, and debugging of the model in the healthcare domain. In this paper, we
have proposed to develop an ocular disease classification model and an XAI method
that can be used to explain the classification of eye diseases from eye fundus images.

Keywords Ocular disease · Deep neural network (DNN) · Explainable AI (XAI) ·


Fundus images

1 Introduction

Ocular disease is an eye disease that affects the normal functioning of the eye
and reduces visibility. Ocular disease detected early is curable. For the prevention
of vision loss or blindness, regular eye checkups are mandatory. According to the World
Health Organization (WHO) [1], currently, 2.2 billion people in the world are facing
vision impairment, and of these, 1 billion people are facing a near or distance vision
impairment that could have been prevented. Eyesight reduction can have lifelong effects
on work, daily activities, and health status. Automatic detection of disease is critical

S. Kinger (B) · V. Kulkarni


School of Computer Engineering and Technology, Dr. Vishwanath Karad MITWPU, Pune,
Maharashtra, India
e-mail: shakti.kinger@mitwpu.edu.in
V. Kulkarni
e-mail: vrushali.kulkarni@mitwpu.edu.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 619
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_47

to preventing vision loss. Ocular surface disease is damage to the surface layers of the
eye, namely the cornea and conjunctiva. Ocular disorders can happen at any age. There
are dozens of ocular disorders; some may just be an infection, while others may be more
serious and lead to vision loss or even blindness. Some of the common ocular disorders
are refractive errors, cataracts, diabetic retinopathy, glaucoma, and myopia
(nearsightedness). Early and accurate detection of disease prevents blindness.
Hence, this has drawn the interest of several researchers to the field of automatic
detection of ocular pathologies. With recent advancements in deep learning
technologies, AI-based methods have provided high performance in disease detection
based on image classification. Machine learning-based models can hence prove very
effective in preventing blindness caused by cataracts, age-related macular
degeneration, diabetes, glaucoma, etc.
Machine learning (ML) has shown exponential growth in recent years. ML plays a key
role in applications such as social media (sentiment analysis, spam filtering),
transport (autonomous vehicles, air traffic control safety monitoring), financial
services (fraud detection, algorithmic trading, portfolio management), health care
(disease diagnosis, drug discovery, robotic surgery), eCommerce (product
recommendation, customer support, advertising), virtual assistance (intelligent
agents, natural language processing), and many more. Deep learning, a sub-field of
machine learning, is growing and replacing human beings in many places, and in some
it has outperformed them as well. Examples where deep learning has excelled in the
vision, text, and speech areas include light-duty autonomous vehicles approved to
operate in California, robots taking over Amazon warehouses, Google AI beating human
players at the strategy game StarCraft, AI helping doctors identify cancer cells, and
GANs generating realistic human faces.
The inherent structure of deep learning algorithms involves nested layers of
nonlinear neurons, yielding highly accurate and successful predictive models.
However, these models are black boxes: they lack transparency about the exact cause
of their predictions. Despite high accuracy, this limitation keeps the effectiveness
and adoption of these systems very low. The lack of transparency thus poses a major
hurdle for adoption in critical domains such as autonomous driving, medical
applications, the military, and law, to name a few. Explainable AI (XAI) [2] or
interpretable machine learning (IML) [3] programs comprise a suite of machine
learning methodologies that help develop more explainable models without sacrificing
properties such as high accuracy. Moreover, they help human users understand the
predictions, trust the results, and effectively deploy these systems for wider
adoption.
The explanation is a mechanism that helps verify decisions made by an AI model. For
example, an explainer for a cancer detection AI model using microscopic images
generates an explanation that maps pixels of the input image to the output
prediction. Similarly, an explainer for a speech recognition AI model identifies the
specific time slice of the power spectrum in the input audio that contributed most
toward the output

prediction. In the case of advanced algorithms in reinforcement learning, the
generated explanation provides the reasoning behind a particular decision of an agent
in its environment. A group of ML algorithms, for example decision trees, are said to
be inherently interpretable if they have an internal mechanism built in to explain
their decisions. This inherent property, however, adversely impacts accuracy, and
hence such algorithms perform worse than deep learning models.
This paper focuses on making AI models in health care, specifically eye disease
detection, more explainable so that users have more confidence and trust in accepting
the results produced by these models. In the absence of such explanations, adoption
by doctors will be a challenging task.
To drive wider adoption of AI, explainable AI primarily focuses on three important
aspects: (1) transparency, (2) trust, and (3) bias/fairness of the AI algorithm.
In this paper, we propose to use the most recent XAI techniques to understand the
factors behind predictions and to help the stakeholders in health care gain the
following benefits:
• Improved transparency and trust: XAI methods provide visibility into the
decision-making mechanism of an AI system, increasing its transparency and the
stakeholders’ trust [4].
• Model understanding: XAI methods generate explanations that can be used to
identify the factors affecting the outcome of the AI system and hence understand
the internal working of a model [5].
• Model improvement: AI systems learn features from large datasets before being
deployed into production for prediction. A model might learn from unwanted
features and hence generate erroneous predictions. XAI explanations can help
understand the learned features, identify and rectify errors, and improve models
for better prediction [4].
The rest of this paper is organized as follows: Sect. 2 presents an overview of XAI
and related work in ocular disease detection using DL methods; Sect. 3 presents the
proposed approach for deep learning-based ocular disease classification with
XAI-based visualization of the classification; and the conclusion is presented in
Sect. 4.

2 Related Work

Many researchers have developed an interest in analyzing the learning capabilities of
AI models and in explaining the reasons behind their failures in some scenarios. In
this section, we discuss some of this work on XAI and on the application of AI to
ocular disease detection.

2.1 Foundation and Techniques of XAI

In this section, we cover some basic terminology used in XAI, some properties of XAI
algorithms, and the business benefits of XAI.

2.1.1 Terms and Definitions

Very often, interpretability and explainability are used interchangeably; however, it
is important to understand the difference between the two [6–9]:
• Interpretability: the extent to which the machine learning model lets us determine
cause and effect. For example, does an increase in stress reduce a person’s life
span? Does a high-profile career create a lifetime risk for a person?
• Explainability: a mechanism by which the machine learning model explains the
output of a black box. Example: suppose a neural network model is developed to
predict a person’s death year from features such as age, BMI score, and career
category. An explainable model quantifies the contribution of each feature toward
the prediction, e.g., the career category is about 40% important, the age is 35%
important, and the BMI score is 25% important.
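The death-year example above can be mimicked with a small permutation-importance sketch. The black-box weights below (and hence the resulting importance shares) are illustrative assumptions, not figures from any real model:

```python
import numpy as np

# Toy "black box" for the death-year example: career category matters most,
# then age, then BMI score (weights chosen purely for illustration).
def black_box(X):
    career, age, bmi = X[:, 0], X[:, 1], X[:, 2]
    return 90.0 - 8.0 * career - 7.0 * age - 5.0 * bmi

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
baseline = black_box(X)

# Permutation importance: shuffle one feature at a time and measure how much
# the predictions change; a larger change means a more important feature.
importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(np.mean((black_box(Xp) - baseline) ** 2))

shares = 100.0 * np.array(importance) / sum(importance)
for name, s in zip(["career category", "age", "BMI score"], shares):
    print(f"{name}: about {s:.0f}% of the prediction change")
```

The printed shares form exactly the kind of feature-contribution statement described in the explainability definition above.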

2.1.2 Model Interpretation Methods

Interpretation methods vary based on how explainer models work [7].


• Intrinsic or post-hoc: Intrinsically interpretable models have an in-built
explainer, e.g., linear, tree-based, or parametric models. Post-hoc methods are
interpretability methods applied after training.
• Model-specific or model-agnostic: Model-specific methods work only for specific
models, while model-agnostic methods work for any machine learning model.
• Local or global: Local methods explain a single prediction, while global methods
explain the entire model behavior.
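As a minimal illustration of the intrinsic case, a depth-1 decision rule carries its own explanation; the feature index and threshold below are hypothetical:

```python
# An intrinsically interpretable model, minimal sketch: the decision rule
# itself can be read straight off the model, no post-hoc explainer needed.
class Stump:
    def __init__(self, feature, threshold):
        self.feature, self.threshold = feature, threshold

    def predict(self, x):
        # Predict class 1 when the chosen feature exceeds the threshold.
        return int(x[self.feature] > self.threshold)

    def explain(self):
        # The explanation is just the rule, stated in full.
        return f"predict 1 iff feature[{self.feature}] > {self.threshold}"

model = Stump(feature=2, threshold=0.5)
print(model.predict([0.1, 0.9, 0.7]))  # 1
print(model.explain())
```

A deep network offers no such readable rule, which is why the post-hoc, model-agnostic methods discussed later are needed.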

2.1.3 Properties of Explanation

The effectiveness of an XAI model is measured based on the properties listed in
Table 1. These properties help not only to evaluate effectiveness but also to compare
the results of different XAI methods.

Table 1 Properties of explanation

Accuracy: the extent to which the model accurately predicts unseen instances;
accuracy score and F1-score are measures of accuracy
Fidelity: the extent to which an explanation approximates the output prediction made
by a black box model
Consistency: the extent to which models trained on the same prediction task produce
similar explanations
Stability: the extent to which explanations for similar instances are similar
Comprehensibility: the extent to which humans understand the explanation
Certainty: the confidence of the machine learning model
Degree of importance: the reflection of the importance of features or parts of the
explanation

2.1.4 Business Benefits of XAI

Below are some of the most important reasons why a business should adopt XAI for its
AI-based solutions.
• Model performance: Model or dataset bias is easier to understand when we know how
the model works and arrives at a particular decision.
• Decision making: Stakeholders can devise better sales strategies when the
reasoning behind the model’s predictions is known.
• Control: XAI can help ensure that data cannot be used for analysis without
permission.
• Trust: XAI can help build trust by providing interpretable models.
• Ethics: XAI helps identify biases in the model and hence makes it ethical for
adoption in production.
• Accountability: XAI clarifies who is accountable for an AI system’s decisions.
• Regulations: XAI ensures governance, accuracy, transparency, and explainability
are high for the AI model.

2.2 DL-Based Methods for Ocular Disease Detection

The authors in [8] used a CNN model to classify hard exudates at the central pixel of
image patches. In another work [9], researchers used an ImageNet pre-trained DCNN
model to detect AMD; during preprocessing, images are cropped and resized, then used
to classify early/intermediate versus intermediate/advanced stages. In [10], the
authors used a 24-layer CNN with batch normalization and max-pooling to classify the
disease. In [11], an 18-layer CNN model is proposed for glaucoma detection. Tan et
al. [12] have proposed a diagnosis of AMD

at an early stage using a fourteen-layer deep CNN model. Diabetic retinopathy lesions
(exudates, hemorrhages, microaneurysms) can be segmented using a ten-layer CNN, as
shown in the work by Tan et al. [13]. Kwasigroch et al. [14] proposed a method that
classifies five grades of diabetic retinopathy (DR) using a DL network. Chai et al.
[15] detect glaucoma using several CNN models: one fed fundus images directly, a
second that obtains the optic disk region using Faster-RCNN, and a third that
segments the disk area, cup area, and parapapillary atrophy (PPA) area. Dai et al.
[16] identified microaneurysm candidates through automatic image-to-text mapping.
Grassmann et al. [17] proposed a method covering nine classes of AMD disease, in
which a random forest algorithm is trained on the results provided by CNNs.
Khojasteh et al. [18] detect exudates using pre-trained residual networks
(ResNet-50). Jain et al. [19] use a DL method to classify healthy and diseased
retinal fundus images, tested on two datasets, one from Friedrich-Alexander
University and the other from a local hospital in Bangalore, with accuracies of
96.5–99.7%.
In most of the research work, DL is used to classify ocular disease without providing
any explanation of which part or feature of the eye image contributed to a specific
classification. We propose a model for ocular disease classification and explain its
results by applying an XAI method.

3 Proposed Approach: XAI-Based System for Ocular Disease Classification

In this section, we describe the details of the CNN architecture used for ocular
disease classification. We have used eye images from the ocular disease recognition
dataset [20] to train the model. We also share the complete procedure for building
the solution, which is divided into three major steps discussed in subsequent
sections: dataset preparation, training and evaluating the trained model, and
analysis of the results. Figure 1 shows the overall system: the labeled ocular
disease recognition dataset is first processed for resizing, and then images are
augmented. The pre-trained model is then trained using this dataset, and finally, the
results are analyzed using the explainable method. The explanations provided by the
XAI method are used for validation with a subject matter expert. If the explanations
indicate erroneous model predictions, appropriate steps are taken to improve the
model.

3.1 Dataset Preparation

For our experiments, we use the ocular disease intelligent recognition (ODIR) dataset
[20], which contains color fundus photographs of the left and right eyes of
approximately 6000 patients. The data also contains metadata about the age of the patient

Fig. 1 Architecture for ocular disease identification

and diagnostic keywords from doctors. It is one of the biggest datasets in this area,
collected by Shanggong Medical Technology Co. Ltd. from various hospitals and medical
centers in China, and contains real patient information. Since the data is collected
by different medical institutions using different vendor cameras, the image size and
resolution vary considerably. Under the supervision of trained human readers, the
images are annotated into eight classes: normal (N), diabetes (D), glaucoma (G),
cataract (C), AMD (A), hypertension (H), myopia (M), and other
diseases/abnormalities (O). Figure 2 shows the distribution of data across the
classes.
After exploring the dataset, we observed the following challenges in the ODIR
dataset:
• The data is highly unbalanced: a large proportion of the dataset consists of
normal (N) and diabetes (D) images, while images of the other diseases are few.
• Multi-label disease images are also part of this dataset; the other
diseases/abnormalities (O) class contains images of several different eye diseases.
• The dataset contains high-resolution images, around 2976 × 2976 or 2592 × 1728
pixels, which lengthen the model’s training time.
The data preprocessing steps required to create a valid dataset are:
• Random zoom, random rotation, left–right flip, top–bottom flip, etc. are applied
to the original images, as shown in Fig. 3, to reduce the class imbalance for
certain classes.
• Labeling is done by renaming image files based on the diagnostic keywords.
• To reduce training time, images are resized to 250 × 250 pixels.
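The steps above can be sketched with NumPy alone. The nearest-neighbour resize stands in for OpenCV's `cv2.resize`, the flips and 90-degree rotations stand in for the paper's zoom/rotation augmentation, and the random array is a placeholder for a real fundus image:

```python
import numpy as np

rng = np.random.default_rng(42)

def resize_nearest(img, size=(250, 250)):
    """Nearest-neighbour downscale to 250x250 to cut training time
    (a stand-in for cv2.resize, which the paper actually uses)."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]

def augment(img):
    """Random flips and 90-degree rotations to counter class imbalance
    (a simplified stand-in for the random zoom/rotation in the paper)."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # flip left-right
    if rng.random() < 0.5:
        img = np.flipud(img)                   # flip top-bottom
    img = np.rot90(img, k=rng.integers(0, 4))  # random 90-degree rotation
    return img

# Placeholder for a high-resolution 2976 x 2976 RGB fundus image.
fundus = rng.integers(0, 256, size=(2976, 2976, 3), dtype=np.uint8)
sample = augment(resize_nearest(fundus))
print(sample.shape)  # (250, 250, 3)
```

Each rare-class image would be passed through `augment` several times to produce the additional balanced samples shown in Fig. 3.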

Fig. 2 ODIR dataset class distribution

Fig. 3 Augmentation applied to create balanced data

3.2 Model Training, Evaluation Results, and Discussion

Deep learning-based approaches like the convolutional neural network (CNN) have shown
the highest accuracy. A CNN model consists of many layers that help it train on large
datasets and learn their features correctly. In recent years, many researchers have
used CNNs in all areas of classification, including health care. A CNN learns to
extract features automatically during the training phase. In this work, we have used
the CNN architecture (Fig. 4) with an input layer that takes 250 × 250 RGB images.
The first two 2D convolutional layers extract features from the input images, each
followed by the ReLU activation function. Max-pooling layers are added to reduce the
spatial size of the input representation, and two dropout layers are added to avoid
overfitting. Finally, a dense layer of size 8 is added to map each class of the data.

Fig. 4 DNN model summary for ocular disease detection

The model is trained for 20 epochs with a learning rate of 0.00001 and a batch size
of 32. The CNN model is trained and tested in Python with the Keras deep learning
framework. All images are resized to a specific size using OpenCV before training.
Evaluation metrics including accuracy, precision, recall, and F1-score are used to
check the performance of the model and are shown in Table 2. Figure 5 shows the
training and validation loss and accuracy with respect to the number of epochs. We
validate the predictions made by this model using XAI methods.
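A minimal Keras sketch of the architecture and training configuration described above; the filter counts (32 and 64) and dropout rates are illustrative assumptions, since the paper does not state them:

```python
# Sketch of the described CNN, assuming TensorFlow/Keras is installed.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(250, 250, 3)),         # 250 x 250 RGB fundus images
    layers.Conv2D(32, 3, activation="relu"),   # first 2D conv layer + ReLU
    layers.Conv2D(64, 3, activation="relu"),   # second 2D conv layer + ReLU
    layers.MaxPooling2D(),                     # reduce spatial size
    layers.Dropout(0.25),                      # first dropout against overfitting
    layers.Flatten(),
    layers.Dropout(0.25),                      # second dropout
    layers.Dense(8, activation="softmax"),     # one output per disease class
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.00001),  # lr from the text
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# Training would then use the stated settings, e.g.:
# model.fit(train_images, train_labels, epochs=20, batch_size=32)
print(model.output_shape)  # (None, 8)
```

The dense layer of size 8 maps each input to one of the eight ODIR classes listed earlier.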

Table 2 Performance analysis of the proposed architecture


Accuracy Precision Recall F1 Score Loss
0.9381 0.9381 0.9381 0.9381 0.19

Fig. 5 Training and validation loss and accuracy versus number of epochs
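One likely reason the accuracy, precision, recall, and F1-score in Table 2 coincide is that, for a single-label multi-class problem, micro-averaged precision and recall both reduce to overall accuracy. A small sketch with hypothetical labels:

```python
import numpy as np

# Hypothetical true and predicted labels for an 8-class problem.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 0, 1])
y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 0, 0, 1])

n_classes = 8
cm = np.zeros((n_classes, n_classes), dtype=int)  # confusion matrix
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tp = np.trace(cm)                 # correct predictions on the diagonal
accuracy = tp / cm.sum()
# Micro-averaging: sum TP over classes / sum (TP + FP) = sum TP / N = accuracy,
# and likewise for recall, so micro P, R, and F1 all equal the accuracy.
micro_precision = tp / cm.sum()
micro_recall = tp / cm.sum()
f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
print(accuracy, micro_precision, micro_recall, f1)  # all 0.9
```

With macro-averaging instead, the four scores would generally differ, especially on an unbalanced dataset like ODIR.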

3.3 Interpretability Method to Explain Deep Learning Model

Some of the most commonly used model-agnostic post-hoc methods to explain local and
global predictions of any DNN model are Local Interpretable Model-Agnostic
Explanations (LIME) [21], Anchors: High-Precision Model-Agnostic Explanations [22],
and Shapley Additive Explanations (SHAP) [23]. These are perturbation-based XAI
techniques, while Saliency Maps [24], Gradient-weighted Class Activation Mapping
(Grad-CAM) [25], Grad-CAM++ [26], and Layer-wise Relevance Propagation (LRP) [27]
are gradient-based explainability methods. Figure 6 shows comparative results from
some of these (gradient-based and perturbation-based) methods when applied to the
ImageNet dataset [28].
We have used LIME to explain the classifications of the model. LIME highlights the
areas that contribute toward the classification. The model’s predictions and the
image data are given as input to the XAI method to generate visual explanations.
These explanations can be analyzed with the help of an eye specialist; this analysis
helps validate the predictions.

Fig. 6 Comparative results of saliency-based (e.g., GradCAM) and perturbation-based
methods (e.g., SHAP, LIME) [28]

3.3.1 Working of LIME (Local Interpretable Model-Agnostic Explanations) Method to
Visualize Explanations

LIME is a model-agnostic post-hoc method that works on text, image, and tabular data.
It perturbs the test observation to create a local linear model. The output of LIME
is feature importance: a list of explanations reflecting the contribution of each
feature to the prediction for a data sample. The model’s overall decision boundary is
complex, but in the neighborhood of a single decision, the boundary is simple. A
single decision can therefore be explained by probing the black box around the given
instance and learning a local surrogate.
Snippet 1 shows the high-level LIME algorithm, and Fig. 7 depicts the implementation
of LIME on the image classification model. The algorithm creates random segments of
the input image using a segmentation algorithm (e.g., quickshift). The image is then
perturbed randomly by masking different segments. Each perturbed image is fed to the
original black box model to obtain a class prediction, and the cosine similarity
between the original and perturbed images is computed. Using these neighboring
samples and their outcomes, a linear regression model is fitted, which is then used
to predict the local outcome; the original image is masked with the corresponding
segments to generate the required explanation.

Fig. 7 LIME algorithm for image data

Code Snippet 1.1. LIME algorithm
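The algorithm can be sketched in simplified form: a regular grid stands in for quickshift segmentation, a kept-fraction kernel stands in for the cosine-similarity weighting, and the black box is a toy scoring function rather than the trained CNN:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(images):
    """Stand-in black box: the 'disease' score is the mean brightness of the
    top-left quadrant, so that region should dominate the explanation."""
    return images[:, :16, :16].mean(axis=(1, 2))

def lime_image(image, model, grid=4, n_samples=200, kernel_width=0.25):
    """Simplified LIME for one image: mask random grid segments, weight each
    sample by similarity to the original, fit a weighted linear surrogate, and
    return one importance score per segment."""
    h, w = image.shape
    seg_h, seg_w = h // grid, w // grid
    n_seg = grid * grid
    masks = rng.integers(0, 2, size=(n_samples, n_seg))
    masks[0] = 1  # keep the unperturbed image in the sample set
    perturbed = np.empty((n_samples, h, w))
    for i, m in enumerate(masks):
        img = image.copy()
        for s in range(n_seg):
            if m[s] == 0:  # gray out (mask) segment s
                r, c = divmod(s, grid)
                img[r * seg_h:(r + 1) * seg_h, c * seg_w:(c + 1) * seg_w] = 0.0
        perturbed[i] = img
    preds = model(perturbed)
    # Similarity weight: samples that keep more segments are closer neighbors.
    weights = np.exp(-((1 - masks.mean(axis=1)) ** 2) / kernel_width ** 2)
    W = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(masks * W, preds * W[:, 0], rcond=None)
    return coef  # surrogate coefficients = per-segment importance

image = rng.random((32, 32))
scores = lime_image(image, toy_model)
print("most important segment:", int(np.argmax(scores)))
```

The highest-scoring segments correspond to the green regions in Fig. 8; in practice, the `lime` library performs these steps with quickshift superpixels and the trained classifier in place of the toy pieces above.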

Figure 8 shows images of different eye defects, as discussed earlier, with the
explanations returned by the LIME explainer. The areas highlighted in green
contribute toward the model prediction, whereas the areas highlighted in red are the
regions that contributed least toward the model’s outcome. These explanations answer
why the model has classified a retina image into a specific class and hence help gain
trust in the model’s predictions. It is observed that the explanations highlight
different portions of the retina image as contributing to the prediction and, in some
cases, even highlight portions lying outside the retina. These explanations need to
be verified with a subject matter expert; such verification will help carry out
improvements in the model.

Fig. 8 Different class predictions (rows 1 and 3) by the model and their
corresponding LIME explanations (rows 2 and 4). (Green portions indicate areas
contributing toward the prediction.)

4 Conclusion and Future Work

Stricter regulations such as the European General Data Protection Regulation (GDPR)
mandate that results produced by AI systems be transparent, providing explanations
for their predictions. In the healthcare domain, where the risk of erroneous
prediction is high, it is extremely important to deploy XAI methods to make AI
systems transparent and traceable. XAI methods need continuous development to enable
better improvements in AI-based systems.
Through this work, we have introduced XAI and the basic terminology used in XAI.
Prior work on ocular disease detection has been limited to the development of DL
models and their evaluation based on prediction accuracy. In our work, we have
applied an XAI method to the ocular disease identification AI model to explain the

reasoning behind predictions made by the system and analyzed the results obtained
from the XAI method.
Future work involves applying more explainers to our model and comparing their
explanations. Another area to explore is using different CNN architectures, comparing
their prediction results on the ocular data, and comparing those results using one of
the explainers.

References

1. WHO: World report on vision (Oct 2019). https://www.who.int/publications-detail/world-report-on-vision. Accessed on 20 April 2021
2. D. Gunning, D. Aha, DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Magazine 40(2), 44–58 (2019). https://doi.org/10.1609/aimag.v40i2.2850
3. Z.C. Lipton, The Mythos of Model Interpretability (2018). https://doi.org/10.1145/3236386.
3241340
4. Doshi-Velez, F., Kim, B.: Towards A Rigorous Science of Interpretable Machine Learning
pp. 1–13 (2017)
5. H. Lakkaraju, S.H. Bach, J. Leskovec, Interpretable decision sets: A joint framework for
description and prediction. Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining 13, 1675–1684 (2016)
6. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations:
An overview of interpretability of machine learning. In: Proceedings of the 2018 IEEE 5th
International Conference on Data Science and Advanced Analytics (DSAA). pp. 80–89
7. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A Survey of
Methods for Explaining Black Box Models. ACM Computing Surveys 51(5), 1–42 (2019).
https://doi.org/10.1145/3236009
8. Benzamin, A., Chakraborty, C.: Detection of Hard Exudates in Retinal Fundus Images Using
Deep Learning. IEEE International Conference on System, Computation, Automation and
Networking (ICSCA), pp. 1–5 (2018)
9. Burlina, P., Freund, D.E., Joshi, N., Wolfson, Y., Bressler, N.M.: Detection of Age-related
Macular Degeneration via Deep Learning. In: IEEE 13th International Symposium on
Biomedical Imaging (ISBI), pp. 184–188 (2016). https://doi.org/10.1109/ISBI.2016.
7493240
10. Mitra, A., Banerjee, P.S., Roy, S., Roy, S., Setua, S.K.: The region of interest localization for
glaucoma analysis from retinal fundus image using deep learning. Computer Methods and
Programs in Biomedicine 165, 25–35 (2018). https://doi.org/10.1016/j.cmpb.2018.08.003
11. U. Raghavendra, H. Fujita, S. Bhandary, A. Gudigar, J.H. Tan, R. Acharya, Deep Convolution
Neural Network for Accurate Diagnosis of Glaucoma Using Digital Fundus Images. Inf. Sci.
441, 41–49 (2018). https://doi.org/10.1016/j.ins.2018.01.051
12. J.H. Tan, S.V. Bhandary, S. Sivaprasad, Y. Hagiwara, A. Bagchi, U. Raghavendra, A.K. Rao, B.
Raju, N.S. Shetty, A. Gertych, K.C. Chua, U.R. Acharya, Age-related Macular Degeneration
detection using deep convolutional neural network. Future Generation Computer Systems 87,
127–135 (2018). https://doi.org/10.1016/j.future.2018.05.001
13. J.H. Tan, H. Fujita, S. Sivaprasad, S. Bhandary, A.K. Rao, K.C. Chua, U.R. Acharya, Automated
Segmentation of Exudates, Haemorrhages, Microaneurysms using Single Convolutional
Neural Network. Inf. Sci. 420, 66–76 (2017). https://doi.org/10.1016/j.ins.2017.08.050
14. Kwasigroch, A., Jarzembinski, B., Grochowski, M.: Deep CNN based decision support system
for detection and assessing the stage of diabetic retinopathy. International Interdisciplinary
PhD Workshop, pp. 111–116 (2018)

15. Y. Chai, H. Liu, J. Xu, Glaucoma Diagnosis Based on Both Hidden Features and Domain
Knowledge through Deep Learning Models. Knowl.-Based Syst. 161, 147–156 (2018). https://
doi.org/10.1016/j.knosys.2018.07.043
16. Dai, L., Fang, R., Li, H., Hou, X., Sheng, B., Wu, Q., Jia, W.: Clinical Report Guided Retinal
Microaneurysm Detection With Multi-Sieving Deep Learning. IEEE Transactions on Medical
Imaging 37(5), 1149–1161 (2018). https://doi.org/10.1109/tmi.2018.2794988
17. Grassmann, F., Mengelkamp, J., Brandl, C., Harsch, S., Zimmermann, M.E., Linkohr, B.,
Peters, A., Heid, I.M., Palm, C., Weber, B.H.: A Deep Learning Algorithm for Prediction of
Age-Related Eye Disease Study Severity Scale for Age-Related Macular Degeneration from
Color Fundus Photography (2018). https://doi.org/10.1016/j.ophtha.2018.02.037
18. P. Khojasteh, L.A.P. Júnior, T. Carvalho, E. Rezende, B. Aliahmad, J.P. Papa, D.K. Kumar,
Exudate detection in fundus images using deeply-learnable features. Comput. Biol. Med. 104,
62–69 (2019). https://doi.org/10.1016/j.compbiomed.2018.10.031
19. Jain, L., Murthy, H.V.S., Patel, C., Bansal, D.: Retinal Eye Disease Detection Using Deep
Learning. In: 2018 Fourteenth International Conference on Information Processing
(ICINPRO), pp. 1–6 (2018). https://doi.org/10.1109/ICINPRO43533.2018.9096838
20. Larxel: Ocular Disease Recognition (2020). https://www.kaggle.com/andrewmvd/ocular-disease-recognition-odir5k. Accessed on 20 April 2021
21. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why Should I Trust You?”: Explaining the Predictions
of Any Classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD ’16), pp. 1135–1144. ACM Press (2016)
22. Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: High-Precision Model-Agnostic Explanations.
Proceedings of the AAAI Conference on Artificial Intelligence 32(1) (2018)
23. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. Advances in
Neural Information Processing Systems pp. 4765–4774 (2017)
24. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps. CoRR abs/1312.6034 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1312.html#SimonyanVZ13
25. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM:
Visual Explanations from Deep Networks via Gradient-Based Localization. In: Proceedings of
the IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017). https://
doi.org/10.1109/ICCV.2017.74
26. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: Generalized
Gradient-Based Visual Explanations for Deep Convolutional Networks. Proceedings of
the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 12–15
(2018)
27. S. Bach, A. Binder, G. Montavon, F. Klauschen, K.R. Müller, W. Samek, On Pixel-Wise
Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS
ONE 10(7), e0130140–e0130140 (2015). https://doi.org/10.1371/journal.pone.0130140
28. Das, A., Rad, P.: Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A
Survey. ArXiv abs/2006.11371 (2020)
A Hybrid MSVM COVID-19 Image
Classification Enhanced with Swarm
Feature Optimization

Bhupinder Singh and Ritu Agarwal

Abstract COVID-19 (novel coronavirus disease) is a serious illness that has killed
millions of people and affected many more around the world. Technologies that enable
rapid and accurate detection of COVID-19 therefore provide much assistance to
healthcare practitioners. A machine learning-based approach is used here for the
identification of COVID-19. In general, artificial intelligence (AI) approaches have
yielded positive outcomes in healthcare visual processing and analysis. Chest X-ray
(CXR) imaging plays a significant role in the analysis of corona disease. In this
research article, a median filter is first used to reduce noise in the image. Edge
detection is an essential step in the COVID-19 detection process, and the Canny edge
detector is implemented to detect edges in the CXR images. The principal component
analysis (PCA) method is implemented for the feature extraction phase, and the
essential features among those extracted are optimized using the particle swarm
optimization (PSO) technique. For the recognition of COVID-19 from CXR images, a
hybrid multi-class support vector machine technique is implemented. The proposed
system achieved an accuracy of 97.51%, specificity (SP) of 97.49%, and sensitivity
(SN) of 98.0%.

Keywords COVID-19 disease · Chest X-ray (CXR) images · Principal component analysis (PCA) · Hybridization using multiclass support vector machine · Particle swarm optimization

1 Introduction

Due to the outbreak of an unknown illness in China in late 2019, numerous persons
linked to a regional market fell ill. The illness was initially unidentified;
however, experts identified its indications as being comparable with

B. Singh (B) · R. Agarwal


Delhi Technological University, New Delhi 110042, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 635
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_48

Fig. 1 Chest X-ray (CXR) images [3]

coronavirus illness and influenza. The precise cause of the COVID-19 infectious
disease was undetermined at first; however, after scientific inspection and
evaluation of significant samples using a legitimate PCR test, the latent infection
was identified and termed “COVID-19” on the World Health Organization’s (WHO)
suggestion [1]. The COVID-19 outbreak spread quickly across international borders,
wreaking havoc on patients’ quality of life, economies, and well-being worldwide [2].
According to Worldometers data recorded in mid-July 2021, upwards of 86 million
people across the world had been infected with COVID and more than 1,870,000
individuals had died as a result of the illness. CXR images are utilized to provide
internal information about body parts: X-ray images are produced by electromagnetic
radiation and show the internal parts of the body [3]. COVID-19 can also be detected
through CXR images, which provide an internal view of the chest from which the virus
can be detected. An example of a chest X-ray image is shown in Fig. 1.
The WHO declared coronavirus disease 2019 a pandemic on March 11, 2020, owing
to its rapid and widespread transmission [4]. The first case was reported in Wuhan
City, China, on 31st December 2019. The COVID-19 epidemiologic agent was
identified and isolated as a new coronavirus, which was initially given the name
2019-nCoV [5]. The disease genome was ultimately sequenced, and the International
Committee on Taxonomy of Viruses dubbed it severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2), since it was genetically close to the coronavirus that
caused the SARS outbreak in 2003.

1.1 Sign and Symptoms

COVID-19 has a variety of effects on different persons. The majority of affected
people experience mild to moderate symptoms and overcome them without any need
for hospitalization. The various types of signs and symptoms are mentioned in Table
1.
A Hybrid MSVM COVID-19 Image Classification Enhanced … 637

Table 1 Types of signs and symptoms of COVID-19 [6]

S. No. | Type of signs | Symptoms
1 | Common signs | Dry cough, exhaustion, high fever
2 | Highly risky signs | Chest pain, breath shortness, loss of movement
3 | Less common signs | Sore throat, headache, skin rashes, body pain, diarrhea, loss of taste and smell

1.2 Types of COVID-19 Tests

In this section, we describe three different types of COVID-19 tests: the polymerase
chain reaction (PCR) test, the COVID-19 antigen test, and the COVID-19 antibody
test.

1.2.1 Polymerase Chain Reaction (PCR) Test

PCR testing is used to examine for the presence of the virus's genetic material (RNA)
in the organism, which can be detected before antibodies are formed or signs of the
disease appear. This means that the test can detect whether or not someone is infected
with the virus at an early stage of sickness [7].

1.2.2 COVID-19 Antigen Tests

Any extraneous material in the body that causes an antibody reaction is referred to as
an antigen; a COVID-19 antigen test [8] aids in the identification of antigens associated
with the COVID-19 disease. Antigen testing, often referred to as antigen detection
testing, is a type of diagnostic test that provides results more quickly than molecular
assays. However, there is a disadvantage: antigen testing is more likely to miss an
active infection.

1.2.3 COVID-19 Antibody Test

The antibody test, also referred to as a serology test or immunological check, is a
blood test that looks for antibodies against the coronavirus. It determines whether
you have ever been exposed to the virus which generates COVID-19. The antibody
test assesses whether your immune system has responded to the illness rather than
looking for the live virus.
In everyday life, due to COVID-19, people have lost their lives, and the expense
of detecting this disease is very high in the context of countries, patients, and
states. Various health issues occur in the human body due to COVID-19, such as

low breathing, lung infection, cough, and cold. There are various techniques available
for the detection of COVID-19, but some methods still have problems such
as (i) imbalanced data and classification issues, (ii) multiclass or multi-label and
hierarchical classification issues, and (iii) issues in feature flattening and a high false
negative rate. A deep learning-based model is designed in [1] based on
a convolutional neural network (CNN), known as decompose, transfer, and compose
(DeTraC), that is utilized to categorize COVID-19 CXR images. Class boundaries
are discovered with the help of class decomposition, which handles non-identical
datasets. DeTraC was established by utilizing several pre-trained CNN
models, among which the highest accuracy rate was achieved by VGG19. The
outcomes proved the ability of DeTraC to recognize COVID-19 test cases from a
descriptive image dataset gathered from various health centers globally. The defined
model (DeTraC) obtained a maximum accuracy of 93% and 100% sensitivity in
identifying COVID-19 chest X-ray (CXR) images from normal cases and respiratory
lung problem cases.
This paper is organized as follows: Sect. 2 includes a detailed survey of COVID-19
disease detection using existing methods. The research methodology is explained in
Sect. 3. Section 4 discusses the dataset description, simulation result analysis, and
result discussion. The conclusion and future work are given in Sect. 5.

2 Literature Review

Abbas et al. [9] designed a deep learning-based model based on a convolutional
neural network (CNN), known as DeTraC, that was utilized to categorize the
COVID-19 CXR images. Class boundaries were discovered with the help
of class decomposition, which handles non-identical datasets. DeTraC was established
by utilizing several pre-trained CNN models, among which the highest accuracy rate
was achieved by VGG19. The outcomes proved the ability of DeTraC to recognize
COVID-19 test cases from a descriptive image dataset gathered from various
health centers globally. The defined model (DeTraC) obtained a maximum accuracy of
93% and 100% sensitivity in identifying COVID-19 chest X-ray (CXR) images
from normal cases and respiratory lung problem cases. Raajan et al. [10] suggested
a strategy that used the ResNet architecture and a convolutional neural network for
training the pictures provided by CT scans to efficiently diagnose coronavirus-affected
individuals. The infected person was correctly determined by comparing the
training and testing files. On the public computed tomography dataset, the
accuracy and specificity were 95.09% and 81.89%, respectively. The results were
taken alone, without incorporating other statistics such as specific regions or
population density. Based on the findings, it was clear that the proposed approach
could accurately classify corona-positive patients. To classify COVID-19
chest X-ray frames, Zebin et al. [11] used a transfer learning approach. The
researchers employed two openly searchable chest X-ray datasets. The classification
method distinguishes between infections in the lungs caused by corona and

pneumonia and those caused by inflammation. The implementation used ResNet50,
VGG16, and EfficientNetB0. The pre-trained feature extractors worked as the
backbone of the convolutional network. The model achieved a detection accuracy of
90% with VGG16, 94.3% with ResNet50, and 96.8% with the EfficientNetB0
pre-trained feature extractor. For the identification of the corona minority
class, a generative adversarial network was trained. Pereira et al. [12] proposed a
classification method that took into account multi-class and hierarchical classifications.
It was also demonstrated how to re-balance the class distribution using a re-sampling
procedure. To make use of the models' strengths, a fusion method using the
base classifiers and several texture descriptors was implemented. Various X-ray
images of pneumonia and healthy lungs were used to train the model. A database
called RYDLS-20 was developed, which consisted of multiple chest X-ray images.
The suggested method applied to RYDLS-20 yielded an F1-score of 0.65 using
a multi-class method for recognition of COVID-19 and an F1-score of 0.89 in the
hierarchical case. Punn et al. [13] suggested a random oversampling approach
based on pre-trained neural networks like ResNet, DenseNet, and others. State-of-the-art
results were achieved using a variety of weighted class loss function algorithms,
including Inception-V2, Inception-V3, Baseline ResNet, DenseNet169, and NASNetLarge.
Both binary classification (infected and normal cases) and multi-category
classification of COVID-19 and common pneumonia cases on posteroanterior
CXR images were used. Different metrics such as the area under the curve
(AUC), precision rate (PRE), recall rate (REC), and accuracy were utilized to assess
the model's performance. When compared to other methods, NASNetLarge produced
better results. Table 2 presents the results and future enhancements of existing
techniques. The existing methods with layers and values are shown in Table 3.

Table 2 Results and future enhancement of existing techniques

Author's names | Results | Future enhancement
Abbas et al. [9] | Accuracy 98.23%, Specificity 100%, Sensitivity 87.09% | To improve the model's functionality, an automated analysis component will be included
Raajan et al. [10] | Accuracy 95.09%, Specificity 81.89% | CNN will be enhanced with a pre-trained model
Zebin et al. [11] | Accuracy 96.8%, Recall 97.5% | TensorFlow Lite will be used for enhancing the capabilities of deep learning
Pereira et al. [12] | Score 0 to 16 | More parameters will be used for evaluating the efficiency
Punn et al. [13] | Specificity 98%, Precision 87%, Recall 90%, AUC 99%, ACC 98% | Pre-processing of images will be enhanced for better results

Table 3 Existing methods with layers and values

Author's name | Methods | Layers with values
Abbas et al. [9] | AlexNet | Convolutional layers: 5, fully connected layers: 3, max-pooling layers: 3 × 3
Raajan et al. [10] | CNN | Convolution layers: 50, pooling nodes: 3, identity layers: 16, fully connected layer: 1
Zebin et al. [11] | EfficientNetB0 | Convolution layer: 1, pooling layers: 3, fully connected layers: 5
Pereira et al. [12] | VGG19 | Convolutional layers: 16, fully connected layers: 3
Punn et al. [13] | Inception-ResNetV2, DenseNet169, NASNetLarge, Baseline ResNet | Convolution layers: 2, pooling layers: 4, fully connected layers: 169

3 Proposed Methodology

The proposed approach is a hybrid approach with different modules for feature
extraction, optimization, and classification. These modules help to process the image
and classify the infection from the uploaded samples. The objective of the first
module of the proposed methodology is to extract the features from the chest
X-ray images. Principal component analysis (PCA) is implemented for the feature
extraction phase and is used to reduce the dimensionality of the images. The principal
components are the particular patterns that PCA finds in the original data.
While progressing from 1 to n, the significance of each component declines,
indicating that the 1st principal component is the most important and the nth
principal component is the least significant. Once all the features are stored in feature
matrices, the optimization module takes place to reduce the dimensions of the extracted
features and reduce the error probability. PSO is a computational intelligence
approach in which each possible alternative is viewed as a particle moving through
the search space with a particular velocity. In the last stage of the
classification process, a multiclass support vector machine (MSVM) module is used to
classify the samples. The total dataset is divided into three parts (training, validation,
and testing) in a 60% (training)-20% (validation)-20% (testing) ratio.
Machine learning-based MSVM, termed multi-support vector machine categorization,
is a growing field of study that seeks to address detection problems across a variety
of fields. Essentially, MSVM uses certain specified kernels that fulfill Mercer's
condition to perform a linear separation in an augmented domain. All data points are
mapped into a high-dimensional domain, potentially of unbounded dimension,
so that linear separation is much more feasible. Subsequently, by maximizing the
margin between the categories in that region, a separating hyperplane is
discovered. As a result, the design as well as the qualities of the applied kernels
determine the complexity of the separating hyperplane.
The main motive of this research algorithm is to enhance the procedure of diagnosis

Fig. 2 Proposed work (flowchart: upload the chest X-ray images → resize the input image → convert 3D to 2D image → add artificial noise (salt & pepper) → smooth image calculated (median filter) → regions or edges calculated (Canny) → feature extraction using PCA → optimized classification using PSO + MSVM → model simulation (train and test), disease detected, evaluated parameters saved to database)

and detection/classification of COVID-19 CXR images. For this, a new hybrid
approach is used to enhance the accuracy rate, time consumption, and other parameters.
Generally, machine learning concepts are used, which follow two phases:
the training and testing processes. Figure 2 is the flowchart of the proposed
methodology, and the detailed steps are described below:

3.1 Medical Image Acquisition and Pre-processing Phase

This proposed work obtained the dataset from an online site, namely the KAGGLE
COVID-19-radiography-database [14]. The dataset is a collection of 133 chest X-ray
(CXR) images per type, of different types such as viral pneumonia, COVID, and
normal X-ray images. Initially, the CXR medical images are uploaded and resized,
with intensities set to the 0–255 range. In this research work, the 3D image is
converted to a 2D image, which optimizes the dimensional size of the input image.
Unwanted noise in the converted image is then handled: in the pre-processing step,
the median filter method is implemented to remove the unwanted noise from the
input image. After the filtration process, the ROI component is detected using the
Canny edge detection method.
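The pre-processing chain described above (3D-to-2D conversion, added salt-and-pepper noise, median filtering, and edge detection) can be sketched as follows. This is a minimal NumPy illustration with our own function names, not the authors' MATLAB implementation, and a simple gradient-magnitude edge map stands in for the full Canny detector:

```python
import numpy as np

def to_grayscale(img_rgb):
    # 3D (H, W, 3) image -> 2D (H, W), as in the 3D-to-2D conversion step
    return img_rgb.mean(axis=2)

def add_salt_pepper(img, amount=0.05, seed=None):
    # Corrupt roughly a fraction `amount` of pixels with salt (255) and pepper (0)
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < amount / 2] = 0.0
    noisy[mask > 1 - amount / 2] = 255.0
    return noisy

def median_filter3(img):
    # 3x3 median filter; borders handled by reflection padding
    padded = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    windows = np.stack([padded[i:i + h, j:j + w]
                        for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

def gradient_edges(img, thresh=30.0):
    # Simplified gradient-magnitude edge map (a stand-in for Canny)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    return np.hypot(gx, gy) > thresh
```

In practice the edge step would normally come from an image-processing library's Canny implementation rather than this simplified gradient threshold.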

3.2 Medical Image Feature Extraction Phase

In this phase, the principal component analysis (PCA) method is implemented for
feature extraction. This method is a dimensionality reduction method that searches for
the eigenvectors (V) of a covariance matrix with the maximum eigenvalues (E) and
then utilizes those to project the CXR image data into a novel sub-space of equal or
lower dimension. The PCA steps are: (i) the input image is converted into a matrix
(rows, columns), (ii) the mean value is calculated, (iii) the covariance matrix (Co) is
calculated, (iv) the eigenvalues and eigenvectors (E and V) are calculated, and (v) E
is sorted in decreasing order to rank the corresponding V.
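The five steps above can be sketched directly in code; this is a minimal NumPy illustration (the function name is ours), assuming each row of X is a flattened CXR image:

```python
import numpy as np

def pca_features(X, k):
    """Steps (i)-(v): project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)             # (ii) subtract the mean
    cov = np.cov(Xc, rowvar=False)      # (iii) covariance matrix Co
    evals, evecs = np.linalg.eigh(cov)  # (iv) eigenvalues E, eigenvectors V
    order = np.argsort(evals)[::-1]     # (v) sort E in decreasing order
    W = evecs[:, order[:k]]             # top-k ranked eigenvectors
    return Xc @ W, evals[order]
```

The returned eigenvalues are sorted so that the first principal component carries the most variance, matching the ranking described above.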

3.3 Medical Image Hybridization Method Used for Detection Phase

This proposed method implements a population-inspired meta-heuristic, or
instance selection method, using the PSO approach. It is a simple and effective
method that evaluates the selected feature sets based on a fitness function. The
approach is inspired by the flocking behavior of birds. Feature subset selection based
on the PSO method gives better performance and optimized feature vectors. The
optimized feature vector is passed to the MSVM classifier to predict the disease. After
extracting features, the hybrid classification is evaluated using the enhanced
MSVM. The validation and training procedures are among the significant phases
in developing a precise model using enhanced MSVM. The database for the
validation and training procedures consists of two sections: the training feature set,
which is utilized to train the classifier model, and the test feature sets, which are
utilized to identify the accuracy rate of the trained SVM model. The following
technique outlines the suggested hybrid PSO-MSVM approach: (i) PSO is started
with the overall population, inertia weight, and generations. (ii) Each particle's
fitness is evaluated. (iii) The global best and local best particles are computed
by analyzing fitness values. (iv) Individual particles' positions and velocities are
updated until the fitness values converge. (v) When it has converged, the global
best particle in the population is sent to the MSVM classification model for learning.
(vi) The MSVM classification method is trained. Lastly, the disease kind
with its precise value is evaluated, performance metrics such as accuracy rate,
sensitivity, specificity, and time consumption are calculated, and the results are compared
with the existing method.
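Steps (i)-(vi) can be sketched as a binary PSO over feature masks. This is a hedged illustration, not the authors' implementation: a nearest-centroid classifier stands in for the MSVM fitness function, and all names and constants (inertia weight, acceleration coefficients) are ours:

```python
import numpy as np

def pso_select(X, y, fitness, n_particles=10, iters=20, seed=0):
    # Binary PSO over feature masks; velocities pass through a sigmoid
    # to give each bit's probability of being 1 (steps (i)-(v) above).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.random((n_particles, d)) > 0.5                 # (i) population
    vel = rng.normal(0.0, 0.1, (n_particles, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(X, y, p) for p in pos])    # (ii) fitness
    gbest = pbest[np.argmax(pbest_fit)].copy()               # (iii) global best
    w, c1, c2 = 0.7, 1.5, 1.5                                # inertia weight
    for _ in range(iters):                                   # (iv) updates
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        vel = (w * vel
               + c1 * r1 * (pbest.astype(float) - pos.astype(float))
               + c2 * r2 * (gbest.astype(float) - pos.astype(float)))
        pos = rng.random((n_particles, d)) < 1.0 / (1.0 + np.exp(-vel))
        fit = np.array([fitness(X, y, p) for p in pos])
        better = fit > pbest_fit
        pbest[better] = pos[better]
        pbest_fit[better] = fit[better]
        gbest = pbest[np.argmax(pbest_fit)].copy()           # (v) best so far
    return gbest                                             # (vi) feeds classifier

def centroid_fitness(X, y, mask):
    # Stand-in for the MSVM fitness the chapter uses: nearest-centroid
    # training accuracy on the selected feature subset.
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    classes = np.unique(y)
    cents = np.stack([Xs[y == c].mean(axis=0) for c in classes])
    d2 = ((Xs[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return float((classes[np.argmin(d2, axis=1)] == y).mean())
```

In the chapter's pipeline, the mask returned by the PSO would select the PCA features passed on to the MSVM classifier.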

4 Simulation Result Analysis

The performance of the model is examined on the KAGGLE COVID-19 CXR images.
The dataset is sub-divided into various parts such as COVID-19 positive cases, normal
cases, non-COVID lung infections, and viral pneumonia images. The parameters used
for the examination of the model are accuracy, mean square error (MSE), sensitivity,
and specificity.

4.1 Dataset Description

Data collection is the most critical aspect of the project. We are using CXR
images in this project, which are divided into three sections. The chest X-ray image
dataset is freely available on the KAGGLE Website. There are a total of 477 images
present in the dataset, which is a collection of three different types of images. The
first type is COVID-19 positive case images, of which there are 133. The second type
is normal CXR medical images, which are 133 in number. The third type is viral
pneumonia images, which also hold 133 images. The dataset collection is thus defined
by different types: COVID, normal, and viral pneumonia [14].

4.2 Mathematical Formulas

(i) MSE: MSE is defined as the estimator that measures the mean of the squares of
the differences between expected and actual values [15]. The mathematical
equation of MSE is shown in Eq. 1.

MSE = (1/m) Σ_{i=1}^{m} (x_i − x̂_i)²   (1)

Here, MSE—mean square error, m—number of total data points, x_i—
expected values, x̂_i—actual values.
(ii) Accuracy: Accuracy refers to the model's capacity to produce a correct estimation;
in other words, the degree to which the calculated value is similar
to a conventional or genuine value [16]. The formula for accuracy is presented in Eq. 2.

Accuracy = (TP + TN) / (TP + FP + FN + TN)   (2)

Here, TP = true positive, TN = true negative, FP = false positive, and
FN = false negative.

(iii) Specificity: The fraction of negatives correctly detected (in other words,
the number of individuals who do not have the condition (unaffected) who
are correctly classified as not exhibiting the condition) is known as specificity
(true negative rate) [16]. The specificity is shown in Eq. 3.

Specificity = TN / (TN + FP)   (3)

Here, TN = true negative and FP = false positive.


(iv) Sensitivity: The fraction of positives that are correctly predicted (i.e., the number
of individuals who have the condition (affected) who are accurately
identified as having the condition) is referred to as sensitivity (true positive rate) [16].
The formula for sensitivity is presented in Eq. 4.

Sensitivity = TP / (TP + FN)   (4)

Here, TP = true positive and FN = false negative.
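Equations 1-4 translate directly into code; a minimal sketch:

```python
def mse(expected, actual):
    # Eq. 1: mean of the squared differences between expected and actual values
    m = len(expected)
    return sum((x - xh) ** 2 for x, xh in zip(expected, actual)) / m

def classification_metrics(tp, tn, fp, fn):
    # Eqs. 2-4 computed from the confusion-matrix counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity
```

For example, counts of TP = 98, TN = 97, FP = 3, FN = 2 give an accuracy of 0.975, specificity of 0.97, and sensitivity of 0.98.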

4.3 Result Analysis

The proposed work is performed in the MATrix Laboratory (MATLAB) simulation
tool with a GUI-designed desktop application. A total of 133 × 3 images
have been used in these experimental results, including COVID-19, COVID-19 pneumonia,
and normal CXR (chest X-ray) images. Almost all 133 × 3 CXR images have
been used for training (knowledge domain), and the testing module has used 97 images
for the detection system. The testing domain represents the proposed hybrid
(PSO + MSVM) classification approach; long-term dataset CXR images of COVID-19
diseases have been selected for the testing domain.
Figure 3 shows the uploaded COVID-19 disease category CXR image and its
resizing to the 0–255 range. It converts the 3D

Fig. 3 Upload test input, resize, and grayscale CXR image



CXR image into a 2D CXR image and calculates the grayscale image, mitigating
the dimensionality of the uploaded CXR images.
Figure 4 shows and identifies the noisy data in the uploaded test CXR image.
First, the median filtration method is applied to produce the smooth, noise-free
image. Second, the Canny edge detector method is applied to calculate the edges
of the filtered image. It works very smoothly without affecting the edge features.
Figure 5 shows the feature extraction line-graph using the PCA algorithm. PCA
is utilized to minimize the size of the chest X-ray images as well as to remove
unwanted feature vectors. The images of the dataset are of m × n pixel size. The
identical features of the chest X-ray images are identified with the help of
eigenvectors and eigenvalues. The input image is compared with other images of the
dataset, the features of the images are compared, and then the similar features
are extracted.
Figure 6 shows the feature selection procedure using the PSO method. This method
chooses the reliable or valuable feature sets and defines them in matrix format.
The approach depends on the proposed performance metric modification and a varied
search strategy, and modifies the solution space to simplify the search using various
global and local features. After that, the MSVM algorithm is implemented on the
selected features to classify the disease. The message box shows the diagnosed disease
category in the various CXR images. The diagnosis detection procedure is done by
PSO + MSVM (hybridization).

Fig. 4 Noisy, filtered, and edge detection images

Fig. 5 Feature extraction (PCA)

Fig. 6 Diagnose detection

Figure 7 shows the accuracy rate of the proposed
model. Figure 8 defines the confusion matrix of the proposed model. This proposed
model has improved the accuracy rate compared with the existing model. After
the accuracy graph, the confusion matrix is shown, represented by TN, TP, FN, and
FP (TP = true positive, TN = true negative, FN = false negative, and FP = false
positive). Figures 9, 10, and 11

Fig. 7 Accuracy rate with proposed model

Fig. 8 Confusion matrix for hybrid method (PSO + MSVM)

Fig. 9 Comparison—accuracy (Acc.) rate

Fig. 10 Comparison—sensitivity (SN) rate



Fig. 11 Comparison—specificity (SP) rate

Table 4 Comparison performance

Parameters | Proposed hybridization PSO + MSVM | DeTrac | GoogleNet | SqueezeNet
Acc (%) | 97.51 (~98) | 97.30 | 89.60 | 82.70
SP (%) | 98 | 98.20 | 90.30 | 83.80
SN (%) | 97.49 | 96.30 | 88.80 | 81.40

show the comparison analysis of the proposed and existing methods, namely GoogleNet,
SqueezeNet, and the DeTrac method. This proposed method has improved the accuracy
rate, specificity (SP), and sensitivity (SN).
Table 4 shows the comparison between the hybrid and existing methods, using the
PSO + MSVM, DeTrac, GoogleNet, and SqueezeNet classifiers. The proposed
system's accuracy is 97.51% with hybrid PSO + MSVM; the DeTrac
accuracy value is 97.30%, the GoogleNet value is 89.60%, and the SqueezeNet value is
82.70%. The existing DeTrac approach's SP value is 98.20%, GoogleNet's
is 90.30%, and the SqueezeNet classifier's value is 83.80%. The existing DeTrac approach's
SN value is 96.30%, GoogleNet's is 88.80%, and the SqueezeNet classifier's
value is 81.40%. The proposed work's performance parameters are an Acc. value
of 97.51%, an SP value of 97.49%, an SN value of 98.0%, and an MSE value of 0.0030.

5 Conclusion and Future Scope

It is concluded that COVID-19 (novel coronavirus illness) is a serious illness that has
killed millions of people and infected many all over the world. There are
ample approaches for the detection of COVID-19, but still, there are some challenges

that are not resolved by existing methods. The most challenging parameter is the
accuracy of the methods. A hybrid methodology has been used here for the detection
of COVID-19. Digital image processing is very helpful in the medical field, and
several of its advantages are discussed in this work.
Machine learning (ML) and optimization techniques are used for efficient results.
The MSVM is used for the classification of images. The KAGGLE COVID-19 CXR
image dataset is used for training and testing the network. There are
around 133 × 3 images in the dataset, divided into three sub-parts: COVID-19
positive cases, normal cases, and viral pneumonia cases.
The proposed system is compared with existing systems (DeTrac, GoogleNet, and
SqueezeNet) and achieves better performance. The performance of the proposed
methodology is examined with different parameters, achieving an accuracy
of 97.51%, specificity of 97.49%, and sensitivity of 98.0%.
Moreover, there are certain limitations in this research work. A more extensive analysis,
in particular, necessitates a larger amount of healthcare data, particularly COVID-19
statistics.
For more efficient results, in the future, a pre-trained feature extractor will be
implemented. More CXR images will be used for validation of the model. An effective
method will be introduced to improve the precision rate and mitigate time
consumption.

References

1. M. Ghaderzadeh, F. Asadi, Deep learning in the detection and diagnosis of COVID-19 using
radiology modalities: a systematic review. J. Healthcare Eng. 2021 (2021). https://doi.org/10.
1155/2021/6677314
2. N. Chen et al., Epidemiological and clinical characteristics of 99 cases of 2019 novel
coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet 395(10223), 507–513
(2020)
3. S. Minaee et al., Deep-COVID: predicting covid-19 from chest X-ray images using deep transfer
learning. Med. Image Anal. 65, 101794 (2020)
4. N. Zhu et al., A novel coronavirus from patients with pneumonia in China, 2019. New Engl. J.
Medicine (2020). https://doi.org/10.1056/NEJMoa2001017
5. W.G. Dos Santos, Natural history of COVID-19 and current knowledge on treatment therapeutic
options. Biomed. Pharmacother. 110493 (2020)
6. T. Singhal, A review of coronavirus disease-2019 (COVID-19). Indian J. Pediatr. 87(4), 281–
286 (2020)
7. Q. Cai, S.-Y. Du, S. Gao, G.L. Huang, Z. Zhang, S. Li, X. Wang, P.-L. Li, P. Lv, G. Hou, L.-N.
Zhang, A model based on CT radiomic features for predicting RT-PCR becoming negative in
coronavirus disease 2019 (COVID-19) patients. BMC Med. Imag. 20(1), 1–10 (2020)
8. A. Mohanty, A. Kabi, S. Kumar, V. Hada, Role of rapid antigen test in the diagnosis of COVID-
19 in India. J. Adv. Med. Med. Res 77–80 (2020)
9. A. Abbas, M.M. Abdelsamea, M.M. Gaber, Classification of COVID-19 in chest X-ray images
using DeTraC deep convolutional neural network. Appl. Intell. 51(2), 854–864 (2021)
10. N.R. Raajan, V.S. Lakshmi, N. Prabaharan, Non-invasive technique-based novel corona
(COVID-19) virus detection using CNN. Nat. Acad. Sci. Lett. 44(4), 347–350 (2021)

11. T. Zebin, S. Rezvy, COVID-19 detection and disease progression visualization: deep learning
on chest X-rays for classification and coarse localization. Appl. Intell. 51(2), 1010–1021 (2021)
12. R.M. Pereira et al., COVID-19 identification in chest X-ray images on flat and hierarchical
classification scenarios. Comput. Methods Programs Biomed. 194, 105532 (2020)
13. N.S. Punn, S.K. Sonbhadra, S. Agarwal, COVID-19 epidemic analysis using machine learning
and deep learning algorithms. MedRxiv (2020)
14. COVID-19 Radiography Database (2021). Retrieved 1 April 2021, from https://www.kaggle.
com/tawsifurrahman/covid19-radiography-database
15. Z. Wang, A.C. Bovik, Mean squared error: love it or leave it? A new look at signal fidelity
measures. IEEE Signal Process. Mag. 26(1), 98–117 (2009)
16. F. Khatami, M. Saatchi, S.S.T. Zadeh, Z.S. Aghamir, A.N. Shabestari, L.O. Reis, S.M.K.
Aghamir, A meta-analysis of accuracy and sensitivity of chest CT and RT-PCR in COVID-19
diagnosis. Sci. Rep. 10(1), 1–12 (2020)
QCM Sensor-Based Alcohol Classification Using Ensembled Stacking Model

Pemmada Suresh Kumar, Rajyalaxmi Pedada, Janmenjoy Nayak,
H. S. Behera, G. M. Sai Pratyusha, and Vanaja Velugula

Abstract Alcohol consumption is a global burden of injury and disease, as attributed
in early studies. Excessive intake of alcohol is coupled with unconstructive
consequences and jeopardizes future prospects. This paper presents an ensemble
model using an array of five quartz crystal microbalance (QCM) sensors, coated with
different chemical compounds, to find the corresponding compositions of a gas mixture.
This study makes use of QCM sensor responses to determine the gas compositions.
These physical device sensors sense the resonance frequency change of gas sensors,
classifying the chemical compounds and recognizing their harmful effects. The
main focus of the study is to determine the reaction of QCM sensors to five different
alcohols, namely 1-octanol, 1-propanol, 2-butanol, 2-propanol, and 1-isobutanol, and
to determine the effective sensor type for the classification of these compounds. The
experiment is conducted to classify and identify the constituent component amounts
through an ensemble classifier to improve the efficiency of the QCM sensors. The
results of 125 different scenarios illustrate that various alcohols can be classified
effectively using a stacking classifier on the QCM sensor data.

Keywords QCM sensor · Alcohol · Machine learning · Stacking · Ensemble
learning

P. Suresh Kumar
Department of Computer Science and Engineering, Aditya Institute of Technology and
Management (AITAM), Tekkali, India
H. S. Behera
Department of Information Technology, Veer Surendra Sai University of Technology, Burla
768018, India
R. Pedada · G. M. Sai Pratyusha · V. Velugula
Department of Computer Science and Engineering, Dr. Lankapalli Bullayya College of
Engineering, Visakhapatnam 530013, India
J. Nayak (B)
Department of Computer Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University,
Baripada, Odisha 757003, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 651
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_49
652 P. Suresh Kumar et al.

1 Introduction

Nowadays, the detection of chemical compounds and their effects plays an important
role. Alcohols have adverse effects when used in high quantities. Generally,
alcohols like ethanol, propanol, and methanol are used in many skincare products,
medicines, drugs, and cleansers. Some food items also contain alcohol for the long
durability of the item and are unintentionally consumed by humans. In order to reduce
the effects caused by alcohol, there must be intelligent computing techniques
that detect alcohol. Pattern classification with gas sensors is well suited for
recognition, detection, and classification. For the classification of alcohol, a highly
selective sensor is required. Hence, QCM is a promising technology for detecting
alcohols. QCM is an acoustic sensor having an array of gas sensors. The fact behind its
sensing ability is that it detects the mass deposited on its crystal surface by estimating
the change in its resonance frequency (Δf). QCM is an e-nose sensor that resembles the
human nose in detecting alcohols through their odor, since different alcohols have different
aromas.
Many industries strive for technologies that inexpensively detect chemical
compounds. The QCM sensor offers high sensitivity, stability, low cost, low power
requirements, small size, and low weight. Its thin sensing layer makes it suitable
for gas absorption studies [1]. An alcohol placed on the thin layer of the QCM is
identified by the change in the crystal's fundamental oscillation frequency, as
shown in Eq. 1.

f = (Cf · f0² · m) / A    (1)

where
A is the area of the sensitive layer,
Cf is the mass sensitivity constant of the quartz crystal,
f0 is the fundamental resonance frequency of the quartz crystal, and
m is the mass change.
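As a quick numerical sanity check, Eq. 1 can be evaluated directly; the constant, geometry, and mass values below are illustrative assumptions for demonstration only, not values taken from the paper:

```python
# Illustrative evaluation of the QCM frequency-shift relation in Eq. 1.
# All numeric values are assumed for demonstration purposes.
C_f = 2.26e-6   # mass sensitivity constant of the quartz crystal (assumed)
f0 = 10e6       # fundamental resonance frequency, Hz (assumed)
A = 0.2         # area of the sensitive layer, cm^2 (assumed)
m = 1e-9        # deposited mass change, g (assumed)

f = (C_f * f0**2 * m) / A   # resulting frequency shift, Hz
print(f"frequency shift: {f:.2f} Hz")
```

The shift grows with the square of the fundamental frequency, which is why higher-frequency crystals give more sensitive mass detection.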
The array of gas sensors serves as the detecting system for measuring alcohol
mixtures. The sensors produce a signal when a mass is placed on the thin layer. The
signals from the sensors are preprocessed with preprocessing algorithms and assembled
into a dataset. Machine learning algorithms are then trained on the dataset to find
the corresponding composition of alcohols [2].
Machine learning makes such data more accessible because it can process, analyze,
and derive new data from existing data, and this process can be automated. Machine
learning models can be upgraded continuously: new information can be added without
changing the historical data, so new features can be added to a model and the
algorithm improved for more accurate results. However, machine learning models
sometimes face time constraints when data are scarce, and more training of the system
is required for it to make predictions and decisions reliably. The algorithms must
therefore be reliable techniques for verifying the data [3].
QCM Sensor-Based Alcohol Classification Using Ensembled … 653

Artificial neural networks (ANNs) are among the most widely used machine learning
techniques for alcohol classification [4]. Classification results serve different
applications, such as studying the effects of alcohol in cosmetics and hygiene
products, predicting flavors in Chinese liquors [5], and estimating anesthesia doses
[6]. Palaniappan et al. [7] addressed the problem of alcohol classification and
proposed an array of operational amplifiers. Because of their predictive and
decision-making ability, many machine learning classification models are used to find
the type of alcohol placed on a QCM. Connor et al. [8] combined clinical experience
with machine learning algorithms such as decision trees, which promise a better
understanding of complex relationships. Ordukaya and Karlik [9] analyzed raw data
collected with an e-nose from fruit juice and alcohol mixtures for halal
authentication of the fruit juice, using KNN and support vector machines, which are
employed in most such studies. Hybrid models have also been applied: Kanna et al.
[10] utilized a multilayer perceptron trained with the backpropagation algorithm to
classify alcohol abusers based on features extracted from the gamma band spectral
power of the multichannel visual evoked potential signal in the time domain. Ensemble
learning is a machine learning technique that produces the best predictive output
from a set of base models; by exploiting the strengths of individual machine learning
techniques while mitigating their weaknesses, it allows the best technique to be
chosen for the data and the best outcome to be predicted.
The significant contributions of this research include:
(a) A stacking classifier is proposed to classify the different types of alcohol.
(b) The alcohol classification experiments are conducted using the dataset from the
UCI repository proposed by Fatih et al. [11].
(c) The performance of the proposed stacking classifier is compared with several
machine learning methods such as stochastic gradient descent (SGD), Gaussian
naive Bayes (GNB), quadratic discriminant analysis (QDA), multilayer perceptron
(MLP), linear discriminant analysis (LDA), logistic regression (LR), decision
trees (DT), K-nearest neighbor (KNN), gradient boosting (GB), and AdaBoost.
The results show that the proposed method outperformed the compared methods.
The remainder of the paper is organized as follows: Sect. 2 presents the literature
study on alcohol classification using QCM sensors. Section 3 describes the proposed
methodology. Section 4 explains the environment setup, including the dataset,
performance measures, and experimental setup. The result analysis of the various
machine learning and ensemble learning algorithms on alcohol classification using
QCM sensors is discussed in Sect. 5, and Sect. 6 concludes the paper.

2 Literature Study

Adak et al. [12] developed a model to classify the alcohols sensed by QCM sensors
with different characteristics using an artificial bee colony (ABC)-based neural
network. Optimization is achieved by the artificial bee colony algorithm, which is
based on the nectar-searching behavior of bees.
The performance of the proposed model is evaluated using mean absolute percentage

error and mean square error. Based on the evaluation results of 300 scenarios, the
proposed method successfully classified the alcohols.
Katardjiev et al. [13] applied support vector machine, random forest, decision
tree, and K-nearest neighbor algorithms to clinical trial data of alcohol-addicted
patients collected by an Uppsala-based company for alcohol relapse forecasting. A
K-nearest neighbor predictor fitted with a radial basis function (RBF) kernel modeled
the data best, producing the best results for explained variance and root mean square
error. Hence, it is evident that ML-based models help predict relapse in addicted
patients.
Li et al. [5] proposed a random forest technique, optimized by tuning the number of
decision trees, to predict the flavors of Chinese liquors. The proposed random forest
classifier is compared with various machine learning classifiers such as linear
discriminant analysis, backpropagation artificial neural network (BP-ANN), and
support vector machine. The modified random forest outperformed the other models in
terms of accuracy.
Some more literature on alcohol classification in QCM sensors using different
machine learning algorithms has been presented in Table 1.

Table 1 Literature on alcohol classification with QCM sensors using machine learning

S. No. | Author | Year | Intelligent method | Dataset | Evaluation factors | Reference
1 | Triyana et al. | 2018 | Chitosan-based QCM | Ethanol, n-propanol, isoamyl alcohol, and n-amyl alcohol | Resonance frequency, molecular weight, vapor pressure, boiling point, sensitivity | [14]
2 | Pisutaporn et al. | 2018 | Decision tree, random forest | Dataset of Portuguese students | Precision, recall, accuracy | [15]
3 | Zhu et al. | 2018 | Random forest | Forty-six controls and 46 short-term abstinent patients | Accuracy, precision | [16]
4 | Palaniappan et al. | 2017 | MLP | Student alcohol consumption dataset | Accuracy, squared error, and root mean squared error | [7]
5 | Ordukaya and Karlik | 2016 | Naïve Bayesian, KNN, LDA, DT, ANN, and SVM classifiers | Halal authentication of fruit juice–alcohol mixture set | Sensitivity, specificity, precision, accuracy, error rates | [9]

3 Proposed Methodology

Stacking is a supervised machine learning method that provides an optimal
amalgamation of predictions and supports binary classification, multi-class
classification, and regression. Stacking, also called stacked regression [17] or
super learner [18], was developed in 1992 [19]. Although it was introduced many years
ago, bagging and boosting are used more widely than stacking, which is challenging to
examine theoretically. The working principle of stacking differs from that of bagging
and boosting: the latter two combine base learners of the same type, while stacking
combines different types of base learners. It involves a second level of training,
called a meta-learner, that finds the optimal prediction from the combination of base
learners. The base-level learners are generated by applying various learning
algorithms to a given dataset [20].

Algorithm: Ensembled Stacking Classifier

Input: Training data DS = {(x_i, y_i)}, i = 1, ..., m
Output: Ensemble classifier H

Level 1: Learn the base-level classifiers
for l = 1 to L do
    learn h_l based on DS
end for
Level 2: Create a new dataset of predictions
for i = 1 to m do
    DS_h = {(x'_i, y_i)}, where x'_i = (h_1(x_i), ..., h_L(x_i))
end for
Level 3: Learn the meta-classifier
learn H based on DS_h
return H

Consider features X = {x_i ∈ R^m}, a set of class labels Y = {y_i ∈ N}, and training
data DS = {(x_i, y_i)}, i = 1, ..., m; the learning model M is trained on DS. At the
first level, learning is performed on the original training dataset with distributed
weights, and the learning parameters of the base classifiers are tuned. A new dataset
is then created for prediction: the labels output by the first-level classifiers are
treated as new features. Instead of the predicted labels, the probability estimates
of the first-level classifiers can be used.
The proposed stacking approach utilizes three base classifiers, K-nearest neighbor,
random forest, and Gaussian naive Bayes, to predict the alcohol type, and logistic
regression serves as the meta-classifier. The integration of the independent methods
and the overall process structure of the proposed model are represented in Fig. 1.
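Since the paper reports a Scikit-learn implementation, the two-level design above can be sketched with scikit-learn's stacking API; the synthetic data and exact parameter values below are illustrative stand-ins, not the authors' setup:

```python
# Sketch of the proposed stacking ensemble: KNN, random forest, and Gaussian
# naive Bayes as level-1 learners, logistic regression as the meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the QCM sensor features and the five alcohol classes.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)          # 80:20 split as in the paper

# Level-1 base learners.
base = [("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(random_state=1)),
        ("gnb", GaussianNB())]

# Level-2 meta-classifier trained on the base learners' class probabilities.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba")
stack.fit(X_train, y_train)
acc = stack.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```

Feeding the meta-classifier probability estimates rather than hard labels mirrors the use of probability estimators described above.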

Fig. 1 Framework of the proposed stacking classifier

4 Environment Setup

In this section, we discuss the dataset used for experimentation and the performance
measures considered to evaluate the models, and finally, we explain the simulation
environment and parameter settings of the various models. Data preprocessing is
applied to the dataset through data cleaning methods, and the data are then
transformed into structured vectors. These vectors are divided in an 80:20 ratio for
training and testing. We used an Intel i5 processor with 6 GB RAM on the Windows 10
operating system. The proposed method and the various machine learning algorithms
are implemented using Scikit-learn, an open-source machine learning library based on
Python.

4.1 Empirical Data

The alcohol QCM sensor data considered in this experiment are openly available in
the UCI machine learning repository [11]. Five different gas sensors are used for
classification, with measurements of the gases 1-octanol, 1-propanol, 2-butanol,
2-propanol, and 1-isobutanol [12]. The feature distribution helps determine the
dataset's characteristics; from it, we can identify the data's possible temporal
range and the recurrence of occurrences. Compared with partially or fully skewed
features, normally distributed features are more useful for obtaining good accuracy.
The feature distribution of the alcohol dataset is shown in Fig. 2.

Fig. 2 Feature distribution of alcohol QCM sensor data



4.2 Performance Measure

The dataset is analyzed, and the performance is evaluated with various evaluation
metrics: True Positive (TP), False Positive (FP), True Negative (TN), False Negative
(FN), True Negative Rate (TNR), False Negative Rate (FNR), accuracy, recall, and
precision [21].
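These per-class quantities can be derived from a multi-class confusion matrix. The sketch below uses made-up labels and predictions, not the paper's results, to show the bookkeeping for a one-vs-rest evaluation:

```python
# Deriving per-class TP/FP/FN/TN, precision, and recall from a confusion matrix.
from sklearn.metrics import confusion_matrix

# Illustrative ground truth and predictions for three of the alcohol classes.
y_true = ["1-octanol", "1-propanol", "2-butanol", "1-octanol", "2-butanol"]
y_pred = ["1-octanol", "2-butanol", "2-butanol", "1-octanol", "1-propanol"]
labels = ["1-octanol", "1-propanol", "2-butanol"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, cols: predicted
for i, label in enumerate(labels):
    tp = cm[i, i]                       # correctly predicted as this class
    fp = cm[:, i].sum() - tp            # other classes predicted as this class
    fn = cm[i, :].sum() - tp            # this class predicted as something else
    tn = cm.sum() - tp - fp - fn        # everything else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(label, tp, fp, fn, tn, round(precision, 2), round(recall, 2))
```

The same decomposition underlies every per-class row of the results table, with accuracy computed as (TP + TN) / (TP + TN + FP + FN).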

4.3 Experimental Setup

The experimentation was carried out on the alcohol dataset obtained with the QCM
sensor, available in the UCI machine learning repository. Other competitive ML-based
algorithms are simulated on the QCM sensor alcohol dataset along with the stacking
classifier to assess their performance. The proposed method is compared with various
machine learning techniques such as K-nearest neighbor, decision trees, stochastic
gradient descent, Gaussian naive Bayes, logistic regression, linear discriminant
analysis, multilayer perceptron, and quadratic discriminant analysis, as well as
ensemble methods such as AdaBoost and gradient boosting. The parameter settings for
the proposed method and the compared methods are shown in Table 2.

5 Result Analysis

This section presents the performance of the ensemble stacking classifier and the
various machine learning models, summarized in Table 3. The SGD and GNB classifiers
deliver an enormous misclassification rate for all classes; their accuracies are
0.37 and 0.44, respectively.
The QDA and MLP classifiers produced precisely the same results. The 1-octanol class
is categorized accurately. The 1-propanol class has 12 instances classified
correctly, with the remaining 8 misclassified as 2-butanol, 2-propanol, and
1-isobutanol. The 2-butanol class has 9 instances classified correctly, with the
remaining 11 misclassified as 1-propanol, 2-propanol, and 1-isobutanol. For the
2-propanol class, 17 instances are classified correctly, and 6 instances of the
1-propanol and 2-butanol classes are misclassified as 2-propanol. The 1-isobutanol
class has 20 instances classified correctly, but 3 instances of 1-propanol and
2-butanol are misclassified as 1-isobutanol. The overall accuracy of both classifiers
is 0.77, whereas an individual accuracy of 100% is achieved for the 1-octanol class:
its false positives, false negatives, and FPR are 0, and its recall, F1-score, and
precision are 1, better than for the other classes.
The LDA classification classifies the classes 1-octanol and 1-isobutanol precisely,
i.e., the individual accuracy of these classes is 100%. The 1-propanol class shows
that 15 of them are classified correctly, and the remaining 5 are misclassified as

Table 2 Parameter setting of various models with ‘Alcohol using QCM sensor’
Technique Parameter setting
K-nearest neighbor n_neighbors = 5,
weights = ‘uniform’,
algorithm = ‘auto’
Decision tree criterion = ‘gini’,
splitter = ‘best’,
max_depth = none,
Stochastic gradient descent Solver = ‘hinge’
max_iter = 1000
Gaussian Naive Bayes var_smoothing = 1e-09
Logistic regression random_state = 1,
solver = ‘newton-cg’
Linear discriminant analysis solver = ‘svd’
Multilayer perceptron activation = ‘logistic’,
batch_size = 10,
random_state = 2,
solver = ‘adam’
Quadratic discriminant analysis reg_param = 0.0,
store_covariance = False,
tol = 0.0001
AdaBoost base_estimator = DecisionTreeClassifier(max_depth = 5),
n_estimators = 50,
learning_rate = 1.0
Bagging base_estimator = DecisionTreeClassifier(),
n_estimators = 10,
max_samples = 1.0
max_features = 1.0,
Gradient boosting n_estimators = 40,
random_state = 1
Stacking classifiers = [KNN, RF, GNB], meta_classifier =
LogisticRegression(),
use_probas = True, use_clones = False

2-butanol and 2-propanol. The 2-butanol class shows that 14 of them are classified
correctly, and the remaining 6 are misclassified as 1-propanol and 2-propanol. The
2-propanol class shows that 14 of them are classified correctly, and the remaining 7
are misclassified as 1-propanol and 2-butanol. The individual accuracies of the
1-propanol, 2-butanol, and 2-propanol classes are 0.92, 0.87, and 0.87. The LDA gives
an overall accuracy of 0.83.
For the LR classifier, the 1-octanol and 1-isobutanol classes are classified
precisely; therefore, the TPR, F1-score, precision, and accuracy of these classes
are 100%. The 1-propanol class shows that 16 of them are classified correctly, and
the remaining 4 are misclassified as 2-butanol and 2-propanol. The 2-butanol class
shows that 16 of them are classified correctly, and the remaining 4

Table 3 Result analysis of proposed method and various classifiers


Class label Model TP TN FP FN TPR FPR F1-score Precision Accuracy Overall accuracy
1-octanol SGD 4 77 3 15 0.21 0.04 0.31 0.57 0.81 0.37
GNB 17 80 0 2 0.89 0.00 0.94 1.00 0.97 0.44
QDA 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.77
MLP 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.77
LDA 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.83
LR 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.9
DT 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.94
GB 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.95
KNN 16 80 0 3 0.84 0.00 0.91 1.00 0.96 0.97
AdaBoost 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.98
Stacking 19 80 0 0 1.00 0.00 1.00 1.00 1.00 0.99
1-propanol SGD 4 75 4 16 0.20 0.05 0.29 0.50 0.79
GNB 5 71 8 15 0.25 0.10 0.30 0.38 0.76
QDA 12 72 7 8 0.60 0.09 0.62 0.63 0.84
MLP 12 72 7 8 0.60 0.09 0.62 0.63 0.84
LDA 15 76 3 5 0.75 0.04 0.79 0.83 0.92
LR 16 75 4 4 0.80 0.05 0.80 0.80 0.92
DT 18 79 0 2 0.90 0.00 0.94 1.00 0.97
GB 18 76 3 2 0.90 0.04 0.88 0.86 0.95
KNN 20 78 1 0 1.00 0.01 0.98 0.95 0.98
AdaBoost 19 78 1 1 0.95 0.01 0.95 0.95 0.98
Stacking 20 78 1 0 1.00 0.01 0.98 0.95 0.98
2-butanol SGD 9 59 20 11 0.45 0.25 0.37 0.31 0.68
GNB 10 65 14 10 0.50 0.18 0.45 0.42 0.75
QDA 9 73 6 11 0.45 0.08 0.51 0.60 0.82
MLP 9 73 6 11 0.45 0.08 0.51 0.60 0.82
LDA 14 72 7 6 0.70 0.09 0.68 0.67 0.87
LR 16 75 4 4 0.80 0.05 0.80 0.80 0.92
DT 20 76 3 0 1.00 0.03 0.93 0.86 0.96
GB 20 78 1 0 1.00 0.01 0.98 0.95 0.99
KNN 20 79 0 0 1.00 0.00 1.00 1.00 1.00
AdaBoost 20 78 1 0 1.00 0.01 0.98 0.95 0.98
Stacking 20 79 0 0 1.00 0.00 1.00 1.00 1.00
2-propanol SGD 10 61 18 10 0.50 0.23 0.42 0.36 0.71
GNB 11 47 32 9 0.55 0.41 0.35 0.26 0.58
QDA 17 73 6 3 0.85 0.08 0.79 0.74 0.90
MLP 17 73 6 3 0.85 0.08 0.79 0.74 0.90
LDA 14 72 7 6 0.70 0.09 0.68 0.67 0.87
LR 18 77 2 2 0.90 0.03 0.90 0.90 0.96
DT 19 78 1 1 0.95 0.01 0.95 0.95 0.97
GB 18 78 1 2 0.90 0.01 0.92 0.95 0.97
KNN 20 79 0 0 1.00 0.00 1.00 1.00 1.00
AdaBoost 20 79 0 0 1.00 0.00 1.00 1.00 1.00
Stacking 20 79 0 0 1.00 0.00 1.00 1.00 1.00
1-isobutanol SGD 10 62 17 10 0.50 0.22 0.43 0.37 0.72
GNB 1 78 1 19 0.05 0.01 0.09 0.50 0.79
QDA 20 76 3 0 1.00 0.04 0.93 0.87 0.96
MLP 20 76 3 0 1.00 0.04 0.93 0.87 0.96
LDA 20 79 0 0 1.00 0.00 1.00 1.00 1.00
LR 20 79 0 0 1.00 0.00 1.00 1.00 1.00
DT 18 78 1 2 0.90 0.01 0.92 0.94 0.96
GB 19 79 0 1 0.95 0.00 0.97 1.00 0.99
KNN 20 77 2 0 1.00 0.03 0.95 0.91 0.97
AdaBoost 19 79 0 1 0.95 0.00 0.97 1.00 0.98
Stacking 19 79 0 1 0.95 0.00 0.97 1.00 0.98

are misclassified as 1-propanol and 2-propanol. The 2-propanol class shows that 18
of them are classified correctly, and the remaining 2 are misclassified as 1-propanol
and 2-butanol each. The LR classifier achieves an overall accuracy of 0.90.
The DT classifier classifies the 1-octanol class precisely; therefore, its TPR,
F1-score, precision, and accuracy are 1. The 1-propanol class has 18 instances
classified correctly, and 2 instances are misclassified as 2-butanol and 2-propanol.
The 2-butanol class has 20 instances classified correctly, while 3 instances of
1-propanol and 1-isobutanol are misclassified as 2-butanol. The 2-propanol class has
19 instances classified correctly, and 1 instance is wrongly predicted as
1-isobutanol. The 1-isobutanol class has 18 instances classified correctly, and 2
instances are misclassified as 2-butanol. The DT classifier thus classifies each
class approximately correctly and achieves an overall accuracy of 0.94. The
1-octanol class achieves an individual accuracy of 1.00; 1-propanol, with 2 false
negatives, and 2-butanol, with 3 false positives, give individual accuracies of 0.97
and 0.96; and the 2-propanol and 1-isobutanol classes attain individual accuracies
of 0.97 and 0.96.
The GB classifier predicts the 1-octanol class precisely; therefore, its TPR,
F1-score, precision, and accuracy are 1. The 1-isobutanol class has 19 instances
classified correctly, and the single remaining instance is misclassified as
1-propanol. The 1-propanol class has 18 instances classified correctly, and the
remaining 2 are misclassified as 2-butanol and 2-propanol. The 2-butanol class has 20
instances classified correctly, and one instance of 1-propanol is wrongly predicted
as 2-butanol. The 2-propanol class has 18 instances classified correctly, and only 2
are wrongly predicted as 1-propanol. The 1-octanol class achieves an individual
accuracy of 1.00, 2-butanol and 1-isobutanol achieve individual accuracies of 0.99,
and 2-propanol and 1-propanol achieve individual accuracies of 0.97 and 0.95. The
FPR for 2-butanol and 2-propanol is 0.01, i.e., one instance is misclassified for
each, and the classifier achieves an overall accuracy of 0.95.
The KNN classifier classifies the 2-butanol and 2-propanol classes precisely. The
1-octanol class has 16 instances classified correctly, and three instances are
misclassified as 1-propanol and 1-isobutanol. The 1-propanol class has 20 instances
classified correctly, while one instance of 1-octanol is misclassified as 1-propanol.
The 1-isobutanol class has 20 instances classified correctly, while two instances of
1-octanol are misclassified as 1-isobutanol. The FPR is 0.01 for 1-propanol, 0.03 for
1-isobutanol, and 0 for the other classes. The precision is 1.00 for 1-octanol,
2-butanol, and 2-propanol, 0.95 for 1-propanol, and 0.91 for 1-isobutanol. The KNN
classifier achieves an overall accuracy of 0.97.
The AdaBoost classifier classifies the 1-octanol and 2-propanol classes precisely.
The 1-propanol class has 19 instances classified correctly, and one instance is
wrongly classified as 2-butanol. The 2-butanol class has 20 instances classified
correctly, and one instance of 1-propanol is improperly classified as 2-butanol.
Table 3 shows that only two instances are wrongly classified overall, i.e., one
1-propanol instance is misclassified as 2-butanol and one 1-isobutanol instance is
misclassified as 1-propanol, giving an overall accuracy of 0.98.
The stacking classifier precisely classifies all classes except 1-propanol and
1-isobutanol. The 1-propanol class has 20 instances classified correctly, while one
instance of 1-isobutanol is wrongly classified as 1-propanol. The false positives of
all classes are 0, except for 1-propanol, which has 1. The F1-score and precision for
1-octanol, 2-butanol, and 2-propanol are 1. The proposed method gives an overall
accuracy of 0.99.
Figure 3 shows the ROC curves of all the models, and Fig. 4 presents various graph
plots for the TPR, FPR, F1-score, precision, accuracy, and overall accuracies of the
models for the dataset classes.
Figure 4 illustrates the TPR and F1-score of all the models with respect to the
target feature: GNB gives the lowest values for 1-isobutanol, and SGD for 1-propanol
and 1-octanol. GNB and SGD yield the highest FPR for the 2-propanol and 2-butanol
classes compared with the other classes across the various models. The F1-score
reaches 1 for the 1-isobutanol class with the LDA and LR models, and the QDA, MLP,
LR, LDA, DT, and GBC models reach an F1-score of 1 for the 1-octanol class.
Similarly, STK, ADB, and KNN reach an F1-score of 1 for the 2-propanol class. All the
models except SGD attain a precision of 1 for the 1-octanol class; similarly, LDA,
LR, DT, GBC, ADB, and STK attain a precision of 1 for the 1-isobutanol class.

Fig. 3 ROC curves of all models: A SGD, B GNB, C QDA, D MLP, E LDA, F LR, G DT,
H GBC, I KNN, J ABC, K STK



Fig. 4 a TPR versus models, b FPR versus models, c F1-score versus models, d
precision versus models, e accuracy versus models, f accuracies of models

Figure 4 also presents the individual class accuracies of all the models. Compared
to the other classes, the 1-octanol and 1-isobutanol classes give good accuracy for
all the models, except for KNN and AdaBoost. STK has classified all the classes
accurately (100%) except 1-propanol and 1-isobutanol, at 98%. Figure 4 clearly shows
that the proposed model produces better accuracy than the other models in
identifying the target classes.

6 Conclusion

This study proposed an ensemble-based classifier for determining the response of QCM
sensors to five different alcohols, 1-octanol, 1-propanol, 2-butanol, 2-propanol,
and 1-isobutanol, and for classifying them. Various ML algorithms were also
considered for an effective analysis of classification performance. The QDA and MLP
classifiers produced the same results, with an accuracy of 0.77, and performed well
in categorizing the 1-octanol class. The LDA and LR classifiers performed well in
classifying the 1-octanol and 1-isobutanol classes, with LDA achieving an accuracy
of 0.83 and LR an accuracy of 0.90. The DT and GB classifiers performed well in
categorizing the 1-octanol class and, to a large extent, the other classes. KNN
performed well in classifying the 2-butanol and 2-propanol classes, but instances of
the 1-octanol class were misclassified as 1-propanol and 1-isobutanol. AdaBoost
performed well in categorizing the 1-octanol and 2-propanol classes. The proposed
model categorized the 1-octanol, 2-butanol, and 2-propanol classes correctly; one
1-isobutanol instance was misclassified as 1-propanol, giving an accuracy of 0.99.
This study shows that all the models except KNN can classify the 1-octanol class
precisely, the LDA and LR classifiers can also classify the 1-isobutanol class, and
AdaBoost can classify the 1-octanol and 2-propanol classes. Overall, the 1-octanol,
2-butanol, 2-propanol, and 1-isobutanol classes are categorized well, with
1-propanol being the most difficult. Among all these models, the proposed method
demonstrates the best performance by correctly classifying the alcohol classes. In
the future, a deeper study may be conducted on other properties of alcohol using
deep learning methods for various practical applications.

References

1. X. Zeng, X. Jin, Y. Huang, A. Mason, Multichannel monolithic quartz crystal microbalance


gas sensor array. Anal. Chem. 81(2), 595–603 (2009). https://doi.org/10.1021/ac8018697
2. A. Özmen, F. Tekce, M.A. Ebeoǧlu, C. Taşaltin, Z.Z. Öztürk, Finding the composition of gas
mixtures by a phthalocyanine-coated QCM sensor array and an artificial neural network. Sens.
Actuators, B Chem. 115(1), 450–454 (2006). https://doi.org/10.1016/j.snb.2005.10.007
3. B.K. Rao, P.S. Kumar, D.K.K. Reddy, J. Nayak, B. Naik, QCM sensor-based alcohol classi-
fication by advance machine learning approach. pp. 305–320 (2021). https://doi.org/10.1007/
978-981-15-8439-8_25
4. Y.C. Leung, D.H.F. Yip, W.W.H. Yu, An analogue ANN for classification of alcohol, in 1997
IEEE International Conference on Systems, Man, and Cybernetics. Computational Cyber-
netics and Simulation, vol. 4(852), pp. 4010–4015 (1997). https://doi.org/10.1109/ICSMC.
1997.633299
5. Q. Li, Y. Gu, N. Wang, Application of random forest classifier by means of a QCM-based
E-nose in the identification of Chinese liquor flavors. IEEE Sens. J. 17(6), 1788–1794 (2017).
https://doi.org/10.1109/JSEN.2017.2657653

6. H.M. Saraoğlu, B. Edin, E-nose system for anesthetic dose level detection using artificial neural
network. J. Med. Syst. 31(6), 475–482 (2007). https://doi.org/10.1007/s10916-007-9087-7
7. S. Palaniappan, N.A. Hameed, A. Mustapha, N.A. Samsudin, Classification of alcohol
consumption among secondary school students. JOIV Int. J. Inf. Vis. 1(4–2), 224 (2017).
https://doi.org/10.30630/joiv.1.4-2.64
8. J.P. Connor, M. Symons, G.F.X. Feeney, R.M. Young, J. Wiles, The application of machine
learning techniques as an adjunct to clinical decision making in alcohol dependence treatment.
Subst. Use Misuse 42(14), 2193–2206 (2007). https://doi.org/10.1080/10826080701658125
9. E. Ordukaya, B. Karlik, Fruit juice–alcohol mixture analysis using machine learning and elec-
tronic nose. IEEJ Trans. Electr. Electron. Eng. 11, S171–S176 (2016). https://doi.org/10.1002/
tee.22250
10. P.S. Kanna, R. Palaniappan, K.V.R. Ravi, Classification of alcohol abusers: an intelligent
approach, in Third International Conference on Information Technology and Applications
(ICITA’05), vol. 1, pp. 470–474 (2005). https://doi.org/10.1109/ICITA.2005.95
11. A. Fatih, P. Lieberzeit, P. Jarujamrus, N. Yumusak, UCI machine learning repository: alcohol
QCM sensor dataset data set (n.d.). Retrieved April 17, 2020, from http://archive.ics.uci.edu/
ml/datasets/Alcohol+QCM+Sensor+Dataset
12. M.F. Adak, P. Lieberzeit, P. Jarujamrus, N. Yumusak, Classification of alcohols obtained by
QCM sensors with different characteristics using ABC based neural network. Eng. Sci. Technol.
Int. J. 23(3) (2019). https://doi.org/10.1016/j.jestch.2019.06.011
13. N. Katardjiev, S. Mckeever, A. Hamfelt, A machine learning-based approach to forecasting
alcoholic relapses (2019)
14. Triyana et al., Chitosan-based quartz crystal microbalance for alcohol sensing. Electronics
7(9), 1–11 (2018). https://doi.org/10.3390/electronics7090181
15. A. Pisutaporn, B. Chonvirachkul, D. Sutivong, Relevant factors and classification of student
alcohol consumption, in 2018 IEEE International Conference on Innovative Research and
Development (ICIRD), pp. 1–6 (2018). https://doi.org/10.1109/ICIRD.2018.8376297
16. X. Zhu, X. Du, M. Kerich, F.W. Lohoff, R. Momenan, Random forest based classification of
alcohol dependence patients and healthy controls using resting state MRI. Neurosci. Lett. 676,
27–33 (2018). https://doi.org/10.1016/j.neulet.2018.04.007
17. L. Breiman, Stacked regressions. Mach. Learn. 24(1), 49–64 (1996). https://doi.org/10.1007/
BF00117832
18. M.J. van der Laan, E.C. Polley, A.E. Hubbard, Super learner. Stat. Appl. Genet. Mol. Biol. 6(1)
(2007). https://doi.org/10.2202/1544-6115.1309
19. D.H. Wolpert, Stacked generalization. Neural Netw. 5(2), 241–259 (1992). https://doi.org/10.
1016/S0893-6080(05)80023-1
20. G. Wang, J. Hao, J. Ma, H. Jiang, A comparative assessment of ensemble learning for credit
scoring. Expert Syst. Appl. 38(1), 223–230 (2011). https://doi.org/10.1016/j.eswa.2010.06.048
21. P. Suresh Kumar, H.S. Behera, J. Nayak, B. Naik, Bootstrap aggregation ensemble learning-
based reliable approach for software defect prediction by using characterized code feature.
Innov. Syst. Softw. Eng. 1–22 (2021). https://doi.org/10.1007/s11334-021-00399-2
A Novel Image Falsification Detection
Using Vision Transformer (Vi-T) Neural
Network

Manikyala Rao Tankala and Ch. Srinivasa Rao

Abstract Image classification and image recognition are significantly impacted by
the creation of Vision Transformers. It is no wonder that CNN architectures, which
utilize multiple layers and require a considerable number of hyper-parameters for
training, are much more complex; during training and inference, a CNN consumes
considerable resources. Alternatively, the Vision Transformer is an innovative neural
network architecture that takes in an image, breaks it into little patches, and then
uses an attention mechanism to search for correlations among the patches. For
localization by the attention mechanism, the transformers are trained on small visual
patches. With the introduction of image manipulation tools, much of the picture data
available on the Internet today is designed to fool consumers into thinking the image
is genuine, so many different neural networks must be utilized to authenticate
photographs. The main purpose of this research is to detect forgery and locate the
tampered regions of an image utilizing the transformer and attention mechanism
concepts. The work is implemented on two benchmark datasets using an RTX-3080
graphics card. Evaluation results obtained on these datasets are compared with
state-of-the-art techniques. During the training and validation process, a training
accuracy of 98% and a validation accuracy of 97% are achieved on the benchmark
datasets.

Keywords Image forgery · Copy-move image forgery detection · Spliced image forgery
detection · Attention mechanism · Localization of tamper regions · RTX-3080 graphic
card · Epochs

M. R. Tankala (B) · Ch. Srinivasa Rao


Department of Electronics and Communication Engineering, JNTU Kakinada, Kakinada 533001,
India
Ch. Srinivasa Rao
e-mail: chsrao.ece@jntukucev.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 667
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_50
668 M. R. Tankala and Ch. Srinivasa Rao

1 Introduction

With the invention of image editing software, any image can be modified to appear as
legitimate as the original. Image editing is now commonplace and highly relevant in the
digital era: Facebook, WhatsApp, Twitter, and Web pages carry plenty of modified
photographs, produced easily with image manipulation programs such as GIMP, Adobe
Photoshop, CorelDRAW, and Paint 3D. Copy-move forgery, splicing, and resampling are the
primary picture manipulation methods. These approaches fall under pixel-based image
forgery detection [1, 2] techniques, a fundamental and versatile way of spotting
counterfeits among images. Techniques including zooming, resampling, rotation, noise
addition, and stitching are commonly employed in picture editing to make the result
match the user's intent more closely. Splicing blends two distinct photographs, one
taken of each prospect, into a composite shot that is presented as an original.
Copy-move forgery copies a targeted region and pastes it elsewhere within the same
image. Resampling changes the picture resolution (and hence the resolution of objects
inside the picture) with respect to the sampling frequency and compression standard.
As a result, we require new methods for image forgery detection. The field of picture
forensics is concerned with establishing the authenticity of photographs and
safeguarding users' rights. Figure 1 illustrates the methods for detecting picture
forgeries. There are primarily two methods. The first is the active technique, which
embeds a digital signature at the transmission end and retrieves it at the receiver's
end via specified hardware; digital watermarking schemes belong to this approach. It is
a rather time-consuming method that requires specialized processing across sequential
stages. The second strategy, commonly referred to as the passive technique, infers
forgery directly from image content. It comprises three forgery-detection techniques and
can be evaluated for forgery findings through software programming on a collection of
photographs that includes both real and altered images. Additionally, this approach is
easily configurable, owing to well-designed algorithms and well-implemented evaluation
datasets. By contrast with the active approach, a passive approach does not require a
watermark or signature to establish authenticity.

Fig. 1 Classification of image forgery detection methods
Figures 2 [3], 3 [3], and 4 [4] illustrate copy-move and splicing forgery approaches
obtained after post-processing attacks on the benchmark datasets [3, 4]. Compared with
actual photographs from the benchmark datasets, these images appear genuine. Figure 4
shows a spliced image forgery obtained by joining two different images to form a
composite image.

Fig. 2 Original picture (left) and forged image (right) obtained after manipulation
(image courtesy of the Frith dataset [3])

Fig. 3 Original picture (left) and forged image (right) manipulated with a
post-processing technique (image courtesy of the Frith dataset [3])

Fig. 4 Forged image manipulated with post-processing techniques (left) and original
picture (right) indicating splicing image forgery (courtesy CASIA image manipulation
dataset [4])

Convolutional neural networks (CNN) have been extensively preferred for a few years over
the classic detection approaches. Computer vision tasks such as classification and
localization are well handled by deep learning, including networks such as ResNets,
Fast-RCNN, generative adversarial networks (GAN), sparse encoders, and YOLOv4. The
drawback of CNNs is that they consume considerable computational resources and time
during convolution on images, and many deep layers must be trained for image
classification tasks. Methods like transfer learning are well suited to achieving a
required classification task, but they again require extensive training on both training
and testing datasets.
To overcome the problem of extensively training many convolutional layers and feature
maps, this paper proposes a new Vision Transformer-based neural network for image
forgery detection and localization of forged areas. The primary advantage of the Vision
Transformer is that it does not involve convolutional filters, which are a vital
component of many deep learning algorithms for image classification and image
recognition. This paper primarily focuses on tamper classification and localization of
forged areas using the Vision Transformer and attention networks. Since the proposed
method employs no convolutional filters during tamper classification, the approach
offers a new direction for image falsification detection and for distinguishing real
from fake images.

2 Related Work

There are two broad approaches to image forgery detection and classification [5], and
Sects. 2.1 and 2.2 describe their implementation.

2.1 Traditional Approach

In the traditional approach, the picture was divided region-wise or block-wise, and
features extracted from the sample image were compared with the original image features,
using a keypoint technique or a block-based strategy to obtain keypoint pairs. For
similarity verification, the distance between key pairs was calculated and used to label
a picture as tampered or authentic. Finally, a classifier was utilized to provide a
confusion matrix for the assessment. This technique is an old way of finding
correlations between real and morphed images; later, hybrid methods extracting
characteristics such as the local binary pattern (LBP), Fourier descriptors, and HSV
color moments were used. These hybrid features are trained in an image classification
learning system, and outcomes are evaluated in terms of accuracy and F1-score. To
determine the performance of any learnt model, precision versus recall is also compared,
along with the true positive versus false positive rate. To verify the validity of an
image, both block-based and keypoint techniques use features such as SURF, PCA, DWT, and
DCT. The traditional approach looks for hidden patterns and features in images to train
a classifier accordingly, and it performs well only on small image datasets. It also
takes more time for computing and testing forged images and will not work on new data
unless trained on large datasets. To overcome these problems, deep learning algorithms
have been proposed and are compared with traditional approaches for implementation,
localization, and testing on images.
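The block-based matching idea described above can be sketched in a few lines. The sketch
below is an illustration of the pipeline only: it uses raw pixel bytes as the block
feature and exact matching instead of SURF/DCT descriptors and similarity thresholds.

```python
import numpy as np

def block_copy_move(img, block=4):
    """Naive block-based copy-move check: slide non-overlapping blocks,
    hash each block's pixel values, and report coordinate pairs of
    identical blocks as duplicate candidates."""
    h, w = img.shape
    seen = {}
    matches = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            key = img[y:y + block, x:x + block].tobytes()
            if key in seen:
                matches.append((seen[key], (y, x)))
            else:
                seen[key] = (y, x)
    return matches

# Toy image: copy the 4x4 region at (0, 0) onto (8, 8)
img = np.arange(256, dtype=np.uint8).reshape(16, 16).copy()
img[8:12, 8:12] = img[0:4, 0:4]
print(block_copy_move(img))  # [((0, 0), (8, 8))] -- the pasted block matches its source
```

A real block-based detector would additionally sort feature vectors lexicographically
and apply a distance threshold so that near-duplicates (after compression or noise) are
also found.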

2.2 Deep Learning Approach

In this approach, CNNs are used extensively: the given dataset is supplemented with data
augmentation techniques, then passed through layers of neural nets to extract low-level
features and finally through a softmax function that turns activations into probability
values for the class of interest. This method is effective for classifying [6] real and
altered images with improved accuracy, but CNNs demand substantial computing power and
data. Deep learning approaches such as RNNs, ResNets, sparse networks, and CNN-LSTMs
perform well at picture classification, but they require a lot of training on a given
dataset.
Compared with single-task fully convolutional network (SFCN) approaches, the multi-task
fully convolutional network (MFCN) presented by Salloum et al. [7] performs well in
locating tampered regions. Amerini et al. [8] used a multi-domain CNN strategy to
localize double-JPEG compression. They examined both the spatial and frequency domains,
together with fully connected layers, to obtain higher accuracy on the UCID dataset. The
frequency-domain CNN takes the DCT coefficients of each patch as input and consists of
two convolutional layers, two pooling layers, and three fully connected layers. The
multi-domain CNN concatenates the outputs of the fully connected layers of these two
networks and classifies the patch into one of three classes: uncompressed, singly
compressed, or doubly compressed [6].

3 Proposed Method

In this research work, we propose a novel method for image forgery detection using
Vision Transformers that is robust in detecting tampered areas and in classifying images
as real or morphed.
As shown in the methodology of Fig. 5, the basic steps for image forgery classification
and localization are as follows. The first is preprocessing, in which a preprocessed
image dataset is fed into the Vision Transformer. The image is then divided into smaller
patches, and correlations among the patches are found with attention-based networks.
Next, evaluation metrics are generated and plotted against the number of training
epochs. The final phase locates the image tamper region and predicts the label of a test
picture from the dataset.
Figure 6 presents an overview of the model. The standard transformer receives a 1D
sequence of token embeddings. To process 2D images, we reshape an image of shape (H, W,
C) into a sequence of N = HW/P² flattened 2D patches of size P × P. The transformer
employs a constant latent vector size D through all of its layers; thus, we flatten the
patches and map them to D dimensions via a trainable linear projection. We call the
output of this projection the patch embedding.
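This patchify-and-project step can be sketched in NumPy as follows; the random matrix
`W_proj` stands in for the trainable linear projection, and the sizes (224 × 224 image,
16 × 16 patches, D = 768) follow the configuration used later in the paper.

```python
import numpy as np

def patch_embed(img, P, W_proj):
    """Split an (H, W, C) image into N = (H/P)*(W/P) patches of size
    P x P x C, flatten each, and project to D dims: (N, P*P*C) @ (P*P*C, D)."""
    H, W, C = img.shape
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)    # (H/P, W/P, P, P, C)
                  .reshape(-1, P * P * C))     # (N, P*P*C) flattened patches
    return patches @ W_proj                    # (N, D) patch embeddings

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
P, D = 16, 768
W_proj = rng.standard_normal((P * P * 3, D)) * 0.02  # stand-in for learned weights
tokens = patch_embed(img, P, W_proj)
print(tokens.shape)  # (196, 768) -- 196 tokens, matching the sequence length used here
```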
The core components of the transformer are multi-head self-attention (MSA), the
multi-layer perceptron (MLP), and layer normalization (LN). The multi-head
self-attention module linearly transforms its input X ∈ R^(n×d) into three parts:
queries Q ∈ R^(n×d_k), keys K ∈ R^(n×d_k), and values V ∈ R^(n×d_v), where n is the
sequence length and d, d_k, d_v are the dimensions of the inputs, queries (keys), and
values, respectively. Scaled dot-product attention is then applied to Q, K, and V as in
Eq. (1).

Fig. 5 Block diagram of proposed method using Vision Transformers (ViT)



Fig. 6 Vision transformer base model (ViT base) block diagram

 
Attention(Q, K, V) = Softmax(QK^T / √d_k) V        (1)

A linear layer is employed to create the output. Multi-head self-attention splits the
queries, keys, and values into separate heads, applies attention to each head in
parallel, and then concatenates and projects the results of all heads to generate the
final output.
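Equation (1) and the head split-and-concatenate procedure can be sketched as follows.
This NumPy illustration omits the learned Q/K/V and output projections, so it is a toy
version of MSA rather than the full module.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Eq. (1): Softmax(Q K^T / sqrt(d_k)) V."""
    dk = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(dk)) @ V

def multi_head(X, h):
    """Toy MSA: split the model dimension into h heads, self-attend per
    head in parallel, then concatenate the heads back together."""
    n, d = X.shape
    parts = X.reshape(n, h, d // h).transpose(1, 0, 2)  # (h, n, d/h)
    out = attention(parts, parts, parts)                # attention per head
    return out.transpose(1, 0, 2).reshape(n, d)         # concatenate heads

X = np.random.default_rng(1).standard_normal((196, 768))  # 196 patch tokens
Y = multi_head(X, h=12)                                   # 12 heads, as in ViT-Base
print(Y.shape)  # (196, 768)
```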

MLP(X ) = FC(σ (FC(X ))) (2)

where FC(X ) = X · W + b (3)

Feature transformation and nonlinearity are applied between self-attention layers using
the MLP. Equations (2) and (3) describe the output of the multi-layer perceptron (MLP)
in terms of the parameters W, X, and b, where b is the bias, W the weight matrix, X the
input, and σ the activation function.
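Equations (2) and (3) translate directly into code; the NumPy sketch below uses ReLU as
a stand-in for σ and the ViT-Base sizes (d = 768, MLP size 3072) with random weights.

```python
import numpy as np

def fc(X, W, b):
    """Eq. (3): FC(X) = X . W + b"""
    return X @ W + b

def mlp(X, W1, b1, W2, b2):
    """Eq. (2): MLP(X) = FC(sigma(FC(X))), with sigma = ReLU here."""
    return fc(np.maximum(fc(X, W1, b1), 0.0), W2, b2)

rng = np.random.default_rng(2)
d, d_ff = 768, 3072                       # hidden size and MLP size of ViT-Base
X = rng.standard_normal((196, d))
W1, b1 = rng.standard_normal((d, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)) * 0.02, np.zeros(d)
out = mlp(X, W1, b1, W2, b2)
print(out.shape)  # (196, 768) -- the MLP preserves the token dimension
```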

Algorithm: Image forgery detection using Vi-T

Input: Image forgery training dataset {P_i, q_i}, i = 1 to n
Output: Predictions for the test samples in the forgery dataset
Step 1: Set the batch size bs = 2048, epochs = 100 or 200, the Adam optimizer
(learning rate = 0.001), and image dimension 64 or 128
Step 2: Set the number of mini-batches nb = n/bs
Step 3: for each epoch
    3.1 for batch = 1 to nb
        (a) choose a batch from the training dataset
        (b) train the model on both original and fake images by minimizing the
            cross-entropy loss
        (c) back-propagate the loss
        (d) update the model parameters
Step 4: Classify the test images
______________________________________________________________

The proposed algorithm is expressed as shown above. We utilized the ViT base model, with
the same model-parameter settings in all tests. The model has 12 encoder layers with 12
attention heads in each, 768-dimensional embeddings, and a 3072-dimensional feed-forward
sub-network. For fine-tuning, we used a model pre-trained on ImageNet-21k and then
fine-tuned on ImageNet-1k. For the last 30 iterations, we ran the learning algorithm
with a batch size of 2048, entirely on the image forgery data, optimized with Adam at a
learning rate of 0.001. The sequence length was 196 tokens, with a fixed image size of
224 × 224 pixels and a fixed patch size of 16 × 16 pixels. For accuracy comparison, we
assessed standard overall accuracy (OA): the number of correctly identified photographs
divided by the total number of images.
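The mini-batch loop of the algorithm above (choose a batch, minimize cross-entropy,
back-propagate, update) can be illustrated end to end with a simple softmax classifier
standing in for the full ViT. The data, sizes, and plain gradient-descent update below
are toy assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, classes, bs, lr = 512, 64, 2, 128, 0.1   # toy sizes; paper uses bs=2048, lr=0.001
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, classes))
y = (X @ W_true).argmax(axis=1)                # synthetic "real vs fake" labels
W = np.zeros((d, classes))                     # classifier weights to learn

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(50):                        # Step 3: loop over epochs
    for b in range(n // bs):                   # Step 3.1: loop over mini-batches
        xb = X[b * bs:(b + 1) * bs]            # (a) choose a batch
        yb = y[b * bs:(b + 1) * bs]
        p = softmax(xb @ W)                    # (b) forward pass
        p[np.arange(bs), yb] -= 1              # (c) gradient of cross-entropy loss
        W -= lr * (xb.T @ p) / bs              # (d) update the model parameters

acc = ((X @ W).argmax(axis=1) == y).mean()     # Step 4: classify
print(f"training accuracy: {acc:.2f}")
```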
The overall architecture in Fig. 6 is termed the Vision Transformer (ViT), and the
process comprises the following steps [9, 10].
1. Split an image (real or morphed) from the dataset into patches and flatten them.
2. Project the flattened patches into lower-dimensional linear embeddings.
3. Add positional embeddings.
4. Feed the sequence as input to a standard transformer encoder.
5. Pre-train the model with image labels (fully supervised on a big dataset).
6. Fine-tune on the downstream dataset for picture classification.
The building blocks of the transformer [9] are scaled dot-product attention units. The
transformer encoder is shown in Fig. 7. When an image is routed through a transformer
model, attention weights are calculated between every pair of tokens simultaneously. The
attention unit generates a contextual embedding for each token that contains information
on the token itself as well as a weighted combination of the other relevant tokens, each
weighted according to its attention weight.

Fig. 7 Transformer encoder block used in Vi-T base for image forgery detection

Image patches are equivalent to the sequence of tokens used in natural language
processing to represent sentences. Through the transformer layers, encoding and training
of the model take place to obtain the position embedding values; the output then passes
through multi-head attention and on to the MLP. The ViT model [10] is trained on both
original and fake images, and labeling is performed once the encoding process is
completed by the transformer block. Finally, fine-tuning is carried out. The model
contains 12 layers, a hidden size D of 768, an MLP size of 3072, 12 heads, and a total
of 86 M parameters.

4 Experimental Test Bench

All experiments on the two datasets were run on a Dell desktop with an Intel Core i7-940
(2.93 GHz) quad-core processor, 64 GB RAM, and an NVIDIA GeForce RTX-3080 Ti GPU. The
code was implemented in TensorFlow, a free and open-source deep learning library written
in the Python programming language. The seaborn library was used to create the
visualizations, and attention maps were utilized to find the locations of tampered areas
based on the trained model. TensorBoard was used to plot and visualize all graphs
showing training,

Table 1 Datasets used for evaluation

S. No. | Dataset | Proposed method | Image type | No. of images | Training accuracy (%) | Validation accuracy (%)
1 | Media Integration and Communication Center laboratory, MICC-MF2000 | Vision Transformer (ViT)-based image classification | 2048 × 1536, JPEG format, copy-move image forgery | 2000 | 98 | 97
2 | CASIA 2 | Vision Transformer (ViT)-based image classification | 240 × 160 to 900 × 600, JPEG/TIFF/BMP format, copy-move image forgery | 12,614 | 90 | 64
testing, and loss curves of the model. Table 1 lists the forgery benchmark datasets used
for classification and forgery localization with the Vision Transformer [11]. The ReLU
function is used in the multi-layer perceptron for updating the parameters during
backpropagation; it does not suffer from the vanishing-gradient problem, which is an
advantage in neural networks. ReLU is an activation function that passes positive inputs
through unchanged and outputs zero otherwise.
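The behavior of ReLU just described can be verified in a couple of lines (a NumPy
illustration):

```python
import numpy as np

def relu(x):
    # max(0, x): passes positive inputs through, zeroes out the rest
    return np.maximum(x, 0.0)

out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
print(out)  # zeros for the non-positive inputs, identity for the positive ones
```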

5 Results

On the MICC-MF2000 dataset, the model achieved good training and validation accuracy
over 200 training steps, and the final loss proves satisfactory for image
classification-based forgery detection. Figure 8 indicates the precision and recall matrices

Fig. 8 Precision matrix and recall matrix obtained on MICC-MF 2000 dataset

Fig. 9 Training and validation accuracies versus epochs on MICC-F2000 dataset

obtained on the original and predicted classes. Figure 9 plots the training and
validation accuracies against epochs.
MICC-MF2000 is a 2000-image dataset regarded as a benchmark for detecting picture
forgeries. The Vision Transformer performed well on this dataset and generalized well to
fresh data during the forgery detection testing phase. On the well-known CASIA 2
dataset, the suggested technique also performed well during the training phase but
achieved lower validation accuracy than previous deep neural networks. Localization,
too, is best accomplished through the attention mechanism.
On the CASIA 2 dataset, the model achieved good training accuracy and average validation
accuracy over 100 training steps, and the final loss proves satisfactory for image
classification-based forgery detection. Figure 10 indicates the precision and recall
matrices obtained on the original and predicted classes. Figure 11 shows graphical results

Fig. 10 Precision matrix and recall matrix obtained on CASIA 2 dataset



Fig. 11 Training loss and validation loss versus total epochs on MICC-MF2000 dataset

Table 2 Comparison of proposed model in terms of parameters

S. No. | Model | Parameters | Trainable | Non-trainable
1 | Vision Transformer (ViT)-b16 model summary | Total params: 85,660,416 | Trainable params: 85,660,416 | Non-trainable params: 0
2 | ViT model's compiled summary on image forgery dataset | Total params: 87,005,634 | Trainable params: 87,005,634 | Non-trainable params: 0
3 | ViT model hyper-parameters used for 100 and 200 epochs on the datasets | Optimizer = "Adam", Loss = "categorical cross-entropy", Metrics = ["accuracy"], Batch size = 2048, Activation = "ReLU" | – | –

obtained for training and validation loss against total epochs. Table 2 lists the
trainable and non-trainable parameters of the two Vision Transformer models obtained
using ViT-b16 model weights for the two image forgery datasets used in the evaluation,
together with the hyper-parameters used for training and the summary obtained after
model compilation on MICC-MF2000 and CASIA 2. Epoch loss and accuracies are shown in
Figs. 9 and 11.

5.1 Localization Using Attention Networks

Previously, the initial layers of a CNN model were used for visualization, but in recent
times a well-trained model yields smooth, interpretable filters for visualization. A
well-trained Vision Transformer can perform localization [7] based on attention mapping,
where a distance is evaluated by multiplying attention weights with pixel values to
produce the required result. Figure 12 shows the localization [12] of a tampered region
based on the ViT model weights and the pixel values of the desired class used in
training. The picture is taken from the tamper dataset to find the forged areas in the
test image. Attention mapping is used in the present research work to identify tampered
areas in forged images using the ViT model. While making predictions for a given image,
two classes were taken into account, real and morphed, and mapping was done using the h5
format of the trained model. During this process, the image is resized to 64 × 64 and
classified to find its class.

Fig. 12 Forged image (left) and localization using attention networks (right)
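A simplified sketch of this attention-based localization follows: a single query (a
stand-in for the class token) attends over the patch tokens, and the resulting weights
are reshaped into a heat map over the patch grid. Real implementations typically
aggregate attention across all 12 layers and heads rather than using one layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cls_attention_map(tokens, grid):
    """Attention of a class-like query (mean token here) over the patch
    tokens, reshaped into a 2D heat map for overlaying on the image."""
    q = tokens.mean(axis=0, keepdims=True)                # stand-in class query
    w = softmax(q @ tokens.T / np.sqrt(tokens.shape[1]))  # (1, N) attention weights
    return w.reshape(grid, grid)                          # patch-grid heat map

rng = np.random.default_rng(4)
tokens = rng.standard_normal((16, 32))  # 4x4 patch grid, 32-dim tokens (toy sizes)
heat = cls_attention_map(tokens, 4)
print(heat.shape)  # (4, 4) heat map; the softmax weights sum to 1
```

Upsampling this patch-level heat map to the image resolution gives the kind of overlay
shown in Fig. 12.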

6 Conclusion and Future Scope

The results and the performance metrics plotted against multiple benchmark datasets,
covering tamper types such as copy-move forgery and spliced picture forgery, provide a
complete evaluation of the Vision Transformer (ViT) approach. The MICC-F2000 dataset
achieved 97% validation accuracy, whereas the CASIA 2 dataset achieved 64% validation
accuracy on test images. Localization of forged areas is further achieved with attention
networks using the model. In the future, the work will be expanded to identify
photographs using deep learning methods for tamper detection while employing
cloud-computing GPU resources.

Compliance with Ethical Standards Conflict of Interest: The authors declare that they have no
conflict of interest.

References

1. A.H. Saber, M.A. Khan, B.G. Mejbel, A survey on image forgery detection using different
forensic approaches. Adv. Sci. Technol. Eng. Syst. J. 5(3), 361–370 (2020)
2. M.A. Qureshi, M. Deriche, A bibliography of pixel-based blind image forgery detection
techniques. Signal Process.: Image Commun. 39, 46–74 (2015)
3. K. Asghar, X. Sun, P.L. Rosin, M. Saddique, M. Hussain, Z. Habib, Edge-texture feature based
image forgery detection with cross-dataset evaluation. Mach. Vis. Appl. 30(7–8), 1243–1262
(2019). https://doi.org/10.1007/s00138-019-01048-2
4. J. Dong, W. Wang, T. Tan, CASIA image tampering detection evaluation database, in 2013
IEEE China Summit and International Conference on Signal and Information Processing,
pp. 422–426 (2013). https://doi.org/10.1109/ChinaSIP.2013.6625374
5. A. Jegorowa et al., Deep learning methods for drill wear classification based on images of holes
drilled in melamine faced chipboard. Wood Sci. Technol. 55(1), 271–293 (2021)
6. Z.J. Barad, M.M. Goswami, Image forgery detection using deep learning: a survey, in 2020
6th International Conference on Advanced Computing and Communication Systems (ICACCS)
(2020). https://doi.org/10.1109/ICACCS48705.2020.9074408
7. R. Salloum, Y. Ren, C.-C. Jay Kuo, Image splicing localization using a multi-task fully
convolutional network (MFCN). J. Vis. Commun. Image Representation 51, 201–209 (2018)
8. I. Amerini, T. Uricchio, L. Ballan, R. Caldelli, Localization of JPEG double compression
through multi-domain convolutional neural networks (2017). https://doi.org/10.1109/CVPRW.
2017.233
9. A. Dosovitskiy et al., An image is worth 16 x 16 words: transformers for image recognition at
scale. arXiv preprint arXiv:2010.11929 (2020)
10. S. Paul, P.-Y. Chen, Vision transformers are robust learners. arXiv preprint arXiv:2105.07581
(2021)
11. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł Kaiser, I. Polosukhin,
Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
12. E.W. Teh, M. Rochan, Y. Wang, Attention networks for weakly supervised object localization.
BMVC (2016)
Design of Intelligent Framework
for Intrusion Detection Platform
for Internet of Vehicles
Ch. Ravi Kishore, D. Chandrasekhar Rao, Janmenjoy Nayak,
and H. S. Behera

Abstract Technology is moving toward more intelligent and connected devices in this
digital world, and the transport system is no exception. Vehicles are becoming
intelligent and autonomous through the use of sensors and communication techniques.
Progress in IoT has given rise to a new field in transportation and vehicular networks,
namely the Internet of Vehicles (IoV). As a result of their vulnerability, autonomous
vehicles and the IoV are susceptible to several cyber-attacks, including flooding and
fuzzing as well as replay assaults. This paper focuses on improving the accuracy of the
intrusion detection system while reducing its computational cost. In real life, network
data are generally imbalanced; this challenge is addressed by using random oversampling
and the Synthetic Minority Oversampling Technique (SMOTE) to create more samples of the
minority classes. The proposed adaptive boosting ensemble learning-based model shows a
significant improvement over state-of-the-art methods when simulated with the generated
balanced Controller Area Network (CAN) bus dataset. Performance metrics such as the
F1-score, used here because of the imbalanced nature of the data, demonstrate the
efficiency of the proposed approach.

Keywords Internet of Vehicles · CANBus · Intrusion detection · SMOTE

1 Introduction

Vehicles today can have up to 70 electronic control units (ECUs), all of which are
interconnected and communicate through specialist automotive bus systems such as

Ch. R. Kishore (B) · D. C. Rao · H. S. Behera


Department of Information Technology, Veer Surendra Sai University of Technology, Burla,
Sambalpur, Odisha 768018, India
D. C. Rao
e-mail: dcrao_it@vssut.ac.in
J. Nayak
Department of Computer Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University,
Baripada, Odisha 757003, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 681
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_51
682 Ch. R. Kishore et al.

CAN, MOST, or FlexRay. As a result, this increased connectivity and functionality is
accompanied by increased exposure and vulnerability [1]. Attackers may attempt to gain
access to the automotive network to insert messages, modify data, or obtain sensitive
information, as shown in Fig. 1. One of the vehicle's numerous external connections, for
example, might be exploited to inject a malicious packet, causing the vehicle to
malfunction [2].

Fig. 1 CAN bus attack interfaces
Intelligent transportation systems allow intelligent vehicles to communicate with the
road and with other road users. This communication system for smart vehicles is referred
to as vehicle-to-everything (V2X) or the vehicular ad-hoc network (VANET). The VANET
communication system for smart vehicles comprises five primary types of communication:
vehicle-to-road (V2R), vehicle-to-vehicle (V2V), vehicle-to-sensor (V2S),
vehicle-to-human (V2H), and vehicle-to-infrastructure (V2I) [3]. The V2I concept covers
roadside infrastructure such as location sensors and traffic monitoring systems as well
as smart vehicles. V2V refers to a smart vehicle communicating with other smart vehicles
on the road to share information. In the event of a cyberattack on these communication
systems, the risks to security and privacy could increase significantly.
With the emergence of new communication technologies in the realm of wireless
and mobile communication, helped by advances in the Internet of Things (IoT), the
Internet of Vehicles (IoV) has emerged as a new potential paradigm. Throughout the
city, sensors are installed in cars and other vehicles, as well as in roadside devices
and fixed infrastructure to collect and process crucial information. This data is then
securely transmitted between smart vehicles, allowing for real-time control and direc-
tion to be provided. IoV is still in its infancy, and only a small amount of research
has been made public [4].
The IoV, an ad-hoc network designed for vehicles, has evolved from the classic
vehicle-to-vehicle ad-hoc network; it is a network of the different entities involved in
road transportation, such as humans, vehicles, roads, parking lots, and city
infrastructure, that allows real-time communication between them [5]. Security breaches
in an IoV system might have major consequences for people, automobiles, roads, and other
infrastructure. Faulty or deceptive information puts automobiles as well as human lives
at risk, and hackers or intruders can take over the car's systems and cause chaos and
accidents [6].
AVs are susceptible to attacks on both internal and external communication. On the CAN,
all ECUs in a vehicle communicate with one another via bus communication. Because of its
effective error-detection mechanism for steady transmission [7], the CAN minimizes
wiring cost, weight, and complexity. If the CAN bus is compromised, however, all ECUs
are vulnerable to a variety of attacks, because every ECU communicates over this
network. If malicious messages are injected into the network traffic or hostile attacks
are conducted, the nodes will execute them without confirming their origin. Message
injection attacks on the CAN bus can thus be divided into three kinds: for example, a
denial-of-service (DoS) attack or spoofing can be used to take over resources or send
destructive information, such as gear and RPM data, while fuzz attacks cause unwanted
states or malfunctions by injecting arbitrary messages into the CAN bus [8].
The distinctive characteristics of the CAN bus leave it vulnerable because security was
not considered in its design. Many injection attacks, such as flooding, fuzzing,
spoofing, replay, and bus-off attacks, exploit CAN bus vulnerabilities [9]. The CAN bus
has no authentication of source and destination addresses, so injected data are
processed by normal ECUs without verification. A flooding attack delays the CAN messages
of normal ECUs by sending a burst of messages with the highest priority (CAN ID 0x000).
A fuzzing attack injects random CAN IDs and data. A spoofing attack controls certain
vehicle functions by setting a specific CAN ID and data field. A replay attack injects
normal CAN bus traffic that was collected during driving.
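As a rough illustration, a flooding attack of the kind described above can be flagged by
counting the share of highest-priority frames (CAN ID 0x000) in a sliding window of bus
traffic. The window size, threshold, and frame representation below are illustrative
assumptions, not values from this paper.

```python
from collections import deque

def flooding_monitor(frames, window=10, threshold=0.5):
    """Flag positions where high-priority frames (CAN ID 0x000) dominate
    the last `window` frames. `frames` is a list of (can_id, data) tuples."""
    recent = deque(maxlen=window)
    alerts = []
    for i, (can_id, _data) in enumerate(frames):
        recent.append(can_id)
        if len(recent) == window:
            share = sum(1 for cid in recent if cid == 0x000) / window
            if share >= threshold:
                alerts.append((i, share))
    return alerts

normal = [(0x1A0, b"\x01"), (0x2B0, b"\x02")] * 10
attack = [(0x000, b"\x00")] * 10               # burst of highest-priority frames
print(flooding_monitor(normal))                # [] -- no alert on normal traffic
print(flooding_monitor(normal + attack))       # alerts once 0x000 floods the window
```

A learned IDS, as proposed in this paper, replaces such hand-tuned thresholds with a
classifier trained on labeled CAN traffic.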
We have set our research focus on two hypothetical situations. The first issue to be
addressed is data imbalance, which is prevalent in network data; the second is that the
IDS should be constructed with the help of ML algorithms. In this paper, we examine the
security mechanism for detecting malicious attacks on the CAN bus, namely the intrusion
detection system, which can be viewed as a classification problem. To create an
intelligent IDS for the CAN bus, we evaluated tree-based machine learning algorithms
together with traditional machine learning models (Decision Tree (DT), Random Forest
(RF), AdaBoost, Extra Tree, HG Boost, XGBoost, CatBoost, Gradient Boost, Gaussian Naive
Bayes (NB), and K-Nearest Neighbor (KNN)). An intelligent intrusion detection system
should be both highly accurate and inexpensive to run. Our approach, which improved
detection accuracy, uses a multiclass adaptive boosting ensemble learning model.
The rest of this paper is organized as follows: Sect. 2 describes related works,
Sect. 3 explains the proposed intrusion detection system, Sect. 4 describes the
performance evaluation, and Sect. 5 concludes the research work.
684 Ch. R. Kishore et al.

2 Related Works

Different approaches are used to detect an assailant who is already inside a vehicle.
The CAN is the industry standard for in-vehicle communication between the
ECUs of cyber-physical systems in automobiles. It is unfortunate that
despite its excellent reliability, the CAN bus lacks any built-in security mechanisms
to defend against unauthorized external tampering. A hacker with access to one ECU
can easily take control of the vehicle’s other vital cyber-physical systems as a result
of this. This might be performed using a fuzzing attack or a basic denial-of-service
(DoS) assault, broadcasting falsified commands over the network to gather the essential
information.
Tejasvi et al. [10] used vehicular networks along with deep learning and edge
computing. To protect the information infrastructures’ information veins, the authors
developed a deep learning framework and associated edge computing devices,
and two classification techniques namely the Coarse-grained classification method
(CGCM), Fine-grained classification method (FGCM) were developed to classify
the misbehaving vehicle in the network. Zhou et al. [11] proposed the concept of the
Deep Neural Network (DNN) which uses CAN bus sequencing and features to pull
out special CAN bus sequences for training. The most important thing in this stage
of training is that the weight parameters are shared. The three random CAN bus
data sequences are then imported into the
triplet loss network, and these sequences are used to calculate the similarity between
them.
Entropy-based intrusion detection for in-vehicle networks was proposed by
Micheal et al. [12]. An intrusion detection system is programmed using the entropy
of a vehicle’s network. But these tests showed that it was difficult to detect small-
scale attacks that could be the result of typical vehicle or user behaviors. Kang
et al. [13] proposed a novel intrusion detection using a deep neural network. Their
system includes a monitoring module and a profiling module. The monitoring module
predicts whether the incoming CAN message is an attack or not based on the trained
features of known attacks.
A deep convolutional neural network-based intrusion detection system was devel-
oped by Song et al. [14] to protect the vehicle’s CAN. The Deep Convolutional
neural network (DCNN) learns network traffic patterns and detects malicious activity
without the need for any custom-designed features. They utilized Inception-ResNet
architecture to learn the sequential patterns between the in-vehicle network and traffic
data.

3 Proposed Intrusion Detection System

Building an IDS system begins with gathering enough information on network traffic,
both regular and atypical, as a result of a variety of attacks. The most prevalent
Design of Intelligent Framework for Intrusion Detection … 685

CAN bus threats are message injection attacks; hence, collecting data from CAN
messages/frames is the first step in building an IDS for CAN bus intrusion. For
these attacks, the most important aspects are the CAN IDs and the data fields.
It is critical to detect intrusions in CAN bus on time. The vehicle’s ECUs will
generate a large amount of data, making detection of an intrusion more difficult. As
a result, the data must be compressed. Network data is often class-imbalanced, and
attack-labeled examples are often insufficient because real networks are usually in
a normal state. One way to obtain more balanced data is to generate additional samples
for the minority classes in which insufficient data exists. This concept is shown in Fig. 2.
This issue is dealt with by combining two approaches: reducing the majority class and
oversampling the minority class with SMOTE [15]. Because the features of a replay attack
are similar to those of normal traffic, replay is frequently misclassified as normal. To
address this problem, the proposed method ignores the messages' timestamps: repeated
message data is identical even when timestamps differ, so dropping the timestamp shrinks
the dataset, reduces the computational time of the system (a major concern for a CAN IDS),
and balances the normal and replay classes.
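The timestamp-ignoring reduction described above can be sketched in a few lines. The record layout (timestamp, CAN ID, payload, label) is an assumption for illustration, not the dataset's exact schema: two records that differ only in their timestamps collapse to one entry.

```python
# Deduplicate CAN records while ignoring the timestamp field.
def drop_timestamp_duplicates(records):
    seen = set()
    kept = []
    for ts, can_id, data, label in records:
        key = (can_id, data, label)   # record identity without the timestamp
        if key not in seen:
            seen.add(key)
            kept.append((ts, can_id, data, label))
    return kept

records = [
    (0.001, 0x2B0, b"\x11", "Normal"),
    (0.005, 0x2B0, b"\x11", "Normal"),   # same frame, later timestamp
    (0.009, 0x000, b"\x00", "Flooding"),
]
print(len(drop_timestamp_duplicates(records)))  # 2
```

Applied to the normal class, this kind of pass is what reduces 7,808,258 records to 1,152,376 in Table 1.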
We implemented our proposed method (adaptive boosting) on the preprocessed
dataset obtained after applying SMOTE. A decision tree is used as the base classifier,
and since we have a multiclass problem, the proposed method is used to boost the
performance of this base classifier. The proposed method predicts the intrusion
types from the N base classifiers constructed on the training data; base classifiers are
added sequentially, each trained on the weighted samples, and the error is observed.
This process continues until the finest model, giving the best performance, is
obtained. The final intrusion detection categories are determined using the weighted
average of the base classifier predictions.
Algorithm for the proposed adaptive-boosting-based model for intrusion detection
on the CAN bus dataset:
(1) Initialize the weights [Eq. (1)] of each S_i ∈ S, where S is the set of all M samples:

    w_i = 1/M,  i = 1, 2, 3, ..., M                                           (1)

(2) For m = 1 to N:

    (a) Fit a classifier Z_m(x) to the training data using the weights w_i.
    (b) Compute the prediction error [Eq. (2)]:

        Err_m = ( Σ_{i=1}^{M} w_i · I(y_i ≠ Z_m(x_i)) ) / ( Σ_{i=1}^{M} w_i )  (2)

    (c) Estimate the weighting factor on the basis of the prediction error [Eq. (3)]:

        α_m = log((1 − Err_m) / Err_m)                                        (3)

    (d) Update the previous weights to new weights [Eq. (4)]:

        w_i = w_i · exp(α_m · I(y_i ≠ Z_m(x_i))),  i = 1, 2, 3, ..., M         (4)

(3) Find the predicted output [Eq. (5)]:

        Z(x) = sign( Σ_{m=1}^{N} α_m · Z_m(x) )                               (5)

Fig. 2 Schematic diagram of the proposed method

Furthermore, the proposed system has been evaluated and analyzed using state-
of-the-art methods, and it has demonstrated a significant improvement over other
simulation models used in this paper.
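The boosting recipe of Eqs. (1)-(5) can be rendered as a compact, stdlib-only sketch for binary labels in {−1, +1}, using one-feature threshold stumps as the base classifier Z_m. This illustrates the generic algorithm, not the authors' exact multiclass implementation (which uses scikit-learn decision trees with SAMME.R, as listed in Table 3).

```python
import math

def stump(X, y, w):
    """Best single-threshold stump under weights w (minimizes Eq. 2's numerator)."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(X, y, rounds):
    M = len(X)
    w = [1.0 / M] * M                       # Eq. (1)
    model = []
    for _ in range(rounds):
        err, thr, sign = stump(X, y, w)
        err = max(err / sum(w), 1e-10)      # Eq. (2), guarded against zero error
        alpha = math.log((1 - err) / err)   # Eq. (3)
        pred = [sign if x >= thr else -sign for x in X]
        w = [wi * math.exp(alpha * (yi != pi))   # Eq. (4)
             for wi, yi, pi in zip(w, y, pred)]
        model.append((alpha, thr, sign))
    return model

def predict(model, x):                      # Eq. (5)
    score = sum(a * (s if x >= t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 8, 9, 10]
y = [-1, -1, -1, 1, 1, 1]
m = adaboost(X, y, rounds=3)
print([predict(m, x) for x in X])  # [-1, -1, -1, 1, 1, 1]
```

Misclassified samples gain weight through Eq. (4), so each new stump concentrates on the examples the previous ones got wrong; the weighted vote of Eq. (5) then combines them.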

3.1 Dataset

The CAN bus intrusion detection dataset was prepared by Kang et al. [16]. The
dataset was generated for the car security competition 'Car Hacking: Attack &
Defence Challenge 2020'. The competition problem was to develop attacks and
detection algorithms for CAN, a widely used standard for in-vehicle communication.
The data was collected from a Hyundai Avante CN7. The dataset contains 8,681,500
samples across the classes Normal, Replay, Flooding, Fuzzing, and Spoofing,
as shown in Table 1. However, the huge imbalance between the samples belonging to
the normal and replay classes needs to be addressed.

4 Result Analysis

In this section, we first obtain the confusion matrix for the proposed IDS model's
performance on the CAN dataset. Initially, it shows that the proposed model
misclassifies replay attacks as normal traffic and has a low recognition rate.

Table 1 The dataset before and after pre-processing

Class     Number of records (before preprocessing)   Number of records (after preprocessing)
Normal    7,808,258                                  1,152,376
Replay    110,474                                    1,152,376
Flooding  345,859                                    345,859
Fuzzing   216,571                                    216,571
Spoofing  200,338                                    200,338

Fig. 3 a Confusion matrix of DT, b Confusion matrix of DT_SMOTE

Fig. 3a depicts the IDS’s performance on the CAN dataset before it was pre-processed
using the decision tree algorithm.
The following is the procedure for balancing the normal and replay classes:
1. Table 1 shows that the normal class has 7,808,258 records, whereas the replay
attack has only 110,474 records.
2. By ignoring the time stamp of normal data and identifying distinct data points
in normal, the number of records is reduced from 7,808,258 to 1,152,376.
3. Even after reducing the normal data set, normal data is still 90% more than
replay data, indicating a clear problem of class imbalance.
4. To overcome class imbalance, SMOTE was used on the minority class, which
resulted in balanced classes (as shown in Table 1) and an improvement in
detecting replay attacks (as shown in Fig. 3b).
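The idea behind the SMOTE step can be shown in miniature: each synthetic minority sample is placed on the line segment between a real minority sample and one of its nearest minority-class neighbours. The real method [15] uses k-nearest neighbours and random interpolation; this stdlib-only sketch fixes k = 1 and a deterministic interpolation ratio, so it is an illustration rather than a faithful implementation.

```python
def nearest(points, i):
    """Index of the closest other point (squared Euclidean distance)."""
    dists = [(sum((a - b) ** 2 for a, b in zip(points[i], p)), j)
             for j, p in enumerate(points) if j != i]
    return min(dists)[1]

def smote_like(minority, ratio=0.5):
    """One synthetic point per minority sample, interpolated toward its neighbour."""
    synthetic = []
    for i, x in enumerate(minority):
        nb = minority[nearest(minority, i)]
        synthetic.append(tuple(a + ratio * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(smote_like(minority))  # points midway toward each sample's neighbour
```

Because the synthetic points lie between existing replay samples rather than being exact copies, the classifier sees a denser but still plausible minority region.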
The performance of the proposed methodology has achieved a significant
improvement over the state-of-the-art methods. The results of various decision tree
algorithms on the intrusion dataset have been shown in Table 2.

4.1 Evaluation

We worked on many machine learning models on the CAN dataset in this paper, and
all model parameters were set by picking appropriate values as shown in Table 3.
After putting the test dataset through the classifier, we evaluate its performance
using typical evaluation measures. To assess the effectiveness of the suggested
strategy, various performance measures such as Accuracy, Precision, Recall, F1

Table 2 Performance analysis of IDS on CAN dataset


Method Attacks (true positives)
Flooding Fuzzy Normal Replay Spoofing
RF [17] 1.0 1.0 0.991 0.994 1.0
AdaBoost [18] 1.0 0.999 0.987 0.991 1.0
Gradient boost [19] 1.0 0.97 0.948 0.864 1.0
Extra tree [20] 1.0 1.0 0.991 0.993 1.0
HGBoost [21] 1.0 0.998 0.968 0.947 1.0
XGBoost [22] 1.0 0.958 0.943 0.828 0.981
CatBoost [23] 1.0 0.998 0.978 0.97 1.0
DT 1.0 0.995 0.987 0.991 1.0
Gaussian naïve bayes 1.0 0.784 0.169 0.813 0.692
KNN 1.0 0.972 0.944 0.984 1.0
DT_SMOTE 1.0 0.996 0.737 0.886 1.0
RF_SMOTE 1.0 0.999 0.741 0.877 1.0
AdaBoost_SMOTE 1.0 0.977 0.999 0.386 1.0
Gradient boost_SMOTE 1.0 0.953 1.0 0.04 1.0
Extra tree_SMOTE 1.0 1.0 0.738 0.879 1.0
HGBoost_SMOTE 1.0 0.994 0.998 0.226 1.0
XGBoost_SMOTE 1.0 0.917 0.999 0.0 0.981
CatBoost_SMOTE 1.0 0.998 0.999 0.208 1.0
Gaussian naïve bayes_SMOTE 1.0 0.544 0.846 0.0 0.842
KNN_SMOTE 1.0 0.973 0.988 0.323 1.0

Score, and ROC-AUC were computed and compared; these measures are derived from
the True Positive, True Negative, False Positive, and False Negative counts. Table 4
shows the results of analyzing the performance of various strategies using the above
measures.
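How the reported measures relate to the confusion-matrix counts can be made explicit with a short per-class (one-vs-rest) sketch; the counts below are invented for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=90, fp=10, fn=30, tn=870)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
# 0.9 0.75 0.818 0.96
```

Note how a model can score high accuracy while recall for a rare class stays low, which is exactly why the F1 columns in Table 4 separate the SMOTE and non-SMOTE runs so sharply.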
We used a variety of machine learning classification models in this research. Gaus-
sian NB, Gradient Boost, KNN, CatBoost, Hist Gradient Boosting, Extra Tree, RF,
DT, and Adaptive Boosting are some of the techniques used. Original car hacking
data samples were used to evaluate all of the models. All models, however, were
implemented using SMOTE, using car hacking data samples. The total correct clas-
sification rate and F1 score of all SMOTE-enabled models are higher than that of
non-SMOTE-enabled models. Among all models, the Adaptive Boosting ensemble
classification model with SMOTE has the highest accuracy and F1 score with minimal
execution time as shown in Table 4.
The proposed system outperforms the methods of Kang et al. [16] in terms of
accuracy and F1-score. The proposed AdaBoost has shown the best accuracy, while NB
and DT have the lowest accuracy. The confusion matrix for AdaBoost is shown in
Fig. 4.

Table 3 Parameters of all considered models


Models Parameters
Gaussian naive bayes_SMOTE Var_smoothing = 1e-09
XGBoost_SMOTE n_estimators = 100, max_depth = 6, reg_lambda = 0,
reg_alpha = 0
Gradient boost_SMOTE Learning_rate = 0.1, min_samples_leaf = 1,
validation_fraction = 0.1, n_estimators = 100, criterion =
‘friedman_mse’, max_depth = 3, warm_start = False, tol =
0.0001, ccp_alpha = 0.0, loss = ‘deviance’, subsample = 1.0,
min_samples_split = 2
KNN_SMOTE n_neighbors = 5, weights = ‘uniform’, algorithm = ‘auto’,
leaf_size = 30, p = 2, metric = ‘minkowski’
CatBoost_SMOTE Iteration = 100, learning_rate = 0.1
HGBoost_SMOTE Learning_rate = 0.1, n_estimators = 100, subsample = 1.0
ExtraTree_SMOTE n_estimators = 100, criterion = ‘gini’, max_depth = None,
min_samples_split = 2, min_samples_leaf = 1
RF_SMOTE Criterion = ‘gini’, min_samples_leaf = 1, max_features =
‘auto’, min_impurity_decrease = 0.0, bootstrap = True,
warm_start = False, class_weight = ‘balanced’, n_estimators
= 100, min_samples_split = 2
DT_SMOTE Splitter = ‘best’, presort = False, min_samples_split = 2,
class_weight = ‘balanced’, criterion = ‘gini’,
min_samples_leaf = 1
Proposed AdaBoost_SMOTE Base_estimator = DecisionTreeClassifier, n_estimators = 1,
algorithm = ‘SAMME.R’, learning_rate = 1.0

The ROC-AUC curves for AdaBoost are shown in Fig. 5.

5 Conclusion

This research focused on building an intrusion detection model to detect
attacks in in-vehicle networks. The proposed system improves accuracy and
detects the attacks in less time: AdaBoost completed classification in 0.16 s using the
pre-processed car hacking dataset. The pre-processing focuses on balancing the
imbalanced classes using the SMOTE oversampling method and on ignoring the
timestamps of the messages. The proposed system achieved 99.6% accuracy and an
F1-score of about 0.99 (AdaBoost), a significant improvement over the F1-score of
0.86 achieved by the method of Kang et al. [16]. From the results it is also evident
that the Naive Bayes classifier performs very poorly; KNN gives better performance,
but its classification time of 14.99 s is very high compared to the proposed approach.

Table 4 The performance analysis of various Tree-based methods


Method Accuracy (%) F1 score Execution time (sec)
Gaussian naive bayes 83.4 0.54 0.77
XGBoost 98.4 0.78 22.94
Gradient boost 98.6 0.81 14.74
KNN 98.9 0.88 2398.71
CatBoost 98.9 0.86 8.29
HGBoost 98.8 0.86 24.88
Extra tree 76.2 0.788 49.43
RF 76.5 0.788 38.60
DT 76.5 0.788 0.38
AdaBoost 99.0 0.90 0.77
Gaussian naive bayes_SMOTE 58.2 0.614 0.23
XGBoost_SMOTE 90.9 0.946 8.35
Gradient boost_SMOTE 92.7 0.956 5.06
KNN_SMOTE 97.1 0.98 14.99
CatBoost_SMOTE 98.0 0.988 1949
HGBoost_SMOTE 96.7 0.996 10.25
Extra tree_SMOTE 99.4 0.996 25.15
RF_SMOTE 99.4 0.996 15.51
DT_SMOTE 99.12 0.996 0.16
AdaBoost_SMOTE 99.15 0.996 0.41

Fig. 4 Confusion matrix of


AdaBoost

Fig. 5 ROC-AUC of AdaBoost

Moreover, all the results and analyses show that the proposed model reaches
highly improved accuracy as well as F1-score when compared to other machine
learning models. Future work is to build a highly generalized model that will
work on large datasets with a greater number of attack types.

References

1. N. Khatri, R. Shrestha, S.Y. Nam, Security issues with in-vehicle networks, and enhanced
countermeasures based on blockchain. Electronics 10(8) 893 (2021)
2. M.-J. Kang, Kang J.-W., Intrusion detection system using deep neural network for in-vehicle
network security. PloS one 11(6) e0155781 (2016)
3. S. Alam, M. Shuaib, A. Samad, A collaborative study of intrusion detection and prevention
techniques in cloud computing, in Proceedings of ICICC 2018, vol. 1 (2019) https://doi.org/
10.1007/978-981-13-2324-9_23
4. W. Wu, Z. Yang, K. Li., Internet of vehicles and applications, in Internet of Things (Morgan
Kaufmann, 2016), pp. 299–317
5. Y. Sun et al., Attacks and countermeasures in the internet of vehicles. Ann. Telecommun.
72(5–6) 283–295 (2017)
6. E. Seo, H.M. Song, H.K. Kim, Gids: Gan based intrusion detection system for in-vehicle
network, in 2018 16th Annual Conference on Privacy, Security and Trust (PST) (IEEE, 2018)
7. H. Lee, S.H. Jeong, H.K. Kim, OTIDS: A novel intrusion detection system for in-vehicle
network by using remote frame, in 2017 15th Annual Conference on Privacy, Security and
Trust (PST) (IEEE, 2017), pp. 57–5709
8. M.L. Han, B.I. Kwak, H.K. Kim, Anomaly intrusion detection method for vehicular networks
based on survival analysis. Veh. Commun. 14, 52–63 (2018)
9. N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
16, 321–357 (2002)
10. T. Alladi, V. Kohli, V. Chamola, F.R. Yu, Securing the internet of vehicles: a deep learning-based
classification framework. IEEE Networking Lett. 3(2), 94–97 (2021)

11. A. Zhou, Z. Li, Y. Shen, Anomaly detection of CAN bus messages using a deep neural network
for autonomous vehicles. Appl. Sci. 9(15), 3174 (2019)
12. M. Müter, N. Asaj, Entropy-based anomaly detection for in-vehicle networks, in 2011
IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, pp. 1110–1115 (2011).
https://doi.org/10.1109/ivs.2011.5940552
13. M.J. Kang, J.W. Kang, A novel intrusion detection method using deep neural network for in-
vehicle network security, in 2016 IEEE 83rd Vehicular Technology Conference (VTC Spring)
(IEEE, 2016), pp. 1–5
14. H.M. Song, J. Woo, H.K. Kim, In-vehicle network intrusion detection using deep convolutional
neural network. Veh. Commun. 21, 100198 (2020)
15. D. Elreedy, A.F. Atiya, A comprehensive analysis of synthetic minority oversampling technique
(SMOTE) for handling class imbalance. Inf. Sci. 505, 32–64 (2019)
16. H. Kang et al., Car hacking and defense competition on in-vehicle network. Workshop Automot.
Auton. Veh. Secur. (AutoSec). 2021 (2021)
17. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
18. R.E. Schapire, Explaining adaboost, in Empirical Inference (Springer, Berlin, 2013), pp. 37–52
19. J.H. Friedman, Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
20. P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees. Mach. Learn. 63(1), 3–42
(2006)
21. A. Guryanov, Histogram-based algorithm for building gradient boosting ensembles of piece-
wise linear decision trees, in Analysis of Images, Social Networks and Texts. AIST 2019. Lecture
Notes in Computer Science, vol. 11832, eds. by W. van der Aalst et al. (Springer, Cham, 2019)
22. T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in Proceedings of the 22nd
Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, pp. 785–794
23. A.V. Dorogush, V. Ershov, A. Gulin, CatBoost: gradient boosting with categorical features
support. arXiv preprint arXiv:1810.11363 (2018)
Autonomous Vehicles: A Survey
on Sensor Fusion, Lane Detection
and Drivable Area Segmentation

Tejas Morkar, Suyash Sonawane, Aditya Mahajan, and Swati Shinde

Abstract Most road traffic accidents are caused due to human errors and lead to fatal
injuries, hospitalization, and even death. Autonomous vehicles are a promising
option for minimizing the possibility of such human errors. But autonomous systems
today are not capable of handling the extreme uncertainties that an everyday driver has
to face on the road. Researchers are trying to study new ways of making autonomous
vehicles better and safe. In this paper, we have discussed some of the most important
parts of a modular autonomous vehicle: Sensor fusion, lane detection, and drivable
area segmentation. We also present a detailed survey of existing and state-of-the-art
approaches for these modules. Understanding these techniques and how they work,
can lay a proper foundation for the planning and acting phase of autonomous vehicle
systems.

Keywords Autonomous vehicles · Sensor fusion · Lane detection · Drivable area


segmentation

1 Introduction

According to a fact sheet by World Health Organization (WHO), approximately


1.35 million people die each year due to road accidents [1]. The same fact sheet
mentioned various risk factors that lead to road traffic accidents and many of them
were a direct result of human errors such as speeding, driving under the influence
of alcohol and other substances, distracted driving, and others. Advanced Driver
Assistant Systems and Autonomous Vehicles assist in minimizing aspects of human

T. Morkar (B) · S. Sonawane · A. Mahajan · S. Shinde


Department of Computer Engineering, Pimpri Chinchwad College of Engineering, Pune, India
e-mail: tejas.morkar18@pccoepune.org
S. Sonawane
e-mail: suyash.sonawane18@pccoepune.org
A. Mahajan
e-mail: aditya.mahajan18@pccoepune.org

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_52
696 T. Morkar et al.

error in the causes of road accidents. Autonomous machines can be broken down
into three important modules of sensing, perceiving, and planning and acting.
In this paper, we have conducted a thorough survey on the sensing and perceiving
block of autonomous vehicle systems, as shown in Fig. 1. Sensor fusion is the
most vital part of converting raw sensor data into meaningful information for the
perceiving blocks. We discuss sensor fusion techniques that address specific problems
like configurations, calibration errors, and required data fusion. We also examine the
state-of-the-art techniques for lane detection and drivable area segmentation further
and provide comparison analysis to understand and choose the correct techniques
based on the required parameters like available computing power, response time, and
input video stream quality. Our contribution through this paper is to provide a deep
study on these modules and strengthen the base for further research in the acting
phase of the autonomous vehicle systems.
The remainder of this paper is organized as follows: The literature survey is discussed
module-wise in detail in Sect. 2, where Sect. 2.1 covers sensor fusion, Sect. 2.2 covers
lane detection, and Sect. 2.3 covers drivable area segmentation. In Sect. 3, we discuss
the outcomes of the available literature on all these modules for an autonomous
vehicle. Section 4 concludes the study.

Fig. 1 Block diagram of an autonomous vehicle with 3 main modules: Sense, perceive, and plan
and act
Autonomous Vehicles: A Survey on Sensor Fusion, Lane … 697

2 Literature Survey

This section is divided into literature survey of the sensor fusion, lane detection, and
drivable area segmentation modules, respectively.

2.1 Sensor Fusion

For being able to perceive the real world accurately, autonomous systems need to
be able to interpret that data. Sensor fusion is needed here to combine data from
multiple sensors on the vehicle and provide self-awareness and situational awareness
parameters for the vehicle like its localization, positioning, object detection, and
tracking. There are various issues while combining multi-sensor data like calibration,
data parallax, and synchronization issues. Some of the focus areas of studies in this
domain are about which sensor configurations should be used [2, 3], how to achieve
proper data synchronization by automation [4, 5], and ways of achieving better
sensor fusion using techniques like Kalman filter [6, 7], Fuzzy logic [8, 9], and
Deep Learning-based models [10, 11]. We will provide a thorough survey and recent
advancements in this topic for autonomous vehicles.

2.1.1 Sensor Configurations

Various types of sensors are available to provide critical information about the
surrounding of autonomous vehicles such as RADAR (Radio Detection And
Ranging), LIDAR (Light Detection and Ranging), ultrasound, camera, thermal
camera, and a few others. Every sensor has its own pros and cons, as mentioned in
Fig. 2 and a detailed comparison can be seen in Table 1. And these sensors provide
data which in itself is not enough for an autonomous system to make inference
from and so we need to use configurations of these sensors to obtain some useful

Fig. 2 Comparison of strengths and weaknesses of different sensors shown on a radar chart. Data
taken from [12]

Table 1 Comparison of sensors with respect to their properties

Sensor      Range   Reliability in all conditions   Feasibility
LIDAR       Good    High                            Low
Ultrasound  Bad     High                            High
RADAR       Good    High                            High
Camera      Best    Low                             High

information. Sensors can be grouped into categories based on their functioning,
mode of operation, and various other parameters [10]. Understanding these
sensor technologies and choosing the correct configuration is vital.
The sensor configuration should be such that the vehicle needs no heavy
alterations, the sensing coverage of the vehicle is increased, and the sensors
already on the vehicle are used as much as possible. Keeping this in mind, Cho et al. [13]
have designed a system with 6 RADARs, 6 LIDARs, and 3 cameras that can detect
any object within 200 m of the vehicle and at least 2 different sensors can detect
an object if it is within 60 m of range from the vehicle. This system was used
for detecting moving objects and tracking them in an urban driving environment.
Sensors selected for the system should not only be as accurate as they can be but
also feasible and efficient. A sensor fusion framework was proposed where tasks
like road segmentation, obstacle detection, and object tracking are performed with
low-cost, high efficiency, and robust methods using an encoder-decoder-based fully
convolutional neural network and an Extended Kalman Filter [2]. They were able to
obtain a detailed map of the environment and successfully implemented the sensor
fusion of LIDAR, RADAR, and camera data.
Being able to implement an efficient and useful configuration of sensors on an
autonomous vehicle is an important step and so is the location and orientation of the
sensors on it. VESPA (Vehicle Sensor Placement and orientation for Autonomy) [3],
is a system proposed to optimize the placement and orientation of the sensors on a
vehicle given its field of view zones, set of sensors, and required ADAS (Advanced
Driver Assistance System) features on the vehicle. They have tested the system for
optimizing the perception performance of 2 vehicles, namely, the 2016 Chevrolet
Camaro and the 2019 Chevrolet Blazer. Studies have also been conducted to maxi-
mize the results by altering the sensors and the ways they are used like in the case
of Radar [14, 15] and LIDAR [16, 17].

2.1.2 Sensor Data Synchronization and Calibration

When using multiple sensors, due to the absence of a common time base, the data
streams from different sensors can be out of sync, which creates a problem in the
process of sensor fusion. Various techniques for synchronization and calibration of sensor
data are being studied by researchers. Car vibrations and steering events can be used
to synchronize the driving data from various inputs in some cases [4]. The data
was properly synchronized with an average synchronization error of 13 ms by the

authors and this method requires no manual annotation before, during, or after the
data collection phase, and no artificially created synchronization events are needed
but this method has not been evaluated on large multi-vehicle on-road datasets.
An automatic method of calibration is proposed in a paper [5] where the authors
show how multiple 2D and 3D sensors like LIDARs and cameras can be calibrated
automatically using a rolling target in front of the vehicle such as a ball. They use
various algorithms to detect and track the center of the rolling ball through all the
sensors to synchronize their data streams. 3D maps are usually generated by sensor
fusion of LIDAR and camera data to help the decision-making module of autonomous
vehicles. Data calibration of LIDAR and camera [18, 19] is very important to get
a precise 3D map output. An ROS framework was also proposed by Oliveira et al.
[20] which was able to calibrate the multiple sensors in a multi-modal manner with
similar accuracy as compared to the standard pairwise calibrations of sensors. High
definition 3D map generation using multiple sensors like GNSS (Global Navigation
Satellite System), IMU (Inertial Measurement Unit), and LIDAR using point clouds
[21] can be used for navigation systems in vehicles where accurate sensor calibration
is required.
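The core of event-based synchronization can be shown in a toy form: estimate the lag between two sensor streams by maximizing their cross-correlation. Real systems correlate richer features, such as steering events or a tracked ball's centre; the signals and lag range below are made up.

```python
def best_lag(a, b, max_lag):
    """Lag (in samples) that best aligns stream b to reference stream a."""
    def corr(lag):
        # Cross-correlation of a with b shifted by `lag`.
        pairs = [(a[i], b[i - lag]) for i in range(len(a))
                 if 0 <= i - lag < len(b)]
        return sum(x * y for x, y in pairs)
    return max(range(-max_lag, max_lag + 1), key=corr)

a = [0, 0, 0, 5, 0, 0, 0, 4, 0, 0]      # reference stream
b = [0, 5, 0, 0, 0, 4, 0, 0, 0, 0]      # same events, two samples early
print(best_lag(a, b, max_lag=4))  # 2
```

Once the lag is known, one stream is shifted by that many samples before fusion, which is what the automatic methods above do continuously and at millisecond precision.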

2.1.3 Sensor Fusion Techniques

Techniques used for sensor fusion vary from sensors to sensors and use-cases to use-
cases. The most widely used technique in sensor fusion is using Extended Kalman
Filters (EKF) to fuse different sensor data streams and minimize the noise. An EKF
that reflects on the distance characteristics of LIDAR and RADAR sensors [6] can
be used to accurately estimate the distances from the vehicle and hence, improve
the accuracy of position estimation. For better navigation system and localization on
maps, an Extended Kalman Filter is used for fusing data from 3D LIDAR, GNSS, and
inertial data [7]. This allows to accurately localize the car on a map in urban areas too.
Also, Adaptive Kalman Filter (AKF) with an attenuation factor can help decrease
noise and assist in navigation based on INS/GPS and the accuracy reported by Liu
et al. [22], was 20% higher than that of a traditional AKF. A fuzzy logic enhanced
Kalman filter to fuse the information from machine vision, laser radar, IMU, and
speed sensors on the vehicles can also be used as an efficient sensor fusion technique
[8]. Fuzzy logic enhanced Kalman filter enables to reduce noise from sensors and
translate it into meaningful information for the autonomous system.
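The Kalman update at the heart of these fusion schemes can be sketched in one dimension: noisy range readings from two sensors (say, LIDAR and RADAR) are fused into one estimate, each weighted by its variance. The variances and readings here are invented for illustration; the production EKFs cited above track a full vehicle state vector, not a single scalar.

```python
def kalman_update(est, var, z, r):
    """Fuse current estimate (est, var) with a measurement z of variance r."""
    k = var / (var + r)                 # Kalman gain: trust low-variance inputs more
    return est + k * (z - est), (1 - k) * var

est, var = 0.0, 1e6                     # start from a vague prior
for z, r in [(10.2, 0.5), (9.8, 1.0), (10.1, 0.5), (9.9, 1.0)]:
    est, var = kalman_update(est, var, z, r)
print(round(est, 2))                    # settles close to 10, variance shrinks
```

The same gain formula is what lets an EKF down-weight a noisy RADAR return when a precise LIDAR return is available, after linearizing the measurement model.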
Deep learning is also used for sensor fusion in autonomous vehicles [10] for
purposes of localization and perception of the environment. While using sensors like
cameras, RADARs, LIDARs, and thermal cameras, deep learning techniques provide
a way to combine the data streams and make useful representations. Another proposed
method in deep learning is that feature level fusion can be done while working with
thermal cameras, visual cameras, and radars using RVNets and TVNets [11]. These
networks work together along with the skip connections to extract features to the
output branch. The feature fusion and object detection is done after the transfer.

Tasks like object detection require data fusion to work better and a deep learning-
based model, Camera, Radar Fusion Net (CRF-Net) [23] is specially designed for
identifying correct data fusion behavior for camera and radar sensors in order to
provide better results for detection purpose. Human activity recognition is also a
key feature to detect on roads and a study [24] shows how they used Long Short-
Term Memory (LSTM) networks to train the deep learning models that detect human
activity by applying it to the classification model at a sensor fusion level of the system.
It aids in better performance of the model.

2.2 Lane Detection

Lane detection is one of the most important components of a modular autonomous


vehicle system in terms of making decisions. Lane markers provide crucial infor-
mation for where the vehicles are and where they are going. As a result, identifying
them with pixel-level accuracy is critical for self-driving vehicles. To perceive lanes,
handcrafted computer vision techniques and Deep Learning models are used and to
boost accuracy, lanes are further divided into instance segments. To compensate for
camera angles, road width, and curves, various transforms are used, and lightweight
and fast methods for lane detection are preferred to boost overall device throughput.
All these vital points regarding the lane detection module will be discussed in detail
and a comparison among all the mentioned methods can be found in Table 2 at the
end of Sect. 2.2.
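One common handcrafted step mentioned above can be shown in miniature: given a binary lane mask (rows of 0/1 pixels, typically from the lower half of a frame after thresholding), sum each column and take the peak in each image half as the base position of the left and right lane markings. The mask values below are made up, and real pipelines apply this to a perspective-corrected image.

```python
def lane_bases(mask):
    """Column-histogram peaks in each image half, as lane base positions."""
    cols = len(mask[0])
    hist = [sum(row[c] for row in mask) for c in range(cols)]
    mid = cols // 2
    left = max(range(mid), key=hist.__getitem__)
    right = max(range(mid, cols), key=hist.__getitem__)
    return left, right

mask = [
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
]
print(lane_bases(mask))  # (1, 6)
```

Sliding-window searches then grow upward from these base columns; the learned methods surveyed next replace this entire hand-tuned stage with pixel-level segmentation.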

2.2.1 LaneNet: Real-Time Lane Detection Networks for Autonomous


Driving

Wang et al. [25] proposed LaneNet, a lane detection system designed to be a versatile,
computationally light, real-time solution. The proposed architecture divides the
task into two stages: a lane edge proposal network and a lane localization network.
The first stage runs a binary classification on the image pixels to generate lane
edges, which are then fed into the
Table 2 Accuracy and F1 score metrics comparison of different approaches for lane detection
Approach Accuracy F1 score
Towards End-to-End lane detection: an instance segmentation approach 0.964 –
LaneATT 0.9563 0.9677
RONELD 0.89a –
Camera 0.942 0.739
a RONELD is not a standalone technique; it is used to improve on other techniques used in lane detection
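For reference, the F1 score reported in Table 2 is the harmonic mean of precision and recall. A minimal sketch, with purely illustrative counts (not taken from any surveyed paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only.
print(round(f1_score(tp=90, fp=10, fn=10), 3))  # 0.9
```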
Autonomous Vehicles: A Survey on Sensor Fusion, Lane … 701

Fig. 3 Block diagram of LaneNet architecture

localization network in the next stage. This can be seen in the block diagram from
Fig. 3.
These stages are jointly optimized for precision, computational cost, and running speed. A light-weight encoder-decoder architecture composed of stacked depthwise convolution layers with 1 × 1 convolution filters is used for fast feature encoding. The encoder output is then converted into lane edge coordinates and fed into the next stage, where the high-speed localization network, consisting of a point feature encoder and LSTM decoders, performs robust detection in various scenarios.
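To see why depthwise convolutions with 1 × 1 pointwise filters make such an encoder light-weight, the parameter counts can be compared. The channel sizes below are illustrative, not LaneNet's actual configuration:

```python
def conv_params(k, c_in, c_out):
    # Standard k x k convolution (bias omitted).
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # One k x k depthwise filter per input channel, then a 1 x 1
    # pointwise projection across channels.
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 128)
dws = depthwise_separable_params(3, 128, 128)
print(std, dws, round(std / dws, 1))  # 147456 17536 8.4
```

For a 3 × 3 layer at 128 channels, the depthwise-separable variant needs roughly 8x fewer parameters, which is the main source of the speed-up.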
This two-stage technique also creates room for additional optimization: the lane edge map created by the first network serves as an interpretable intermediate feature, which mitigates the black-box property of neural network-based methods and makes detection failures more traceable. It further allows the parameters of the network to be refined in a weakly-supervised manner, alleviating the need for well-annotated training data. Last but not least, the proposal network can be merged into a semantic segmentation network, further lowering the total computing expense of driving assistance systems.

2.2.2 Towards End-to-End Lane Detection: An Instance Segmentation Approach

Neven et al. [26] trained a neural network end to end for lane detection, taking into consideration the abrupt lane-switching issue as well as a variable number of lanes. These capabilities are achieved by treating the problem as

instance segmentation. The LaneNet architecture uses binary segmentation with a clustering loss function to optimize single-shot segmentation, and every pixel is assigned the identifier of its corresponding lane. As the network generates a series of pixels per lane, a curve must also be fitted across these pixels to obtain the lane parameters. Traditionally, the lane pixels are first projected onto a "bird's-eye view" representation using a fixed transformation matrix. This creates an issue in generalizing the transformation parameters to non-flat ground such as slopes and hills, which can be addressed with a set of learnable parameters in a neural network called H-Net. Instead of a handcrafted transform, this learning-based method fits the lanes with a polynomial curve rather than a linear transform.
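The projection-then-fit step can be sketched as follows. The homography values and lane pixels below are illustrative stand-ins; the point of H-Net is that the transform parameters are learned per image rather than fixed:

```python
import numpy as np

# Illustrative fixed homography; H-Net would learn such parameters per image.
H = np.array([[1.0, 0.05,  0.0],
              [0.0, 1.0,   0.0],
              [0.0, 0.001, 1.0]])

def to_birds_eye(points, H):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) lane pixels."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    warped = pts @ H.T
    return warped[:, :2] / warped[:, 2:3]                 # perspective divide

# Toy lane pixels; in practice these come from the segmentation output.
lane = np.array([[100.0, y] for y in range(200, 400, 20)], dtype=float)
bev = to_birds_eye(lane, H)

# Fit a 2nd-order polynomial x = f(y) in the bird's-eye view.
coeffs = np.polyfit(bev[:, 1], bev[:, 0], deg=2)
print(coeffs.shape)  # (3,): quadratic, linear, constant terms
```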
The results of this approach outperform techniques with fixed transformations when H-Net is used for lane detection. It also yields a better mean squared error score and can match points even on undulating slopes. This approach placed 4th in the tuSimple challenge, with just a 0.5% differential from the first entry. These results were achieved with a model trained only on the tuSimple dataset, yet it generalizes well.

2.2.3 YOLinO: Generic Single Shot Polyline Detection in Real Time

Meyer et al. [27] proposed a novel approach for detecting road markings, lane boundaries, and center lines by reformulating the polyline detection problem as a bottom-up composition of small line segments, capable of detecting bounded, dotted, and continuous polylines with one head. This approach has some significant benefits over previous approaches: it is well-suited for real-time applications, running at 187 FPS with almost no restrictions on the forms of the observed polylines.
Previously, sequences of Recurrent Neural Networks (RNNs) had been proposed for highly effective automatic instance segmentation. Each of them predicts a bounding box or crop around an entity and then uses gated graph neural networks to predict the polyline vertices node by node, with optional refinement. Not only are those RNNs typically slow and difficult to train, but they often require extra care to predict the initial vertex, which relies on a general initialization that works well for instance segmentation but not for this application. Though quicker than recurrent methods, 30 ms is still a long time in comparison to single-shot detectors.
Furthermore, the same head can distinguish both dotted and continuous lines. This enables robotic applications such as road marking identification and lane centerline determination, but it is also feasible for a wide range of other applications, such as blood vessel detection. Although the authors demonstrated the general concept with YOLO9000, using a modern backbone such as EfficientNet and adding ideas from YOLOv4 is expected to increase efficiency even more.
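A toy sketch of such bottom-up composition is shown below: predicted segments are greedily chained when the tail of a growing polyline coincides with the head of another segment. The actual YOLinO decoder is more involved (per-grid-cell predictions with confidences), so this only illustrates the composition idea:

```python
import math

def link_segments(segments, tol=0.5):
    """Greedily chain small line segments into polylines by matching the
    tail of a growing polyline to the head of a remaining segment."""
    remaining = list(segments)
    polylines = []
    while remaining:
        start, end = remaining.pop(0)
        line = [start, end]
        grew = True
        while grew:
            grew = False
            for i, (p, q) in enumerate(remaining):
                if math.dist(line[-1], p) <= tol:  # segment continues the line
                    line.append(q)
                    remaining.pop(i)
                    grew = True
                    break
        polylines.append(line)
    return polylines

segs = [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((5, 5), (6, 5))]
print(link_segments(segs))
# [[(0, 0), (1, 0), (2, 0)], [(5, 5), (6, 5)]]
```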

2.2.4 Keep Your Eyes on the Lane: Real-Time Attention-Guided Lane Detection

Tabelini et al. [28] proposed LaneATT, an anchor-based deep lane detection model that, like other generic deep object detectors, uses anchors for feature pooling. The hypothesis is that since lanes follow a regular pattern and are highly correlated, global information could be critical in some cases to infer their locations. As a result, the paper proposes a novel anchor-based attention mechanism for aggregating global information. The model was thoroughly tested on three of the most commonly used datasets in the literature.
LaneATT is a single-stage anchor-based model for lane detection (similar to YOLOv3 or SSD). By integrating local and global features, the model can more effectively use information from other lanes, which may be needed in situations with occluded or invisible lane markers. Finally, the merged features are sent to fully connected layers, which predict the final output lanes. The system achieves the second-highest reported F1 on TuSimple despite being much quicker than the top-F1 method (171 vs. 30 FPS). The method set a new state of the art for real-time methods on CULane in terms of both speed and precision (+4.38% F1 compared to the state-of-the-art method with a comparable speed of about 170 FPS). Furthermore, with all three backbones, the system scored a high F1 (above 93%) on the LLAMAS benchmark.

2.2.5 RONELD: Robust Neural Network Output Enhancement for Active Lane Detection

The research work by Chng et al. [29] differs from the others in that, while they are techniques to detect lanes on the road, the method proposed in this paper works as an enhancement on top of an existing lane detection method. RONELD is used in pairs with deep learning models that perform poorly, to make lane detection more effective. The approach is based on the observation that lane markings forecast from the network's probability map can be used to boost accuracy.
Accuracy can thus be improved severalfold. RONELD uses the network probability maps to improve the system's output, and its low computational time makes it eligible for real-time systems. The proposed method was tested with Spatial CNN and ENet-based models, resulting in higher accuracy with low added processing time for RONELD. This shows that by combining this method, the overall effectiveness of existing techniques can be improved: the accuracy of traditional techniques improves by up to 69.4% at 0.3 to 0.4 IoU thresholds, and roughly two-fold on the more constrained 0.5 threshold, for both previous methods.
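The IoU thresholds used in these accuracy figures measure the overlap between predicted and ground-truth lane regions. A minimal sketch on binary masks (the masks below are illustrative):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-Union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

gt = np.zeros((10, 10), dtype=bool)
pred = np.zeros((10, 10), dtype=bool)
gt[:, 4:6] = True          # ground-truth lane strip
pred[:, 5:7] = True        # prediction shifted by one column
print(round(mask_iou(gt, pred), 3))  # 0.333
```

A prediction counts as correct at, say, the 0.5 threshold only if its IoU with the ground truth exceeds 0.5, which is why accuracy drops as the threshold tightens.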

Table 3 Metrics comparison of different approaches for drivable area segmentation
Paper mIoU (%) FPS Input resolution GPU
ICNet 70.6 30 1024 × 2048 Nvidia Titan X
BiSeNet 68.4 105 2048 × 1024 Nvidia Titan XP
Light-weight RefineNet 72.1 55 512 × 512 GTX 1080 Ti
Pre-trained ImageNet architectures 75.5 39.9 1024 × 2048 GTX 1080 Ti

2.3 Drivable Area Segmentation

Another important module of the perception and decision-making part of an autonomous vehicle is drivable area segmentation. The drivable area can be
defined as a portion of the area seen by the vehicle where it can drive safely. There
are various techniques used to identify the drivable area like simple classification,
object detection, semantic segmentation, instance segmentation, and others. Segmen-
tation is a process of dividing the input image into sets of pixels called image objects
such as cars, trucks, pedestrians, traffic signals, and many more. Segmenting the
input frames of the video seen by the vehicle into different classes helps the system
find and define a drivable area on the road and it further supports better object detec-
tion, ground plane estimation of the road, and lane boundary assessment. A brief
comparison among all the methods discussed further can be found in Table 3, at the
end of this section.

2.3.1 In Defense of Pre-trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images

Oršić et al. [30] have shown the success and robustness of semantic segmentation methods on real road-driving datasets, even in challenging visibility conditions. However, real-time inference remains a challenge due to tremendous computational power requirements. In this paper, the authors propose a light-weight and faster approach for achieving semantic segmentation in real time.
Currently, deep fully-convolutional models provide the best results for semantic segmentation but require extraordinary computational resources. Most approaches that try to deal with this issue use custom-made light-weight architectures, which are not ideal for visual perception on a large scale. The authors state that these approaches forgo the large regularization benefit offered by transfer learning from a larger dataset with more diverse data, and are thus prone to overfitting.
Oršić et al. [30] proposed a segmentation model built upon a pre-trained ImageNet encoder, which benefits from the regularization induced by knowledge transfer, and a decoder that restores the resolution of the encoded features. On the Cityscapes dataset, this approach is able to achieve 75.5% mIoU while processing
Autonomous Vehicles: A Survey on Sensor Fusion, Lane … 705

1024 × 2048 images at 39.9 Hz on a GTX 1080Ti, and thus the authors argue that
this method provides an acceptable speed-accuracy trade-off.
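The mIoU metric quoted throughout this section averages the per-class IoU computed from a confusion matrix. A minimal sketch, with an illustrative 2-class matrix:

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """mIoU from a KxK confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp       # predicted as class c but wrong
    fn = conf.sum(axis=1) - tp       # class c missed
    iou = tp / (tp + fp + fn)
    return float(iou.mean())

conf = np.array([[50, 10],
                 [ 5, 35]])          # illustrative counts only
print(round(mean_iou(conf), 3))      # 0.735
```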

2.3.2 Light-Weight RefineNet for Real-Time Semantic Segmentation

Nekrasov et al. [31] modified an already existing semantic segmentation architecture called RefineNet into a more compact architecture, which is ideal for tasks
that require fast and accurate real-time performance. The authors have identified
the computationally intensive blocks from the original RefineNet architecture and
proposed two new modifications that are able to reduce the computation. These steps
achieve a two-fold model reduction, while the performance remains nearly intact.
In the proposed approach, the authors achieve model reduction in two steps: (1) replacing 3 × 3 convolutions with 1 × 1 convolutions; results showed that this modification does not hurt performance, and it reduces parameters by 2× and FLOPs by 3×; (2) omitting RCU blocks; on testing, it was observed that omitting RCU blocks did not affect accuracy. Thus, the final architecture contains only CRP blocks and does not rely on RCU blocks. The CRP blocks use 1 × 1 convolutions and 5 × 5 max pooling, making this architecture extremely fast. Using this modified version of RefineNet, the authors were able to achieve 55 FPS on 512 × 512 inputs.
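A quick calculation illustrates the saving from step (1). The channel width is illustrative; the overall 2x network-wide reduction depends on how many layers are actually replaced:

```python
def conv2d_params(k, c_in, c_out):
    # Parameter count of a k x k convolution (bias omitted).
    return k * k * c_in * c_out

c = 256
before = conv2d_params(3, c, c)   # a 3x3 conv at 256 channels
after = conv2d_params(1, c, c)    # the same layer as a 1x1 conv
print(before // after)            # 9: each replaced layer is 9x smaller
```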

2.3.3 BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation

Yu et al. [32] address the trade-off between spatial resolution and real-time performance with a new approach: a Bilateral Segmentation Network (BiSeNet). The proposed architecture, shown in Fig. 4, achieves an optimal balance between segmentation performance and speed on datasets like Cityscapes, CamVid, and COCO-Stuff.
To minimize the loss of spatial details, most current approaches use a U-shaped architecture, but it has two major drawbacks: (1) the U-shaped structure can slow down the model due to the computational power required for high-resolution feature maps, and (2) most of the spatial information lost in the cropping processes cannot be recovered. To combat this, the authors propose the Bilateral Segmentation Network (BiSeNet) with two sections: a Context Path (CP) and a Spatial Path (SP). These components are designed to resolve the loss of spatial information and the shrinkage of the receptive field.
On testing, it was found that this method could obtain a large receptive field very rapidly. With rich spatial details and a large receptive field, this architecture achieves 68.4% mIoU on the Cityscapes dataset at 105 FPS.

Fig. 4 BiSeNet architecture proposed by the authors. Image recreated from [32]

2.3.4 ICNet for Real-Time Semantic Segmentation on High-Resolution Images

Zhao et al. [33] proposed an image cascade network (ICNet). This method utilizes multi-resolution branches under proper label guidance to reduce the amount of computational power required for pixel-wise inference. They provide a detailed analysis of their framework and introduce a cascade feature fusion unit to achieve real-time segmentation with high accuracy.
After an in-depth analysis of the time budget, the authors developed ICNet as a high-efficiency segmentation method. It exploits the processing efficiency of low-resolution images and the high inference quality of high-resolution images. In this approach, the low-resolution images are first passed through a full semantic perception network to generate a coarse prediction map. In the second step, cascade feature fusion units and cascade label guidance strategies combine the medium- and high-resolution features, which refine the coarse prediction map. This architecture achieves a 5× improvement in inference time and a 5× reduction in memory consumption, and can run at 30 FPS on 1024 × 2048 input from various datasets such as Cityscapes, CamVid, and COCO-Stuff.

3 Discussion

Sensor fusion can be used to improve the quality of available data, decrease noise, increase reliability, estimate unmeasured states, and increase the coverage of the sensors. There are various issues in combining multi-sensor data, such as calibration, data parallax, and synchronization. We saw various techniques for how sensor configurations should be used, how to achieve proper data synchronization through automation, and ways of achieving better sensor fusion using techniques like the Kalman filter, fuzzy logic, and deep learning-based models. The work in a few of the mentioned studies has only been done on machine vision, laser radar, IMU, and speed sensors; it needs to be adapted to fuse data from the additional sensors required in an autonomous vehicle.
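As a minimal illustration of Kalman-filter-style fusion, two noisy estimates of the same quantity can be combined with an inverse-variance weighted update. The sensor readings below are made up for illustration:

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two noisy estimates of the same state (Kalman update step):
    the result is an inverse-variance weighted average with reduced variance."""
    k = var_a / (var_a + var_b)          # Kalman gain
    mean = mean_a + k * (mean_b - mean_a)
    var = (1 - k) * var_a
    return mean, var

# Illustrative: radar and lidar range estimates of the same obstacle.
mean, var = fuse(10.0, 4.0, 12.0, 1.0)
print(round(mean, 2), round(var, 2))  # 11.6 0.8
```

Note that the fused variance (0.8) is lower than either sensor's variance alone, which is the sense in which fusion "decreases noise and increases reliability."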
We discussed how to detect lanes using computer vision and deep learning techniques. Deep learning approaches outperform classical computer vision due to their ability to remember, retrieve, and generalize from given data through appropriate weight assignment and modification. LaneNet's two-stage architecture adds additional attractive properties: its procedure refines the parameters of the lane line localization network in a weakly-supervised manner, alleviating the high demand for well-annotated training samples. YOLinO's approach has significant benefits over previous approaches: it is not only well-suited for real-time applications, running at 187 FPS with almost no restrictions on the forms of the observed polylines, but can also distinguish both dotted and continuous lines. LaneATT is a single-stage anchor-based model for lane detection that achieves the second-highest reported F1 on TuSimple despite being much quicker than the top-F1 method (171 vs. 30 FPS). RONELD is a solution with a low computational cost, making it ideal for real-time use.
The recent progress of semantic segmentation for drivable area segmentation using deep learning techniques has produced effective results. RefineNet is a powerful segmentation architecture that belongs to the family of encoder-decoder segmentation networks. BiSeNet leads with 105 FPS at high resolution on an Nvidia Titan XP, while the pre-trained ImageNet architecture shows the highest mIoU of 75.5% at 39.9 FPS. For low-end purposes, Light-weight RefineNet can be used, as it provides a decent 55 FPS with 72.1% mIoU on a GTX 1080 Ti.

4 Conclusion

Sensor fusion is the most important bridge between raw real-world data and the decision-making unit of any autonomous vehicle. Improving sensor fusion techniques is one of the promising ways to improve the performance and safety of autonomous vehicles. Lane detection poses various problems, such as weather conditions, road conditions, and the computation needed; it is necessary to choose a lane detection technique with good performance so that the vehicle system understands the roads well. The segmentation of drivable areas is likewise a critical capability for achieving autonomous navigation. One of the main issues faced here is computational cost, and the trade-off between quality and real-time response needs to be handled carefully according to the use cases of the system.
The research works studied in this paper were identified from recent papers with state-of-the-art techniques, and a thorough investigation was carried out to select important ones that will prove helpful in further studies and discussions. An in-depth analysis was also done regarding the performance, response time, and computational cost to highlight the pros and cons of the mentioned methods.
All these modules of sensor fusion, lane detection, and drivable area segmentation
were discussed in detail and can serve as a strong base for building a proper, efficient,
and safe autonomous vehicle system.

References

1. World Health Organization, Road traffic injuries. Retrieved from https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries (7 Feb 2020). Accessed 5 Oct 2021
2. B. Shahian Jahromi, T. Tulabandhula, S. Cetin, Real-time hybrid multi-sensor fusion framework
for perception in autonomous vehicles. Sensors 19(20), 4357 (2019)
3. J. Dey, W. Taylor, S. Pasricha, VESPA: A framework for optimizing heterogeneous sensor
placement and orientation for autonomous vehicles. IEEE Consum. Electron. Mag. 1–1 (2020)
4. L. Fridman, D.E. Brown, W. Angell, I. Abdić, B. Reimer, H.Y. Noh, Automated synchronization
of driving data using vibration and steering events. arXiv:1510.06113v2 [cs.RO] (2016)
5. M. Pereira, D. Silva, V. Santos, P. Dias, Self calibration of multiple LIDARs and cameras on
autonomous vehicles. Robot. Auton. Syst. 83, 326–337 (2016)
6. T. Kim, T.-H. Park, Extended Kalman Filter (EKF) design for vehicle position tracking using
reliability function of radar and lidar. Sensors 20(15), 4126 (2020)
7. Q. Li, J.P. Queralta, T.N. Gia, Z. Zou, T. Westerlund, Multi-sensor fusion for navigation and
mapping in autonomous vehicles: accurate localization in urban environments, World Scientific
Pub Co Pte Lt. Unmanned Syst. 08(03), 229–237 (2020). https://doi.org/10.1142/S23013850
20500168
8. V. Subramanian, T.F. Burks, W.E. Dixon, Sensor fusion using fuzzy logic enhanced Kalman
Filter for autonomous vehicle guidance in citrus groves. Trans. ASABE 52(5), 1411–1422
(2009)
9. S. Shinde, U. Kulkarni, Extended fuzzy hyperline-segment neural network with classification
rule extraction. Neurocomputing 260 (2017)
10. J. Fayyad, M.A. Jaradat, D. Gruyer, H. Najjaran, Deep learning sensor fusion for autonomous
vehicle perception and localization: A review. Sensors 20(15), 4220 (2020)
11. V. John, S. Mita, Deep feature-level sensor fusion using skip connections for real-time object
detection in autonomous driving. Electronics 10(4), 424 (2021)
12. Massachusetts Institute of Technology, in MIT 6.S094: Deep Learning for Self-Driving Cars.
Retrieved from https://deeplearning.mit.edu. (Accessed: 5 Oct 2021) (2020)
13. H. Cho, Y.-W. Seo, B.V.K.V. Kumar, R.R. Rajkumar, A multi-sensor fusion system for moving
object detection and tracking in urban driving environments. in 2014 IEEE International
Conference on Robotics and Automation (ICRA) (2014), pp. 1836–1843. https://doi.org/10.
1109/ICRA.2014.6907100
14. D. Ma, N. Shlezinger, T. Huang, Y. Liu, Y.C. Eldar, Joint radar-communications strategies for
autonomous vehicles. arXiv:1909.01729v2 [cs.IT] (2020)
15. M. Lange, J. Detlefsen, 94 GHz three-dimensional imaging radar sensor for autonomous
vehicles. IEEE Trans. Microwave Theory Tech. 39(5), 819–827 (1991)
16. J. Rapp, J. Tachella, Y. Altmann, S. McLaughlin, V.K. Goyal, Advances in single-photon lidar
for autonomous vehicles: Working principles, challenges, and recent advances. IEEE Signal
Process. Mag. 37(4), 62–71 (2020)
17. J. Liu, Q. Sun, Z. Fan, Y. Jia, TOF lidar development in autonomous vehicle, in 2018 IEEE
3rd Optoelectronics Global Conference (OGC) (2018), pp. 185–190. https://doi.org/10.1109/
OGC.2018.8529992

18. Y. Zhu, C. Li, Y. Zhang, Online camera-LiDAR calibration with sensor semantic information.
IEEE Int. Conf. Robot. Autom. (ICRA) 2020, 4970–4976 (2020)
19. E.-S. Kim, S.-Y. Park, Extrinsic calibration between camera and LiDAR sensors by matching
multiple 3D planes. Sensors 20(1), 52 (2020)
20. M. Oliveira, A. Castro, T. Madeira, E. Pedrosa, P. Dias, V. Santos, A ROS framework for the
extrinsic calibration of intelligent vehicles: A multi-sensor, multi-modal approach. Rob. Auton.
Syst. 131 (2020)
21. V. Ilci, C. Toth, High definition 3D map creation using GNSS/IMU/LiDAR sensor integration
to support autonomous vehicle navigation. Sensors 20(3), 899 (2020)
22. Y. Liu, X. Fan, C. Lv, J. Wu, L. Li, D. Ding, An innovative information fusion method with
adaptive Kalman filter for integrated INS/GPS navigation of autonomous vehicles. Mech. Syst.
Signal Process. 100, 605–616 (2018)
23. F. Nobis, M. Geisslinger, M. Weber, J. Betz, M. Lienkamp, A deep learning-based radar and
camera sensor fusion architecture for object detection. Sens. Data Fusion Trends Solutions
Appl. (SDF) 1–7 (2019). https://doi.org/10.1109/SDF.2019.8916629
24. S. Chung, J. Lim, K.J. Noh, G. Kim, H. Jeong, Sensor data acquisition and multimodal sensor
fusion for human activity recognition using deep learning. Sensors 19(7), 1716 (2019). https://
doi.org/10.3390/s19071716
25. Z. Wang, W. Ren, Q. Qiu, LaneNet: Real-time lane detection networks for autonomous driving.
arXiv:1807.01726 [cs.CV] (2018)
26. D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, L. Van Gool, Towards End-to-End
lane detection: An instance segmentation approach. arXiv:1802.05591 [cs.CV] (2018)
27. A. Meyer, P. Skudlik, J.-H. Pauls, C. Stiller, YOLinO: Generic single shot polyline detection
in real time arXiv:2103.14420 [cs.CV] (2021)
28. L. Tabelini, R. Berriel, T.M. Paixão, C. Badue, A.F. De Souza, T. Oliveira-Santos, Keep your
eyes on the lane: Real-time attention-guided lane detection. arXiv:2010.12035 [cs.CV] (2020)
29. Z.M. Chng, J.M.H. Lew, J.A. Lee, RONELD: Robust neural network output enhancement for
active lane detection. arXiv:2010.09548 [cs.CV] (2020)
30. M. Oršić, I. Krešo, P. Bevandić, S. Šegvić, In defense of Pre-trained ImageNet architectures for
real-time semantic segmentation of road-driving images. arXiv:1903.08469 [cs.CV] (2019)
31. V. Nekrasov, C. Shen, I. Reid, Light-weight RefineNet for real-time semantic segmentation.
arXiv:1810.03272 [cs.CV] (2018)
32. C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, BiSeNet: Bilateral segmentation network for
real-time semantic segmentation. arXiv:1808.00897 [cs.CV] (2018)
33. H. Zhao, X. Qi, X. Shen, J. Shi, J. Jia, ICNet for real-time semantic segmentation on high-
resolution images. arXiv:1704.08545 [cs.CV] (2018)
Identification of Malicious Access in IoT
Network from Connection Traces
by Using Light Gradient Boosting
Machine

Etuari Oram, Bighnaraj Naik, and Manas Ranjan Senapati

Abstract In this paper, a light gradient boosting machine (LightGBM) ensemble learning model has been used for the identification of malicious access in an IoT network from connection traces. The proposed approach starts with an exclusive feature bundling (EFB) process to combine features that are mutually exclusive. Then, Gradient-based One-Side Sampling (GOSS) is used to generate a re-sampled dataset in which data selection is based on the loss of the initial model and the absolute values of the gradients. The final model is designed using the re-sampled data, in which efficient splitting is achieved through information gain. The performance of the proposed model is compared with state-of-the-art ensemble learning and machine learning based models in order to assess its overall generalization performance and efficiency.

Keywords IoT security · Ensemble learning · Light gradient boosting machine

1 Introduction

The Internet of Things (IoT) has created a system in which people are immensely connected through networks and IoT devices. It has transformed their lifestyle and has a significant impact on the way they carry out every computational activity, ultimately moving them from traditional to modern practices. The IoT is an exciting way to connect thousands or billions of devices for data collection, analysis, sensing, and control of other devices. It has opened new opportunities and continuously provides solutions to many current difficult challenges. Due to the exponential growth of IoT applications within a short period of time, trillions of IoT gadgets may connect to
E. Oram · M. R. Senapati
Department of Information Technology, Veer Surendra Sai University of Technology, Burla,
Sambalpur, Odisha 768018, India
B. Naik (B)
Department of Computer Application, Veer Surendra Sai University of Technology, Burla,
Sambalpur, Odisha 768018, India
e-mail: bnaik_mca@vssut.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 711
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_53
712 E. Oram et al.

the internet by the end of 2022. At the same time, as demand for IoT devices grows and more applications emerge, the design and architecture of working models become increasingly complicated. The expansion of IoT environments introduces many redundant risk factors into systems, knowingly or unknowingly, signaling danger in the near future. As the IoT operates on data-based applications and data-driven models for people, it becomes more complicated than manual approaches. As a result, the need for a suitable solution for processing massive data in any complex scenario is self-evident. The main concerns of any IoT-connected device and its application areas are security and privacy, and the steps taken to address security concerns. Distributed attacks and cross-platform scripting are two other major IoT security concerns. As the majority of IoT devices are adapted for common household applications, the framework must ensure that they are used safely and securely. However, dealing with potential attacks such as denial of service, man-in-the-middle attacks, congestion, data interruption, malware threats, hacking, and tampering is a difficult issue for IoT infrastructure. IoT security issues fall into two categories: technological obstacles and security management concerns [1]. The first is associated with the technical hurdles of electronic devices and the nature of their working in a virtual environment, while the second entails the technicalities of the software framework's failure. Technical issues can be resolved through adaptive physical intervention by professionals, but the second category demands systems enabled with trust-based authorization and point-to-point connectivity. Even as all linked devices are innovated and upgraded, IoT-based security must confirm command execution and strengthen the framework. There are several approaches to dealing with IoT device security issues, including requiring all connected IoT devices to confirm each other's operations. Most importantly, each IoT device should first make clear its own functionality before making a connection with other devices for communication. Within distributed and high-performance computing networks with limited computational energy, the IoT system needs to retain the confidentiality, accessibility, availability, and consistency of information.
In this study, a Light Gradient Boosting Machine (LightGBM) ensemble learning approach [2] is used for the identification of intrusive behaviors in an IoT framework. The ensemble method helps enhance the performance of a single model by combining a variety of independent models. The work consists of two parts: the first part focuses on feature construction, considering the predictions of base classifiers to detect anomalies, and the second part focuses on the construction of meta-learners. The proposed technique gives better output in comparison with the other ML approaches, with a minor difference in performance parameters. The following are the major contributions of this study:
i. An optimized, advanced ensemble learning model, LightGBM, has been used for the identification of malicious activities in the IoT network.
ii. The performance of the proposed model is compared with state-of-the-art ensemble learning and machine learning based models in order to assess overall generalization performance and efficiency.
Identification of Malicious Access in IoT Network from Connection Traces … 713

The rest of the article is organized as follows: Sect. 2 includes the problem formulation, followed by the representation of the proposed model; experimental results and analysis are discussed in Sect. 3; and Sect. 4 presents the conclusion and future scope of this research.

2 Proposed Work

This proposed study makes use of LightGBM for prediction of the anomaly type. Let I = {I1, I2, …, In} be the past IoT network activity traces with anomaly instances collected over time, and Ii = (Ii,1, Ii,2, …, Ii,m, ai) the ith instance of the past activity traces with anomaly label ai. An IoT device's connection trace and communication profile are considered as a vector of m features, and the anomaly type ai can be any of the anomaly types ai ∈ a. The attack type 'a' can be visualized as a = {a1, a2, a3, a4, a5, a6, a7, a8}, where a1 = dP (dataProbing), a2 = DoS (DoSattack), a3 = mC (malitiousControl), a4 = mO (malitiousOperation), a5 = spying, a6 = scan, a7 = wSU (wrongSetUp), and a8 = Normal. We have considered an IoT dataset from Kaggle [3] for the experimentation. This is synthetic data collected by simulation through a virtual setup called DS2OS (Distributed Smart Space Orchestration System). The dataset is a collection of communication traces between a number of IoT nodes connected in the network. It has 357,952 samples with 13 features and eight classes. It includes a variety of anomalies, namely dP, DoS, mC, mO, scan, spying, and wSU, with distribution percentages of 3.4%, 57.7%, 8.8%, 8%, 15.4%, 5.3%, and 1.2%, respectively. This is imbalanced data, and hence random naive oversampling [4, 5] is used in this work prior to training the model in order to get rid of the class imbalance problem [6]. This makes the sample distribution in the dataset uniform, roughly achieving a distribution of 14.28% per class.
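The random naive oversampling step can be sketched with NumPy alone (a minimal stand-in for the imbalanced-learn utilities cited in [4–6]); the toy class labels below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_oversample(X, y):
    """Randomly duplicate minority-class samples (with replacement) until
    every class matches the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        idx = np.where(y == cls)[0]
        # draw extra indices with replacement up to the majority-class count
        extra = rng.choice(idx, size=target - count, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# toy imbalanced data: 90 'Normal' samples vs 10 'DoS' samples
X = rng.normal(size=(100, 3))
y = np.array(["Normal"] * 90 + ["DoS"] * 10)
X_bal, y_bal = random_oversample(X, y)
```

After oversampling, every class contributes the same number of samples, which is what yields the roughly uniform 14.28% per-class distribution mentioned above.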
 
âi = lightGBM(Ii)    (1)

In Eq. 1, âi represents the predicted anomaly type, Ii is the ith instance of a past network trace without any label (i.e., without activity-type information), and lightGBM(Ii) is the prediction on Ii. Min–max scaling and label encoding have been used for data pre-processing. In this experiment, poor prediction performance was observed for machine learning models such as DT (Decision Tree), LDA (Linear Discriminant Analysis), MLP (Multi-Layer Perceptron), NB (Naïve Bayes), and LR (Logistic Regression). Therefore, an efficient boosting ensemble learning technique, LightGBM, has been used, which is lightweight in terms of computation yet efficient compared with other boosting approaches. The major steps of the proposed approach are:
(i) using exclusive feature bundling (EFB) to combine features that are mutually exclusive; (ii) designing an initial model θ0 with minimum loss; (iii) computing the absolute values of the gradients; (iv) using gradient-based one-side sampling (GOSS) to generate re-sampled datasets S1 and S2, and merging S1 and S2 to create a dataset D* = S1 ∪ S2; (v) finding the optimal split by calculating the information gain; (vi) updating the model θm = θm−1 + θm*; and (vii) obtaining the final model θM (Fig. 1).

Fig. 1 Proposed model

3 Simulation Results

Table 1 presents the performance of DT, LDA, MLP, NB, LR, RF (Random Forest) [7], Bagging [8], AdaBoost [9], GBT [10], XGBoost [11], and the proposed LightGBM. The performance of the studied models is evaluated by considering various performance metrics such as precision, recall, F1 score, F2 score, Fbeta score, and ROC-AUC. The anomaly-type predictions of the studied models are presented in Fig. 2. It is observed that the proposed model performs best on most of the performance evaluation metrics (Table 1) and in anomaly prediction (Fig. 2).

Table 1 Performance evaluation metrics of all the prediction models

Models | Precision | Recall | F1 score | F2 score | Fbeta score | ROC-AUC
DT | 0.901285 | 0.901285 | 0.901285 | 0.900172 | 0.903377 | 0.943678
LDA | 0.830596 | 0.830596 | 0.830596 | 0.819735 | 0.810906 | 0.902646
MLP | 0.773830 | 0.773830 | 0.773830 | 0.770598 | 0.798307 | 0.869642
NB | 0.696001 | 0.696001 | 0.696001 | 0.686180 | 0.686291 | 0.827997
LR | 0.562146 | 0.562146 | 0.562146 | 0.562056 | 0.562056 | 0.602056
RF | 0.973578 | 0.973578 | 0.973578 | 0.973327 | 0.975152 | 0.985035
Bagging | 0.900651 | 0.900651 | 0.900651 | 0.899178 | 0.901380 | 0.942476
AdaBoost | 0.999375 | 0.999375 | 0.999375 | 0.999374 | 0.999375 | 0.999652
GBT | 0.982147 | 0.982147 | 0.982147 | 0.981934 | 0.982086 | 0.990031
XGBoost | 0.998571 | 0.998571 | 0.998571 | 0.998569 | 0.998578 | 0.999207
Proposed LightGBM | 0.999642 | 0.999642 | 0.999642 | 0.999642 | 0.999643 | 0.999801
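The tabulated metrics can be computed with scikit-learn's metric functions. Note that under micro-averaging, precision, recall, and F1 coincide in multi-class problems, which is consistent with the identical first three columns in Table 1 (the toy labels below are illustrative, not the paper's outputs):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Toy 8-class predictions with ~10% of labels corrupted
rng = np.random.default_rng(0)
y_true = rng.integers(0, 8, size=500)
y_pred = y_true.copy()
flip = rng.random(500) < 0.1
y_pred[flip] = rng.integers(0, 8, size=int(flip.sum()))

p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
# F2 weights recall twice as heavily as precision (beta = 2)
f2 = fbeta_score(y_true, y_pred, beta=2, average="weighted")
```

Micro-averaged precision and recall both reduce to overall accuracy in single-label multi-class settings, so all three values are equal by construction.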

4 Concluding Remarks and Future Directions

Communication between the several IoT devices in a network must be secured to maintain the necessary protection. In line with this, and given the increasing use of IoT devices, a system that monitors and identifies suspicious activities in the network is desirable. In this work, we have used LightGBM for identifying anomalies in the IoT environment. The proposed model uses past IoT devices' connection traces, labeled with various anomaly types, to design a model that can identify anomalies in future connection traces. Due to the increasing use of IoT devices in many applications, security, privacy, and reliability have become challenging goals to achieve. Although no system can provide a complete security solution, this approach may serve as an add-on to existing security solutions.
716 E. Oram et al.

Fig. 2 Models' predictions of anomalies: a DT, b LDA, c MLP, d NB, e LR, f RF, g Bagging, h AdaBoost, i GBT, j XGBoost, k proposed LightGBM

References

1. P.N. Mahalle, B. Anggorojati, N.R. Prasad, R. Prasad, Identity authentication and capability based access control (IACAC) for the Internet of Things. J. Cyber Secur. Mobil. 1, 309–348 (2013)
2. G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.Y. Liu, Lightgbm: A highly
efficient gradient boosting decision tree. Adv. Neural Inf. Proc. Syst. 30, 3146–3154 (2017)
3. M.-O. Pahl, F.-X. Aubet, DS2OS traffic traces, 2018, (https://www.kaggle.com/francoisxa/ds2ostraffictraces). [Online; Accessed 29 Dec 2018]
4. N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-
sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
5. A. Liu, J. Ghosh, C.E. Martin. Generative oversampling for mining imbalanced datasets, in
DMIN (2007), pp. 66–72
6. G. Lemaître, F. Nogueira, C.K. Aridas, Imbalanced-learn: A python toolbox to tackle the curse
of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017)
7. S. Messaoud, A. Bradai, S.H.R. Bukhari, P.T.A. Quang, O.B. Ahmed, M. Atri, A survey on machine learning in internet of things: Algorithms, strategies, and applications. Internet of Things, 100314 (2020)
8. L. Breiman, Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
9. T. Hastie, S. Rosset, J. Zhu, H. Zou, Multi-class adaboost. Stat. Interface 2(3), 349–360 (2009)
10. J.H. Friedman, Greedy function approximation: A gradient boosting machine. Ann. Stat. 1189–
1232 (2001)
11. T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, Xgboost: extreme gradient
boosting. R Package Version 0.4–2 1(4) (2015)
Big Data in Education: Present
and Future

Janmenjoy Nayak, H. Swapnarekha, Ashanta Ranjan Routray, Soumya Ranjan Nayak, and H. S. Behera

Abstract Every technology aims to make human life simpler and more comfortable. Big data is used to extract important information from huge quantities of structured and unstructured data. Over the last few years, the usage of learning management systems has risen rapidly. Students have started using mobile phones, mainly smartphones, which have become part of their daily routine, to access online content. The large amount of idle information produced by students' online activities cannot be processed further by conventional means. This scenario has resulted in the diffusion of big data techniques into the education field. Big data has changed the learning style by making it simple, unproblematic, and exciting, which has driven the enhancement of big data techniques for processing large quantities of data. This study looks into the challenges faced by many students and the latest applications of big data technologies in educational sectors.

Keywords Big data · Education · Dropout · Recommender system · Learner's behavior

J. Nayak (B)
Department of Computer Science, Maharaja Sriram Chandra BhanjaDeo (MSCB) University,
Baripada, Odisha 757003, India
H. Swapnarekha
Department of Information Technology, Aditya Institute of Technology and Management
(AITAM), Tekkali, K Kotturu, AP 532201, India
A. R. Routray
Department of ICT, Fakir Mohan University, Balasore, Odisha 756019, India
S. R. Nayak
Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, India
H. S. Behera
Department of Information Technology, Veer Surendra Sai University of Technology, Burla,
Sambalpur, Odisha 768018, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 721
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_54

1 Introduction

Nowadays, the Internet plays a vital role, and the online community is growing vastly. Billions of Internet users produce a huge amount of data and transmit that information to other users [1]. Here, data refers to the characters, symbols, or quantities that are operated on by a computer. The term 'big data' refers to any kind of data [2] that is so huge that traditional applications are not sufficient to process it; no conventional data management tool is capable of storing or processing it efficiently. Examples of big data include the quantity of data collected on the Internet every day, Twitter feeds, mobile-phone location information, stock exchanges, social media sites, jet engines, YouTube video views, etc. Big data is of three types: structured, unstructured, and semi-structured. Structured data possesses a predefined format and is easy to access. Unstructured data is in its natural form and is not formatted until its usage. The combination of both structured and unstructured data is semi-structured. Volume, variety, velocity, and variability are the main characteristics of big data [3]. Volume relates to size, which plays a vital role in deriving value from data. Variety is associated with the various sources and forms of data. Velocity is the speed at which data is produced, and variability is the inconsistency that hampers the ability to hold and manage the data efficiently. Better decision-making, enhanced customer service, and improved operational effectiveness are a few advantages of big data.
We are in the era of data, where a big quantity of data can be acquired [4]. Generally, data is produced in all sectors, such as aviation, social media, and sports, as well as in education. The major focus of this study is to investigate the role of big data in the educational field and to examine various domains of education for better solutions. The role of big data in this domain offers various advantages to students as well as educational institutions. The education sector has been important for both individuals and society. On the one hand, a wealthy economy requires experienced workers with the ability to initiate and build businesses [2]. On the other hand, persons with career objectives will always look to stay at the cutting edge of education and skill acquisition. In current learning environments, users learn through various online channels such as online chats, discussion forums, and instant messaging. To learn the required content, students have started browsing the Internet to access their courses. Due to these student activities, a huge amount of data is produced by learning management systems. Moreover, many educational organizations have also created large data stores through applications that manage classes, students, and courses. As a result, the amount of available data is massive, and conventional data processing methods cannot handle it. Consequently, educational institutions have started exploring big data technologies to process educational data.

2 Applications of Big Data in Education

The quality of education can be enhanced by applying big data within various settings of the education system, such as administration, student learning, and the teaching delivery process. The application of big data in education allows educational institutions to understand the challenges encountered by students and to find strategies that address these issues. Figure 1 depicts some of the applications of big data in education.

2.1 Prediction of Student Performance

Providing quality education and enhancing the performance of students are considered fundamental objectives of the educational system. Therefore, predicting students' performance assists the teacher in identifying and supporting weak students, which in turn enhances the overall quality of education. Various prediction models have been proposed by several researchers for predicting student performance. Punlumjeak et al. [5] have proposed a feature selection and machine learning model for predicting student performance in a cloud environment. The proposed model is used for finding problem areas. In addition, it is also used to understand the factors that impact the performance of the students. Further, the proposed model has been evaluated on data of Rajamangala University of Technology students, and the empirical results indicate that feature selection with a neural network classifier obtained a better accuracy of 90.60% when compared with feature selection with a decision tree. To enhance the quality of educational institutions, a K-nearest neighbor (KNN) classification approach has been suggested by Nagesh et al. [6]. The suggested approach predicts the performance of students in the end-semester examination using a Hadoop MapReduce environment. Table 1 illustrates other research work carried out on the prediction of students' performance.

Fig. 1 Applications of big data in education (prediction of student performance, prediction of student dropout, course recommender system, analyzing learners' behavior, construction of curriculum)

Table 1 Comparative analysis of other works performed on prediction of student performance

Author and year | Technique | Parameter | Analysis | References
Almasri et al. (2020) | Cluster-based ensemble meta-based tree model | Students' grades of 22 MIS courses | Obtained 96.96% accuracy | [7]
Varela et al. (2019) | Fuzzy C-means | Mandatory ninth-year subject grades | Categorizes students into various groups based on performance | [8]
Hamoud et al. (2018) | J48 algorithm | Questionnaires on health, social activity, relationships, and student performance | J48 obtains better performance when compared with the random tree and REPTree algorithms | [9]
Hasan et al. (2018) | Random forest tree | Students' academic information and activity | Random forest obtains better performance in improving students' grades | [10]
Singh et al. (2016) | K-means clustering | X, XII, B.Tech marks, projects, and internships | Categorizes students into various groups based on performance | [11]
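The KNN-based prediction suggested by Nagesh et al. [6] can be sketched (outside a Hadoop MapReduce environment) with scikit-learn; the two features and pass/fail labels below are purely illustrative, not from the cited study:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per student: [internal marks %, attendance %]
X_train = np.array([[85, 90], [40, 55], [78, 82], [35, 40],
                    [90, 95], [50, 60], [88, 85], [30, 45]])
y_train = np.array(["pass", "fail", "pass", "fail",
                    "pass", "fail", "pass", "fail"])

# Each new student is labeled by a majority vote of the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
pred = knn.predict(np.array([[82, 88], [38, 50]]))   # → ["pass", "fail"]
```

The same nearest-neighbor vote scales to the full feature set; at MapReduce scale, the distance computations are what get distributed.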

2.2 Prediction of Student Dropout

According to the statistics of the American Community Survey (ACS), an overall dropout rate of 5.3%, amounting to 2.1 million students between the ages of 16 and 24, was observed in the year 2018 [12]. Society also suffers because of student dropout, as the country's productive capacity is impaired by the lack of a skilled workforce [13]. Therefore, a high rate of student dropout is considered a serious threat to educational institutions by academicians and policymakers. Hence, developing an early dropout warning system may assist educational institutions in identifying students who are at risk of dropping out and in implementing strategies to retain them. To enhance the efficiency of dropout prediction, an early dropout warning system based on machine learning approaches has been suggested by Lee and Chung [14]. To address the class imbalance problem, the suggested model makes use of the synthetic minority oversampling technique (SMOTE) and ensemble approaches in machine learning. Further, the performance of the model on the data of 165,715 high school students from the National Education Information System (NEIS) in South Korea has been evaluated using receiver operating characteristic (ROC) and precision–recall (PR) curves. From the results, it is noticed that the boosted decision tree attains more accurate predictions in comparison with other machine learning approaches. Other works on the prediction of student dropout are illustrated in Table 2.
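The SMOTE technique used by Lee and Chung [14] generates synthetic minority samples rather than duplicating existing ones. A minimal sketch of its core interpolation idea (not the reference imbalanced-learn implementation; the "dropout" feature values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_like(X_min, n_new, k=3):
    """Generate synthetic minority samples by interpolating each sample
    toward one of its k nearest minority-class neighbours (SMOTE's core idea)."""
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# 12 minority-class ("dropout") samples in a 2-D feature space
X_minority = rng.normal(loc=5.0, scale=0.5, size=(12, 2))
X_synth = smote_like(X_minority, n_new=30)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the minority class's feature region instead of merely repeating it.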

2.3 Course Recommender System

Generally, students are provided with a wide choice of courses they can select for formal or informal classroom learning at the secondary and higher-education levels. As beginners, it is always difficult for students to choose the right course. Normally, students make a selection based on the recommendations of seniors, on a teacher's expertise in a subject, or even on the difficulty or attractiveness of the course. Certainly, such a selection process cannot provide an overall evaluation of the candidate's fit for the course. Therefore, an intelligent course recommender system is needed to assign relevant courses to students. A recommender system based on collaborative filtering has been proposed by Dwivedi and Roshni [20]. Based on the grade points attained by the candidate in other subjects, the proposed system suggests elective courses to be selected by the candidate. In addition, the item-based recommendation of the Mahout machine learning library has been utilized by the authors on top of the Hadoop framework to produce sets of recommendations. The patterns between grades and subjects have been identified using log-likelihood similarity. Further, the performance of the recommendation has been evaluated using the root mean square error between actual and recommended grades. Table 3 represents the analysis of other works carried out on course recommender systems.
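The item-based idea in [20] can be illustrated with a simplified sketch: cosine similarity over co-rated grades stands in here for Mahout's log-likelihood similarity, and the student-by-course grade matrix is hypothetical:

```python
import numpy as np

# Hypothetical student x course grade matrix (0 = course not taken)
R = np.array([
    [9, 8, 0, 7],
    [8, 9, 6, 0],
    [3, 4, 8, 2],
    [9, 7, 0, 8],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two course columns, using only the
    students who took (rated) both courses."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] /
                 (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

n_items = R.shape[1]
S = np.array([[cosine_sim(R[:, i], R[:, j]) for j in range(n_items)]
              for i in range(n_items)])

def predict(student, item):
    """Predict the grade in an untaken course as a similarity-weighted
    average of the student's grades in the courses already taken."""
    rated = np.where(R[student] > 0)[0]
    w = S[item, rated]
    return float(w @ R[student, rated] / w.sum())

score = predict(0, 2)   # expected grade of student 0 in course 2
```

Courses with the highest predicted grades would then be recommended as electives; evaluating with RMSE against held-out grades mirrors the evaluation in [20].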

2.4 Analyzing Learners Behavior

Table 2 Distinct works performed on prediction of student dropout

Author and year | Technique | Attributes selected | Analysis | References
Tenpipat and Khajonpong (2020) | Gradient boosting, decision tree and random forest | Student's academic year, high school GPA, channel of university admission, student's faculty, and gender | Gradient boosting obtained a better performance of 93% when compared with other models | [15]
Wu et al. (2019) | CLMS-Net | Student behavior | Automatically extracts features and obtains more accurate dropout predictions than conventional machine learning approaches | [16]
Hegde et al. (2018) | Naive Bayes classification algorithm | Academics, psychological factors, teacher's opinion, health issues, student behavior, demographic factors | For early identification of dropout students | [17]
Haiyang et al. (2018) | Time series forest (TSF) classification algorithm | Student behavior data | Obtains a prediction accuracy of 84% with only 5% of the data | [18]
Márquez-Vera et al. (2016) | Modified interpretable classification rule mining algorithm | Sixty different attributes like age, grade point average, attendance | Predicts student dropout within the first 4–6 weeks of the course | [19]

Due to the advancement of online learning, a large volume of student activity data is available. To identify students at risk, it is necessary to analyze large volumes of student activity data and find the patterns of students' learning behavior. In addition, analyzing student behavior is essential for building student-centered learning systems. Understanding students' behavior further assists both teachers and students in attaining educational goals. A systematic review on the prediction of students at risk has been performed by Na and Tasir [26]. The authors have analyzed students' learning behavior in online learning in conjunction with the analytical methods and types of data used. From the findings, it was observed that almost all the analytical methods and various types of data have successfully predicted the students at risk. Other works on the analysis of learners' behavior are shown in Table 4.
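Several of the works above cluster learners' activity traces to surface at-risk groups. A minimal sketch of that idea with scikit-learn's KMeans, using hypothetical activity features (weekly logins, video minutes, quiz score — all illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical activity features per learner:
# [logins per week, video minutes watched, quiz score]
engaged = rng.normal([10, 120, 80], [2, 15, 5], size=(40, 3))
at_risk = rng.normal([2, 20, 45], [1, 8, 8], size=(20, 3))
X = np.vstack([engaged, at_risk])

# Two clusters: k-means should recover the engaged vs at-risk split
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

In practice, the cluster whose centroid shows low activity flags the learners to contact early, mirroring the early-warning use cases in Table 4.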
Table 3 Analysis of other works carried out on course recommender systems

Author and year | Technique | Dataset | Analysis | References
Mondal et al. (2020) | K-means clustering algorithm and collaborative filtering | Three hundred student records with 22 attributes | Provides a personalized environment for each learner based on grades | [21]
MA et al. (2019) | Extended association rule mining | E-learning system at Kyushu University | Enhances the performance of the course recommendation system compared with traditional association rules | [22]
Alghamdi et al. (2019) | Fuzzy expert system | Survey data from 239 prospective university participants and 392 university participants | Assists students in selecting their university major | [23]
Zhang et al. (2018) | Distributed association rule mining algorithm | Three benchmark datasets: T10I4D100K, T25I10D10K and HMXPC13_DI_v2_5-14-14.csv | The distributed association rule mining algorithm provides more efficiency than the conventional Apriori algorithm | [24]
Xiao et al. (2018) | Combinational algorithm | Shanghai lifelong learning network | The recommended system enhances the utilization rate of educational resources and also improves the learning autonomy of students | [25]

2.5 Construction of Curriculum

The main purpose of integrating big data in education is to assist policymakers and academicians in automatically designing courses and learning content. Further, it also encourages the exchange of current learning resources among distinct systems. Table 5 represents the works carried out on the construction of curriculum using big data technology.

Table 4 Other works carried out on the analysis of learners' behavior

Author and year | Technique | Dataset | Analysis | References
Purwoningsih et al. (2019) | Exploratory data analysis and K-means algorithm | 118 students' data registered in the e-learning course of Basic Physics 1 | Assists teachers in enhancing pedagogy before the application of learning | [27]
Al Fanah and Muhammad (2019) | Random forests, logistic regression and Bayesian networks | Students' academic performance dataset (xAPI-Edu-Data) from Kaggle | Bayesian network attains 80% accuracy in understanding e-learners' behavior | [28]
Pérez-Sanagustín et al. (2019) | Random forest, neural network, SVM polynomial and linear model | Behavior patterns of 572 learners in four MOOCs | Assists in accurately predicting students' grades using data from video lectures and exams of the course | [29]
Yan and Oliver (2019) | Three-layer feedforward neural network with the scaled conjugate gradient algorithm | 192 student records of the ELEC S305F Autumn 2018 course of OUHK | Assists teachers in identifying weak learners and providing timely help | [30]
Douglas et al. (2016) | K-means++ clustering algorithm | 337 learners' behavior and assessment data from a nanophotonic modeling MOOC | Assists in early prediction of dropouts in MOOCs | [31]

3 Critical Analysis

In this article, a systematic analysis of the applications of big data in education has been carried out. From the analysis, it is noticed that various techniques such as clustering, classification, and regression have been successfully utilized for the interpretation of big data in the educational system. This section presents a precise analysis of the applications of big data in the educational system, along with the percentage of articles published on distinct applications of big data in the education domain using various open-source tools and techniques.

3.1 Research Growth of Big Data in Education

Due to the latest advancements in the education system, a large amount of educational data is available in the current learning system. Therefore, to explore the complex phenomena in education and to improve the quality of education, the application of big data in education plays a vital role. Figure 2 shows the growth of publications on big data in education from 2011 to July 2021. From the figure, it can be noticed that there is an increase in the percentage of publications from 2011 to 2019. As the COVID-19 pandemic created an impact on the publication industry, a slight decrease is observed in the year 2020.

Table 5 Other works performed on the construction of curriculum using big data technology

Author and year | Framework/Technology | Application | Result | References
Hu et al. (2020) | Virtual reality technology | Construction of curriculum for science, technology, engineering, art and mathematics (STEAM) in primary and middle schools | Provides learners more diversified and pragmatic sensuous learning materials | [32]
Li et al. (2020) | MOOCs + SPOC | Construction of innovative curriculum for computer fundamentals | Enhances teaching effectiveness and develops students' habit of solving real-world problems | [33]
Liu (2020) | – | Design and evaluation of a humanistic curriculum system for nursing students | Allows health professionals to implement humanistic values in their services | [34]
Zhang et al. (2018) | Hadoop | Construction of mathematics education curriculum | Enhances the teaching reforms of mathematics and the efficacy of higher education | [35]
Jensen (2017) | CoNVO framework | Incorporating big data tools into the undergraduate MIS curriculum | Undergraduates can be provided with a single course covering essential prospects of data science | [36]

Fig. 2 Growth of publications in big data in education

3.2 Analysis of Distinct Applications of Big Data in Education

The quality of education can be improved by understanding the issues experienced by the students and by implementing appropriate strategies to address these challenges. Figure 3 depicts the percentage of articles published in various settings of the education system using big data, such as prediction of student performance and recommender systems for course selection. It is noticed from the figure that 34.03% of the articles have been published on intelligent course recommender systems at different levels of education. Next, the majority of articles have been published on the prediction of student performance to improve the quality of education (25.48%). Among the distinct applications, the least work has been published on the prediction of student dropout using big data, i.e., 2.19%.

Fig. 3 Distribution of articles in different applications of big data in education (prediction of student performance, prediction of student dropout, course recommender system, behavior detection, constructing course content)

3.3 Big Data in Education Using Various Open-Source Tools

To provide more customized services and enhanced efficiency, the analysis of big data needs to be performed using various open-source tools. Figure 4 depicts some of the open-source tools used in mining educational data.
The percentage of articles published on mining educational data using various open-source tools is represented in Fig. 5. From the figure, it is identified that 58.49% of the work has been published using the Orange tool, a Python-based tool for processing big data. Next, 18.47% of the work has been carried out using the Weka tool, a Java-based tool for processing big data. After Weka, 9.7% of the work has been carried out using the Hadoop framework, which allows distributed processing of datasets over clusters of computer networks. Next, 8% of the work has been done using MapReduce, which allows parallel processing across clusters of a computer network. It is also noticed from the figure that the least work (i.e., 5.2%) has been carried out using MongoDB, which uses JSON-like documents rather than a table-based architecture.

Fig. 4 Educational data mining using various open-source tools (Hadoop, MongoDB, Weka, MapReduce, Orange)

Fig. 5 Distribution of articles in educational data mining using various open-source tools

3.4 Role of Intelligent Educational Data Mining Techniques

To address the challenges encountered in the processing of big data in education, various techniques have been utilized, as shown in Fig. 6.
The percentage of articles published on big data in education using various techniques is depicted in Fig. 7. It is observed from the figure that the majority of the articles on big data in education have been published using the classification technique (i.e., 58.79%). Next, the regression technique has been used in addressing the processing challenges of big data in education. It is also observed that the least work has been published using nearest neighbor.

Fig. 6 Educational data mining using various techniques: regression (predicts the value of a dependent variable by analyzing the relationship between variables using statistical analysis); classification (predicts the category a value belongs to, based on previously categorized values); clustering (groups records of similar type by finding the distance between them in an n-dimensional space); nearest neighbor (predicts values by using the values of the records nearest to the record in question)

Fig. 7 Distribution of articles in educational data mining using various techniques (regression, clustering, classification, nearest neighbor)

4 Advantages of Big Data in Education

Big data in education gives extraordinary prospects for instructors to reach and train students in innovative ways. The various advantages of big data in the educational field include: (i) improvement of student results, (ii) customized programs, (iii) reduced dropouts, (iv) targeted international recruitment, (v) analytics for educators, (vi) new learning plans and an enhanced learning experience, (vii) career prediction, and (viii) credible grading, as shown in Fig. 8. Such data will permit organizations to regulate their recruitment policies and assign funds accordingly. Big data makes an advance toward a progressive structure where students are trained using interesting techniques. The impact of these applications of big data in education is explained briefly in the following sections.

4.1 Improvements of Student Results

With the application of big data in the education field, the whole educational system, along with students and parents, gathers the profits of this technology. Students' educational performance can be determined from their acquired results. Every individual student creates a unique data track, and examining this data track in real time aids in producing the best learning surroundings and understanding individual performance.

Fig. 8 Advantages of big data in education

4.2 Customizing Programs

Customized programs for individual students can be generated by using big data. Blended learning, an integration of both online and offline learning, gives the possibility to create customized programs. These can be generated for every student, providing them the opportunity to pursue classes they are interested in and to work at their own pace.

4.3 Reducing Dropouts

By taking the benefit of big data, a survey [37] has been done in which a huge online database was utilized to forecast the success or failure of a student. The study discovered that measures of pupils' performance that declined over time were major predictors of the probability of dropping out. Female students with earlier transfer credits or college education and older students have a higher risk of dropout. Nowadays, due to the improvement of student results using big data, the rate of dropouts at institutes is decreasing day by day.

4.4 Targeting International Recruitments

Here, educational sectors can more precisely forecast applicants and also examine
the feasible factors that influence the application process. Such information lets
institutions alter their recruitment policies and assign funds accordingly. This influx
of information helps students learn about other schools all over the world and also
speeds up the search process for international pupils.

Big Data in Education: Present and Future 735

4.5 Analytics of Educators

Due to the spread of data-driven systems, educators gather the greatest benefits of
big data analytics. Several institutions therefore generate varied learning experiences
in accordance with students' preferences as well as their learning abilities, and
various programs have been promoted to support individuals in selecting courses
according to their wishes.

4.6 Enhance the Learning Experience

The educator can attend to every individual student and commence a significantly
more interesting and deep discussion on the subject of preference. This gives each
student the chance to build a better comprehension of the respective subjects. It can
also improve the course plans and the digital reading material used by the students.
This progressive incorporation not only develops education principles but also builds
a space for a better society.

4.7 Career Prediction

Feedback on a student's performance helps to know and understand their improve-
ments, strengths and weaknesses. These feedback reports suggest the areas to which
students are paying attention and also help to identify the area in which a particular
student is seeking a profession. If a student is paying more attention to a particular
subject, then that student's decision must be valued and the student should be
supported to pursue what they want to follow.

4.8 Credible Grading

Here, big data offers educators a quicker and more suitable way to score and rate
student tests, essays and quizzes. With big data, all the educational sectors have a
consistent and credible system for marking large numbers of papers as well as
releasing the results sooner, making the procedure simpler for everyone. Big data
has brought major changes to several aspects of education; according to research, the
capability to examine educational systems is the most important change brought by
big data [4].

5 Challenges

Certainly, big data and other associated technologies like deep learning and
cloud computing are key to a successful education system. Regardless of the
promise of learning analytics, there has been doubt and uncertainty because of
challenges that ought to be addressed to attain the preferred learning outcomes [38].
Among all the opportunities and advantages, big data also brings a set of chal-
lenges to the educational sectors. Certainly, big data has resolved several problems
that educators struggled with; however, low Internet-use competencies, faulty
systems, etc. are among its negative impacts on education. Notable challenges
include a limited talent pool, scalability and storage issues, data errors and data
safety concerns.

5.1 Limited Talent Pool

The demand for trained data specialists is high because more and more areas keep
warming up to big data. Due to the lack of data science courses in most colleges,
only a few people have the required skills to ensure flawless implementation of big
data in the educational field. As a result, major educational institutes are unable to
utilize the technology and proper resources.

5.2 Scalability and Storage Issue

In a few cases, the speed at which information is gathered and scrutinized runs
ahead of the processing abilities of available big data machines. Slowdowns and
crashes are events that degrade the quality of the analysis and its outcomes. Due to
this, developers must come forward with extra storage systems as well as scalable
processing that sustain both current and upcoming needs.

5.3 Data Errors

As big data deals with huge amounts of data, there is a chance of losing some
data while placing the various datasets of the total student population into numerous
categories. This problem is quite common in cloud storage systems, which are
expensive to adopt and may require entirely new data.

5.4 Data Safety Concerns

Data safety is one of the biggest concerns regarding big data in the educational sector.
It is very costly to run active or constantly updating data securely. Proper guidelines
are required to guarantee data rights, and privacy issues should be taken care of so
that the information is protected from misuse.

6 Future Goals

Compared to the early data warehousing techniques, big data has emerged as a
reliable and challenging model [6]. In particular, it has been an efficient tool for
the e-learning industry. However, in spite of these ample provisions, researchers,
educators and learners still gain little visibility. Current educational analytics is
believed to serve as a sense-making factor in piloting uncertain change by
contributing understandable assessment data and analysis, exhibited through
user-controlled visualizations. In many applications of education, big data will be
quite helpful for the modern designing and planning of educational institutions as
well as a boost for computer-assisted learning systems. Big data can act as a
standalone system, serving as a replacement for statistical models and as a safeguard
against overfitting data. There is a huge opportunity for research on the modern
teaching–learning process, and big data is one of the important models for that.

7 Conclusion

As the data concerned in education grows larger, the purpose of big data tech-
niques becomes more essential in learning environments. Hence, attention has been
paid to big data in the educational sectors. Big data can decrease dropouts, provide
customized learning surroundings to users and enlarge long-term study plans. All
of this is feasible through the successful utilization and improvement of big data
analytics in the educational sectors. Despite a few drawbacks, this technique is
suitable and effective for many applications. As a result, the power of big data and
its usage in education are still hot topics of research. Though it is in its infancy, it
still has the possibility to be a game-changer in the upcoming future. With further
development, big data can be efficiently put to use and yield even more benefits for
both students and educators.

References

1. K. Sin, L. Muthu, Application of big data in education data mining and learning analytics—A
literature review. ICTACT J. Soft Comput. 5(4) (2015)
2. Wikipedia, “Big data—Wikipedia, The Free Encyclopedia”, https://en.wikipedia.org/w/index.
php?title=Big_data&oldid=669888993. Accessed (2015)
3. https://www.colocationamerica.com/blog/big-data-and-education. Accessed on 15 Apr (2021)
4. S. Ray, Big data in education. Gravity, Great Lakes Mag. 8–10 (2013)
5. W. Punlumjeak, N. Rachburee, J. Arunrerk, Big data analytics: Student performance prediction
using feature selection and machine learning on microsoft azure platform. J. Telecommun.
Electron. Comput. Eng. (JTEC) 9(1–4), 113–117 (2017)
6. A.S. Nagesh, C.V.S. Satyamurty, K. Akhila, Predicting student performance using KNN
classification in bigdata environment. CVR J. Sci. Technol. 13, 83–87 (2017)
7. A. Almasri, R.S. Alkhawaldeh, E. Çelebi, Clustering-based EMT model for predicting student
performance. Arab. J. Sci. Eng. 45, 10067–10078 (2020)
8. N. Varela et al., Student performance assessment using clustering techniques, in International
Conference on Data Mining and Big Data (Springer, Singapore, 2019). https://doi.org/10.1007/
978-981-32-9563-6_19
9. A. Hamoud, A.S. Hashim, W.A. Awadh, Predicting student performance in higher education
institutions using decision tree analysis. Int. J. Interact. Multimedia Artif. Intell. 5, 26–31 (2018)
10. R. Hasan et al., Student academic performance prediction by using decision tree algorithm, in
2018 4th International Conference on Computer and Information Sciences (ICCOINS). (IEEE,
2018). https://doi.org/10.1109/ICCOINS.2018.8510600
11. I. Singh, A.S. Sabitha, A. Bansal., Student performance analysis using clustering algorithm,
in 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence)
(IEEE, 2016). https://doi.org/10.1109/CONFLUENCE.2016.7508131
12. https://nces.ed.gov/fastfacts/display.asp?id=16. Accessed on 15 July 2021
13. J.S. Catterall, On the social costs of dropping out of school. High Sch. J. 71(1), 19–30 (1987)
14. S. Lee, J.Y. Chung, The machine learning-based dropout early warning system for improving
the performance of dropout prediction. Appl. Sci. 9(15), 3093 (2019)
15. W. Tenpipat, K. Akkarajitsakul, Student dropout prediction: A KMUTT case study, in 2020 1st
International Conference on Big Data Analytics and Practices (IBDAP) (IEEE, 2020). https://
doi.org/10.1109/IBDAP50342.2020.9245457
16. N. Wu et al., CLMS-Net: dropout prediction in MOOCs with deep learning, in Proceedings of
the ACM Turing Celebration Conference-China (2019). https://doi.org/10.1145/3321408.332
2848
17. V. Hegde, P.P. Prageeth, Higher education student dropout prediction and analysis through
educational data mining, in 2018 2nd International Conference on Inventive Systems and
Control (ICISC) (IEEE, 2018). https://doi.org/10.1109/ICISC.2018.8398887
18. L. Haiyang et al., A time series classification method for behaviour-based dropout prediction,
in 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT)
(IEEE, 2018). https://doi.org/10.1109/ICALT.2018.00052
19. C. Márquez-Vera et al., Early dropout prediction using data mining: a case study with high
school students. Expert Syst. 33(1), 107–124 (2016)
20. S. Dwivedi, V.S.K. Roshni, Recommender system for big data in education, in 2017 5th National
Conference on E-Learning and E-Learning Technologies (ELELTECH) (IEEE, 2017). https://
doi.org/10.1109/ELELTECH.2017.8074993
21. B. Mondal et al., A course recommendation system based on grades, in 2020 International
Conference on Computer Science, Engineering and Applications (ICCSEA) (IEEE, 2020).
https://doi.org/10.1109/ICCSEA49143.2020.9132845
22. B. Ma, Y. Taniguchi, S. Konomi, Design a course recommendation system based on association
rule for hybrid learning environments. Inf. Process. Soc. Jpn. 7 (2019)
23. S. Alghamdi, N. Alzhrani, H. Algethami, Fuzzy-based recommendation system for university
major selection, in IJCCI (2019). https://doi.org/10.5220/0008071803170324

24. H. Zhang et al., MCRS: A course recommendation system for MOOCs. Multimedia Tools
Appl. 77(6), 7051–7069 (2018)
25. J. Xiao et al., A personalized recommendation system with combinational algorithm for online
learning. J. Ambient Intell. Humanized Comput. 9(3), 667–677 (2018)
26. K.S. Na, Z. Tasir, Identifying at-risk students in online learning by analysing learning behaviour:
A systematic review, in 2017 IEEE Conference on Big Data and Analytics (ICBDA) (IEEE,
2017). https://doi.org/10.1109/ICBDAA.2017.8284117
27. T. Purwoningsih, H.B. Santoso, Z.A. Hasibuan, Online Learners’ behaviors detection using
exploratory data analysis and machine learning approach, in 2019 Fourth International Confer-
ence on Informatics and Computing (ICIC) (IEEE, 2019). https://doi.org/10.1109/ICIC47613.
2019.8985918
28. M. Al Fanah, M.A. Ansari, Understanding E-learners’ behaviour using data mining techniques,
in Proceedings of the 2019 International Conference on Big Data and Education (2019). https://
doi.org/10.1145/3322134.3322145
29. M. Pérez-Sanagustín et al., Analyzing learners’ behavior beyond the MOOC: An exploratory
study, in European Conference on Technology Enhanced Learning (Springer, Cham, 2019).
https://doi.org/10.1007/978-3-030-29736-7_4
30. N. Yan, O.T.-S. Au, Online learning behavior analysis based on machine learning. Asian Assoc.
Open Univ. J. (2019). https://doi.org/10.1108/AAOUJ-08-2019-0029
31. K.A. Douglas et al., Big data characterization of learner behaviour in a highly technical MOOC
engineering course. J. Learn. Anal. 3(3), 170–192 (2016)
32. X. Hu et al., Construction and application of VR/AR-based STEAM curriculum system in
primary and middle schools under big data background. J. Phys. Conf. Ser. 1624(3) (2020)
IOP Publishing, 2020
33. M. Li et al., The innovative curriculum construction of computer fundamentals course based
on SPOC+ MOOC in higher education, in 2020 15th International Conference on Computer
Science and Education (ICCSE) (IEEE, 2020). https://doi.org/10.1109/ICCSE49874.2020.920
1872
34. X. Liu, Research on the construction of humanities curriculum and evaluation system for
nursing students according to job needs based on big data technology analysis. J. Phys. Conf.
Ser. 1648(3) (2020) IOP Publishing
35. L. Zhang, X. Yang, Y. Zhang, The research on cloud platform construction of mathematics
education curriculum under big data background [C] (2018). https://doi.org/10.25236/iwass.
2018.056
36. S. Jensen, Integrating big data services into an undergraduate mis curriculum. Int. J. Syst. Serv.
Oriented Eng. (IJSSOE) 7(2), 58–73 (2017)
37. D. Niemi, E. Gitin, Using big data to predict student dropouts: Technology affordances for
research, in Proceedings from the International Association for Development of the Information
Society (IADIS) International Conference on Cognition and Exploratory Learning in Digital
Age (2012)
38. S.O. Fadiya, S. Saydam, E.J. Chukwuemeka, Big data in education; Future technology
integration. Int. J. Sci. Technol. 2(8), 65 (2014)
Breast Cancer Mammography
Identification with Deep Convolutional
Neural Network

Pandit Byomakesha Dash, H. S. Behera, and Manas Ranjan Senapati

Abstract Breast cancer is considered one of the most common invasive diseases
in women and a major cause of cancer death. One of the most effective diagnostic
methods for breast cancer is mammography. Over the last decade, breast cancer
researchers have suggested the use of intelligence-based techniques by medical
experts and radiologists. In particular, deep learning models such as the convolu-
tional neural network (CNN) provide effective performance in classifying mammo-
grams accurately, which can assist imaging specialists. For an accurate classification
of mammograms, the CNN model should be trained with a large number of labeled
mammograms, but labeled mammograms are not always available. The main purpose
of this experiment is to perform a highly accurate classification of mammograms
through a CNN with dense layers. In our study, we developed two classification
models: a proposed CNN-based model with a single dense layer and a CNN with two
dense layers. Here, the dense layers act as a backbone of the CNN model for the
accurate classification of mammograms. This work is intended to improve the
performance of the CNN with more dense layers, using preprocessed mammograms
with multiple views. The final experiments with our two proposed models show that
the first CNN model, with a single dense layer, obtains 100% accuracy for breast
cancer detection in 38.64 s of execution time; the second CNN model, with two
dense layers, achieves the same result in 42.64 s.

Keywords Deep learning · CNN · Mammogram · Breast cancer · CAD

P. B. Dash (B) · H. S. Behera · M. R. Senapati


Department of IT, Veer Surendra Sai University of Technology, Burla, Odisha 768018, India
M. R. Senapati
e-mail: mrsenapati_it@vssut.ac.in

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 741
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation,
Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_55

1 Introduction

The last two decades have witnessed drastic growth in the number of breast cancer
cases throughout the world. Most studies suggest that early detection and efficient
diagnosis can reduce the number of cases. In modern days, treatment has become
advanced due to the latest procedures such as advanced screening and electronic
imaging systems. Abnormal growth of tumor cells in the breast, which may be
malignant or benign, leads to breast cancer. For better treatment of breast cancer,
early detection is of great importance, and mammography is the most common and
cost-effective method for the early detection of breast cancer. Due to the poor clarity
and two-dimensional views of mammograms, analyzing them manually is a compli-
cated task. To make mammograms more significant and useful, many researchers
are working on different computational intelligence algorithms. Good-quality,
unambiguous mammographic images help many medical experts easily identify
whether a tumor is malignant or benign; when the tumor is benign, the expert recom-
mends further investigation, and histopathological analysis can reveal the type of the
tumor. Moreover, conventional mammography fails to confirm about 30% of breast
cancer cases [1]. Because younger women tend to have denser breasts, the detection
rate of breast cancer may fall for them. For such complex imaging analysis, machine
learning and deep learning have recently recorded good milestones, achieving
effective results and performance in the healthcare domain [2–7] over traditional
techniques.
Computer-aided diagnosis (CAD) can be a great support to doctors and researchers
for the early detection of breast cancer. The success of CAD depends completely on
how well the mammogram images are enhanced and how many relevant features can
be extracted from them [8]. Akselrod-Ballin et al. [9] used a combined approach of
machine learning and deep learning for early breast cancer detection applied to
digital mammographic images. They used an XGBoost classifier to select the top
symptomatic features of breast cancer and a deep neural network as the computa-
tional model. They obtained an area under the receiver operating characteristic curve
(ROC-AUC) of 0.91 with a specificity of 77.3% and a sensitivity of 87%.
[8] have considered four convolutional network models (ResNet-ResNet, ResNet-
VGG, VGG-VGG, and VGG-ResNet) for the detection of breast cancer on screening
mammograms. They have ensembled these four models for obtaining a high accuracy
rate of classification on heterogeneous mammographic platforms. They obtained an
AUC of 0.95 for the individual model. However, four-model averaging improved the
AUC to 0.98 (sensitivity: 86.7%, specificity: 96.1%).
Among all deep learning models, the convolutional neural network is the most
effective for image analysis. For a medical expert, it is an essential task to observe
the whole mammogram for tumor-like lesion screening and density classification.
Charan et al. [10] employed a CNN model to discriminate between normal and
abnormal mammograms. The model was able to reach 65% accuracy by segmenting
the breast area using morphological processes, which greatly improved its classifi-
cation performance. Samala et al. [11] classified benign and malignant mammograms
using two CNN models, namely AlexNet-C1 and GoogLeNet-C12. They observed
during the experiment that fixing the parameters of some CNN layers helps to
increase the classification performance and makes the model robust. They obtained
an AUC of 1 at over 1000 epochs.
Arora et al. [12] have considered five CNN architectures (AlexNet, VGG16,
ResNet, GoogLeNet, and InceptionResNet). They combined them as an ensemble
model to deal with mammographic classification. They achieved an accuracy of
0.88 with an AUC of 0.88. Chougrad et al. [13] developed a deep convolutional
neural network with the best fine-tuning strategy for the analysis of mammograms.
They have taken three different models (VGG16, ResNet50, Inception v3) for the
experiment, and the proposed model obtained the best result: 98.23% accuracy and
0.99 AUC. By using the image patches, the efficiency of feature extraction can be
improved to a certain extent. But it does not always help the radiologists to find the
abnormalities in a large volume of mammograms.
In our proposed study, we analyze a breast cancer detection technique based on
deep learning. This paper investigates the importance of the dense layer in CNN
instead of choosing the number of dense layers at random. Here, we explore the
impact of fine-tuned dense layers and experimentally evaluate the adopted CNN
architecture along with the execution time. The paper is organized as follows:
Sect. 2 outlines the suggested CNN classification model along with the full workflow
and design structure; Sect. 3 briefly describes data preparation and the experimental
setup; Sect. 4 details the experimental results and analysis; and Sect. 5 concludes.

2 Methodology

In the trending world of artificial intelligence, the CNN is among the most powerful
neural network models for image recognition, image processing, and image classi-
fication; trending areas where CNNs are widely used include object detection and
face recognition. A CNN is a modern artificial neural network specifically designed
to process pixel data. In image classification, the CNN model takes images as input,
processes them, and classifies them into the proper classes. This type of model treats
each input image as an array of pixels whose size depends on the image resolution.

2.1 Proposed CNN Architecture

Most of the CNN models are built with a series of convolution layers consisting
of filters (Kernels), pooling, fully connected layers (FC), and activation functions
according to requirement. The CNN architecture is shown in Fig. 1.
The preprocessed images, after resizing and normalization, are used for training the
CNN model. A commonly used input image size is 224 × 224 × 3. The second layer
of the proposed CNN model is the convolution layer, which is responsible for
extracting potential features from the images. The convolutional layer contains a set
of filters (kernels) of size 3 × 3. During convolution, each filter slides over the image
and dot products are computed on the pixel values; the convolved features obtained
this way are the output of the convolution process. On the obtained convolved
features, a pooling operation is performed for dimensionality reduction. Although
many pooling techniques are used by researchers, such as max-pooling, average
pooling, and global pooling, we have used max-pooling [Eq. (1)] in this proposed
work; average pooling is given in Eq. (2) for comparison.

f_max(x) = max_t x_t (1)

f_avg(x) = (1/N) Σ_{t=1}^{N} x_t (2)

Fig. 1 Basic CNN architecture



R(z) = z, if z > 0; 0, if z ≤ 0 (3)

S(z_i) = e^{z_i} / Σ_j e^{z_j} (4)

The ReLU function [Eq. (3)] is used for both the convolutional layer and the pooling
layer. Finally, the features extracted by the pooling process are used to perform the
classification task through a fully connected layer, in which the output-layer nodes
are directly connected to the previous layer's nodes. We use the softmax activation
function [Eq. (4)] in the output layer to compute the class label. The selection of
appropriate activation functions is one of the crucial parts of the model, as it controls
the learning process of the model's network. Many alternative activation functions,
such as sigmoid, parametric ReLU, and hyperbolic tangent, are used by researchers
as per their requirements. In this paper, two architectures of the CNN model with
the Adam optimization algorithm have been used for detecting the type of breast
cancer on the breast cancer dataset. In our first architecture, the CNN has a single
dense (hidden) layer, and in the second, the CNN has two dense layers. These two
architectures are shown in Figs. 2 and 3, respectively. The complete workflow of this
proposed work is shown in Fig. 4.
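For concreteness, Eqs. (1)–(4) can be written out in NumPy. This is a minimal illustrative sketch, not code from the chapter; the function names and the 2 × 2 pooling window are our assumptions:

```python
import numpy as np

def relu(z):
    # Eq. (3): R(z) = z for z > 0, else 0
    return np.maximum(z, 0.0)

def softmax(z):
    # Eq. (4): output-layer activation used to compute the class label
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def max_pool(x, k=2):
    # Eq. (1): keep the maximum of each k x k window (stride k)
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

def avg_pool(x, k=2):
    # Eq. (2): average of each k x k window (stride k)
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).mean(axis=(1, 3))
```

For a 4 × 4 feature map with `k=2`, `max_pool` keeps one value per 2 × 2 block, halving each spatial dimension, which is the dimensionality reduction described above.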

Fig. 2 Architecture of CNN with a single dense layer

Fig. 3 Architecture of CNN with two dense layers

3 Experimental Setup and Data Preparation

The experiments have been carried out on a Dell laptop with Windows 8.1 Pro
64-bit OS, an Intel Core (TM) i5-6700 CPU @ 3.40 GHz (8 CPUs), and 4 GB RAM.
The proposed models have been developed and tested in a Python TensorFlow
environment with the Spyder IDE. The programming environment includes OpenCV
for image processing, which converts the raw images to a machine-readable format
in the form of a NumPy array. Other important packages such as TensorFlow, Keras,
NumPy, Matplotlib, OS, Time, Random, and Pillow are used in this experiment.
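A minimal sketch of the preprocessing step (resizing to the 224 × 224 × 3 input size and normalizing pixel values, per Sect. 2.1). It uses Pillow and NumPy from the package list above; the function name and the [0, 1] scaling choice are our assumptions, not taken from the chapter:

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size=(224, 224)) -> np.ndarray:
    """Resize a mammogram and scale pixel values to [0, 1] for the CNN input."""
    img = img.convert("RGB").resize(size)  # 224 x 224 x 3, as in Sect. 2.1
    return np.asarray(img, dtype=np.float32) / 255.0
```

The returned array can then be stacked into a batch and fed to the CNN.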

3.1 Dataset Overview

For our experiment, we have considered the benign breast tumor dataset collected
from IEEE DataPort [14], a collection of information on 83 patients from India. The
dataset includes information such as each patient's clinical history, histopatholog-
ical features, and mammograms. The task is to classify the patients into ten subclasses
of benign tumors from their clinical history, histopathological features, and mammo-
grams.

Fig. 4 Workflow of proposed CNN model



4 Experiment and Results

The proposed CNN models have been tested on a breast cancer dataset, and the
results and observations are presented in this section. The performance of the
proposed models has been improved by increasing the dense layers in the fully
connected part of the CNN. We have considered two proposed architectures for a
comparative study; in particular, the considered models help analyze the difference
in execution time with various dense layers.
In this experiment, many performance measures are considered such as precision
(Eq. 5), recall (Eq. 6), F1-Score (Eq. 7), and ROC-AUC to estimate the efficiency of
our proposed models. The performances of the proposed models have been measured
in these metrics. Equation (5) defines precision as the ratio of correctly detected
positive cases to the total number of predicted positive cases. As illustrated in Eq. (6),
recall is the ratio of the number of true positives to the sum of true positives and
false negatives. Precision and recall together define the F1-score in Eq. (7).

Precision = TP / (TP + FP) (5)

Recall = TP / (TP + FN) (6)

F1-score = 2 × TP / (2 × TP + FP + FN) (7)
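Equations (5)–(7) translate directly into code. A small sketch (the function name and the example counts are illustrative, not from the paper):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute Eqs. (5)-(7) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)        # Eq. (5)
    recall = tp / (tp + fn)           # Eq. (6)
    f1 = 2 * tp / (2 * tp + fp + fn)  # Eq. (7)
    return precision, recall, f1
```

For example, `precision_recall_f1(8, 2, 2)` returns `(0.8, 0.8, 0.8)`.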

The comparative analysis of the performances of the studied models in terms of
performance metrics and execution time is presented in Table 1. It is noticeable that
both suggested models are efficient for mammogram classification. However, the
CNN with a single dense layer (consisting of 32 neurons) has an F1-score of 1.0 with
an execution time of 38.64 s, and the CNN with two dense layers (consisting of 64
neurons) has an F1-score of 1.0 with an execution time of 42.64 s. So increasing the
number of layers along with the number of nodes takes more time than the model
with fewer neurons.
Figures 5 and 6 show the accuracy versus number of epochs curves of the studied
CNN models. It can be visualized that the training accuracy of these models is initially

Table 1 Results on performance metrics of prediction models

Prediction models                     Precision  Recall  F1-score  ROC-AUC  Execution time (s)
CNN with single layer (32 neurons)    1.0        1.0     1.0       1.0      38.64
CNN with two layers (64 neurons)      1.0        1.0     1.0       1.0      42.64
Proposed model                        1.0        1.0     1.0       1.0      42.64

Fig. 5 Accuracy vs. number of epochs for CNN with single dense layer

Fig. 6 Accuracy vs. number of epochs for CNN with two dense layers

low in epoch 1 and gradually increases toward epoch 30, finally reaching around
95% for both proposed models. The test accuracy curve is quite different: during
the testing phase, the test accuracy reaches 100% from epoch 20 onwards without
fluctuation, with an execution time of 38.64 s, for the proposed CNN with a single
dense layer. For the proposed CNN with two dense layers, 100% accuracy is reached
from epoch 7 onwards without fluctuation, with an execution time of 42.64 s.
Figures 7 and 8 show the loss curves for both proposed CNN models. It has been
observed that the training loss gradually decreases to 0.25 after 17 epochs for the
CNN with a single dense layer; for the CNN with two dense layers, the training loss
reaches 0.19 after 10 epochs. The test loss reaches 0.05 at epoch 10 for the CNN
with two dense layers and 0.07 at epoch 18 for the CNN with a single dense layer.
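Accuracy and loss curves like those in Figs. 5–8 can be drawn from a training history with a short Matplotlib sketch. This is an illustration only; the `history` dict mimics the shape of Keras' `History.history`, and any values passed in are the caller's, not the paper's:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for scripts
import matplotlib.pyplot as plt

def plot_curves(history, metric="accuracy"):
    """Plot train vs. test curves from a dict like Keras History.history,
    e.g. {'accuracy': [...], 'val_accuracy': [...]}."""
    epochs = range(1, len(history[metric]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history[metric], label="train")
    ax.plot(epochs, history["val_" + metric], label="test")
    ax.set_xlabel("Epoch")
    ax.set_ylabel(metric.capitalize())
    ax.legend()
    return fig, ax
```

The same function plots the loss curves by calling it with `metric="loss"`, provided the history dict contains `loss` and `val_loss` series.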
Moreover, we have conducted ROC analysis on both classification models, the CNN
with a single dense layer and the CNN with two dense layers; the curves are presented

Fig. 7 Loss vs. number of epochs for CNN with single dense layer

Fig. 8 Loss vs. number of epochs for CNN with two dense layers

in Figs. 9 and 10, respectively. The CNN with two dense layers is more significant
and robust in classifying the breast cancer types.

5 Conclusion

Deep learning-based methods are considered the most significant approaches for
solving medical-domain problems, outperforming traditional machine learning algo-
rithms. Such models have enormous potential to improve the detection rate of breast
cancer by screening mammography. Our approach may help the future development
of a superior system like CAD, which will help the radiologist identify the most
suspicious cases with high priority. In this chapter, using CNN, we have developed
two models for breast cancer mammography classification. The developed methods
achieve an accuracy of 100% in classifying the mammogram images with an execu-
tion time of 38.64 s using the CNN with a single dense layer, and 100% accuracy
with an execution time of 42.64 s for the CNN with two dense layers. This clearly
shows that the dense layer plays a vital role in CNN architecture for better classifi-
cation. In the future, we plan to develop a more efficient intelligence-based model
to identify wrong indications in mammography images, early symptoms of dense-
breast-related issues, and dangerous invasive lobular carcinoma.

References

1. A. Bhale, M. Joshi, Automatic sub classification of benign breast tumor, in Smart Trends in
Systems, Security and Sustainability (Springer, Singapore, 2018), pp. 221–232. https://doi.org/
10.1007/978-981-10-6916-1_20

2. V. Gulshan, L. Peng, M. Coram et al., Development and validation of a deep learning algorithm
for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22), 2402–2410
(2016). https://doi.org/10.1001/jama.2016.17216
3. A. Esteva, B. Kuprel, R.A. Novoa et al., Dermatologist-level classification of skin cancer with
deep neural networks. Nature 542(7639), 115–118 (2017) [Published correction appears in
Nature; 546(7660): 686 (2017)]. https://doi.org/10.1038/nature21056
4. B.E. Bejnordi, M. Veta, J.P. van Diest et al. Diagnostic assessment of deep learning algorithms
for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–
2210 (2017). https://doi.org/10.1001/jama.2017.14585
5. P. Rajpurkar, J. Irvin, K. Zhu et al., CheXNet: Radiologist-level pneumonia detection on
chest X-rays with deep learning. arXiv:1711.05225 [cs, stat]. http://arxiv.org/abs/1711.05225.
Published November 2017. Accessed 10 Sept 2018. https://doi.org/10.1371/journal.pmed.100
2686
6. G. Litjens, T. Kooi, B.E. Bejnordi et al., A survey on deep learning in medical image analysis.
Med. Image Anal. 42, 60–88 (2017). https://doi.org/10.1016/j.media.2017.07.005
7. C.D. Lehman, A. Yala, T. Schuster et al., Mammographic breast density assessment using deep
learning: Clinical implementation. Radiology 290(1), 52–58 (2019). https://doi.org/10.1148/
radiol.2018180694
8. L. Shen et al., Deep learning to improve breast cancer detection on screening mammography. Sci.
Rep. 9(1), 1–12 (2019). https://doi.org/10.1038/s41598-019-48995-4
9. A. Akselrod-Ballin et al., Predicting breast cancer by applying deep learning to linked health
records and mammograms. Radiology 292(2), 331–342 (2019). https://doi.org/10.1148/radiol.
2019182622
10. S. Charan, M.J. Khan, K. Khurshid, Breast cancer detection in mammograms using convo-
lutional neural network, in Processing of the 2018 International Conference on Computing,
Mathematics and Engineering Technologies (iCoMET) (Sukkur, Pakistan, 2018), 3–4 Mar
2018. https://doi.org/10.1109/ICOMET.2018.8346384
11. R.K. Samala, H. Chan, L.M. Hadjiiski, M.A. Helvie, C.D. Richter, Generalization error analysis
for deep convolutional neural network with transfer learning in breast cancer diagnosis. Phys.
Med. Biol. 65(10), 1–13 (2020). https://doi.org/10.1088/1361-6560/ab82e8 (PMID: 32208369)
12. R. Arora, P.K. Rai, B. Raman, Deep feature-based automatic classification of mammo-
grams. Med. Biol. Eng. Comput. (2020). https://doi.org/10.1007/s11517-020-02150-8
(PMID:32200453)
13. H. Chougrad, H. Zouaki, O. Alheyane, Deep convolutional neural networks for breast cancer
screening. Comput. Methods Programs Biomed. 157, 19–30 (2018). https://doi.org/10.1016/j.
cmpb.2018.01.011 (PMID: 29477427)
14. https://ieee-dataport.org/open-access/benign-breast-tumor-dataset. Accessed on 24 Apr 2020
CatBoosting Approach for Anomaly
Detection in IoT-Based Smart Home
Environment

Dukka Karun Kumar Reddy and H. S. Behera

Abstract With the advances in technology, IoT devices have become an integral part of our daily lives due to their rapid expansion and deployment. Because IoT devices communicate continuously, concerns arise about privacy and security due to vulnerabilities that attackers can exploit. The raw observations of sensor nodes influence decision-making in IoT networks, so an established method is required to monitor and analyze the data from the sensor nodes. Therefore, the data-collecting nodes in IoT systems must resist attacks through anomaly detection implemented on them. This paper proposes a CatBoosting approach to perform intelligent and adaptive anomaly detection for smart home environment devices. The proposed approach assists in supervising the data with normal and abnormal activities, with enhanced resource management. The DS2OS dataset, with several attacks on the IoT environment, is considered to evaluate the effectiveness of the proposed anomaly-based detection system. Moreover, various researchers' existing approaches on the DS2OS dataset are studied and briefly described alongside the proposed method. Finally, the performance of the simulation model is evaluated through different metrics.

Keywords IoT · CatBoosting · DS2OS · Machine learning

1 Introduction

The IoT integrates computing devices with a network of heterogeneous and resource-constrained nodes connected wirelessly to communicate over the Internet in real-time. The IoT is a smart network architecture used to exchange information with agreed protocols without human intervention. IoT uses unique addressing protocols
to interact and cooperate with things (objects) to create new application services.

D. K. K. Reddy (B) · H. S. Behera
Department of Information Technology, Veer Surendra Sai University of Technology, Burla 768018, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
J. Nayak et al. (eds.), Computational Intelligence in Data Mining, Smart Innovation, Systems and Technologies 281, https://doi.org/10.1007/978-981-16-9447-9_56

The stay-connected characteristic of IoT allows for a connection anytime and anywhere for users through coordinated, integrated, monitored, and controlled computing and communication systems [1]. The IoT devices are mounting swiftly and will attain
about twenty-one billion connected devices. Due to the large scale and heterogeneous character of these objects, IoT devices exchange massive amounts of information, which increases the probability of being affected by malicious attackers [2]. As IoT is allied with a wide range of protocols, applications, and platforms, such as health monitoring, smart house, smart city, smart grid, smart environment, and so on, it provides a convenient place for malicious users to launch attacks without hindrance. Perception, transportation, and application layers are the three levels that make up an IoT security framework [3]. Perception nodes (e.g., sensors) are used to collect data in the perception layer. The critical security problems in this layer are secure communication between nodes and lightweight authentication. The transportation layer has three sublayers: (i) access network, (ii) core network, and (iii) local area network. The transportation layer provides ubiquitous access to the perception layer by using wireless networks, ad-hoc networks, and other methods. As a result, various attacks, such as information leaks, network disablement, and DoS attacks, become common security challenges in this layer. To address these issues, an attack detection and prevention method could be implemented before the entire layer suffers a significant loss. Moreover, IoT devices do not have malware or virus protection software due to their low-memory and low-power nature [4, 5]. Malicious activity makes IoT devices profoundly susceptible to becoming bots and carrying out attacks on other network devices. The root causes of current IoT threats and significant breaches in IoT security and privacy are their ubiquity, mobility, unattended operation, constrained resources, interdependence, myriad scale, diversity, and intimacy [6, 7].
A rise in potential security threats could result from the increased distribution of such smart devices. Hence, a dependable, smart, and secure system for detecting cyber-attacks and recovering automatically is required. One of the main goals of IoT technologies is to monitor the environment and detect anomalies, changes, or drifts in it [8]. Our goal is to identify new events that previous models do not describe, to maximize IoT adoption. Anomaly detection is used in smart home applications to spot abnormal activity in high-dimensional data. This research aims to offer a boosting-based learning method to perform anomaly detection when smart home IoT devices are in an aberrant state. The following are the key contributions of this article:
(1) We demonstrate and implement a CatBoosting ensemble learning-based
anomaly detection technique on categorical IoT traffic traces datasets using
multiclass rather than binary classification.
(2) We show that ensemble learning models outperform traditional machine
learning models in attack and anomaly detection systems in IoT contexts.
(3) We compare the performance of our method to that of other methods on the
DS2OS dataset.
(4) On the categorical dataset, the proposed method performs optimally for anomaly
detection with low false positives.

2 Related Works

Anomaly detection in IoT environments is one of the most critical issues requiring immediate attention in the IoT domain. Several researchers have been striving to secure the IoT environment using various attack detection algorithms. This section reviews the performance of various researchers' approaches on the DS2OS dataset, summarizes their studies, and gives some brief information on attack detection in IoT networks.
Hasan et al. [9] performed anomaly detection for IoT infrastructure by comparing the performance of several ML models: artificial neural network (ANN), support vector machine (SVM), logistic regression (LR), decision tree (DT), and random forest (RF). Though these ML strategies are almost equally accurate, other measurements show that RF performs best. Vangipuram et al. [10] proposed a novel imputation approach to impute missing values in the data. The traditional ML classifiers' performance is investigated on imputed datasets generated by the proposed imputation, F-K-means, and K-means methods. The ML classifiers performed well in categorizing the attacks through the proposed imputation process: the accuracy is 99% for the malicious control class, while the wrong setup, data type probing, scan, DoS, spying, and malicious operation classes achieve 100% accuracy. Dash et al. [11] proposed an IoT-based security framework using an adaptive boosting algorithm with the synthetic minority oversampling technique (SMOTE). The proposed work first handles the imbalanced nature of the data through SMOTE, and then adaptive boosting is used for multiclass anomaly detection. The proposed framework derives 100% accuracy in identifying normal and abnormal
activity. Cheng et al. [12] propose a hierarchical stacking temporal convolutional network with semisupervised learning. The semisupervised technique combines unsupervised and supervised learning, where the unlabeled data is trained using a small portion of labeled data. The experimental results reveal that the proposed method improves anomaly detection performance, with an accuracy of 98.22% and improved efficiency. Latif et al. [13] proposed a lightweight random neural network (RaNN) prediction model to forecast the DS2OS attacks. According to the evaluation results, the RaNN model, an advanced scheme of ANN, obtains an accuracy of 99.20% and outperforms traditional ML classifiers (ANN, SVM, and DT). Reddy et al. [14] proposed an examination of deep learning (DL) neural networks for anomaly detection systems to classify system behavior into truthful and untruthful actions; the DL models show performance comparable to machine learning algorithms. The studied DS2OS dataset is pre-processed by removing the NaN instances, and the remaining noisy data is encoded into numerical data. The proposed approach derives an accuracy of 98.28% in categorizing anomalies. Sahu and Mukherjee [15] conducted a comparative study of the LR and ANN classification algorithms. The classification algorithms are trained on the complete dataset and tested after removing the feature value. The proposed approach derives an accuracy of 99.99%

in classifying normal and abnormal behavior. Islam et al. [16] proposed a study comparing shallow models and deep networks for different IoT threats, where the shallow models include DT, RF, and SVM, and the DL models include deep belief networks, deep neural networks (DNN), long short-term memory (LSTM), stacked LSTM, and bidirectional LSTM. Among the shallow models, SVM produces an accuracy of 99.44%, and among the DL models, bidirectional LSTM produces an accuracy of 99.39%. Kumar et al. [17] proposed a novel intrusion detection system with
distributed ensemble design using fog computing. The framework is a combination
of two levels. The first level consists of individual learners with k-NN, Gaussian
naive Bayes, and XGBoost. RF uses the first level prediction results for final classifi-
cation. For most of the attacks, the proposed work shows a detection rate of 99.99%.
Singh and Singh [18] proposed a hyperparameter-tuned gradient boosting (GB) algorithm. Initially, a feature selection procedure is used to reduce the dataset's dimensionality, which improves attack and anomaly detection. The GB algorithm is then used with hyperparameter modification to achieve the best results. The model outperforms the competition in identifying attacks on IoT sensor environments, with an accuracy of 99.40%. Bokka and Sadasivam [19] proposed a DL-based DNN to detect attacks in the IoT-based smart home environment; the proposed model derives an accuracy of 99.42%. For classifying and addressing various attacks and abnormal operations on the network, Reddy et al. [20] proposed a supervised meta-algorithm-based approach known as bagging. To improve efficiency on the DS2OS dataset, the proposed work combines bagging with several ML models: k-NN, DT, RF, and the extra trees classifier. This work estimates the overall experimental performance and evaluation of the simulation model with an accuracy of 99.9%.

3 Proposed Methodology

This section describes the working of the Boosting and CatBoosting approaches.

3.1 Boosting

Traditionally, developing an ML application involves training a single learner. Ensemble methods instead use many learners to outperform any single learner individually. This approach creates a stronger, aggregated model from a group of weak learners. To uncover weak rules, a traditional ML algorithm is applied repeatedly to differently distributed data: the base learning method is used at each iteration, generating a new weak prediction rule, and the boosting method then combines the numerous weak rules into a single powerful prediction rule. Through weighting or filtering of the data, the weak learners are trained sequentially; at each iteration, the weights of the observations that the previous learners classified poorly are increased.
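The sequential reweighting described above can be sketched with a minimal AdaBoost-style implementation over one-dimensional decision stumps. This is our own toy illustration, not the chapter's code: the data, the stump learner, and the number of rounds are all illustrative assumptions.

```python
import math

def stump_predict(x, t, p):
    # A 1-D decision stump: predict +p for x > t, otherwise -p.
    return p if x > t else -p

def adaboost_train(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                              # uniform initial weights
    thresholds = [v - 0.5 for v in range(int(min(X)), int(max(X)) + 2)]
    ensemble = []                                  # list of (alpha, t, p)
    for _ in range(rounds):
        # Pick the stump with the lowest *weighted* error on current weights.
        best, best_err = None, float("inf")
        for t in thresholds:
            for p in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, t, p) != yi)
                if err < best_err:
                    best_err, best = err, (t, p)
        if best_err >= 0.5:                        # no better than chance: stop
            break
        t, p = best
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-12))
        ensemble.append((alpha, t, p))
        # Boost weights of misclassified points, shrink the correctly classified.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, p))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    # Weighted vote of the weak learners.
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On the toy data below, no single threshold stump can separate the labels, but three boosted stumps reach zero training error, which is the point of combining weak rules into one strong rule.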

3.2 Cat Boosting

CatBoost combines the words Category (it processes categorical features more effectively and efficiently than most ML algorithms) and Boosting (it derives from the gradient boosting algorithm). It produces state-of-the-art results without the extensive data training that other ML approaches require. CatBoost is an ordered boosting method, a permutation-driven alternative to the conventional gradient boosting technique, implemented with decision trees as base predictors.

CatBoost creates independent random permutations of the training dataset. Using several permutations decreases the variance of the final model's predictions compared to the general boosting algorithm. These permutations are used to evaluate the splits that define the tree structures and to choose the leaf values of the obtained trees. Tree building in CatBoost uses oblivious decision trees as base predictors, i.e., the same splitting criterion is applied across an entire level of the tree, which makes the tree less prone to overfitting, keeps it balanced, and significantly speeds up execution at testing time. The paper [21] provides a full description of the algorithms.
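The ordered, permutation-driven idea can be illustrated with a sketch of CatBoost-style ordered target statistics for a single categorical feature. This is a simplified, single-permutation illustration: each value is encoded using only the examples that precede it in the permutation, which is how CatBoost avoids target leakage; the prior and smoothing weight here are assumed values, and the real algorithm averages over several permutations.

```python
def ordered_target_encoding(categories, targets, prior=0.5, a=1.0):
    """Encode each categorical value with a smoothed running mean of the
    target, computed only over *earlier* examples in the permutation."""
    sums, counts = {}, {}
    encoded = []
    for c, t in zip(categories, targets):
        s = sums.get(c, 0.0)          # target sum of earlier examples with category c
        n = counts.get(c, 0)          # count of earlier examples with category c
        encoded.append((s + a * prior) / (n + a))
        sums[c] = s + t               # only now does this example enter the statistics
        counts[c] = n + 1
    return encoded
```

Note that the first occurrence of every category is encoded as the prior, and repeated categories drift toward their observed target mean without ever seeing their own label.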

4 Dataset Description

The IoT DS2OS dataset is publicly available from the Kaggle website [22]. The
dataset is different from conventional network datasets because the traces are captured from the application layer of the DS2OS IoT environment. The environment is architected as different simulated IoT sites with lighting control sensors, movement sensors, temperature control sensors, battery sensors, door lock sensors, heating control sensors, washing machine sensors, and questing service sensors spread across different locations. The dataset consists of 13 features and 357,952 data instances of categorical data, comprising numerical and nominal values plus a timestamp feature. The timestamp is discrete, so this feature is excluded. The paper [14] provides an in-depth description of the dataset, its features, and their datatypes.

Table 1 Percentage distribution of anomalies

Attacks              Total instances  Aggregated data (%)  Anomalous data (%)
Wrong setup          122              0.03                 1.21
Data probing         342              0.09                 3.41
Spying               532              0.14                 5.31
Malicious operation  805              0.22                 8.03
Malicious control    889              0.24                 8.87
Scan                 1547             0.43                 15.44
DoS                  5780             1.61                 57.70
Normal               347,935          97.24                –
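The per-attack percentages in Table 1 can be reproduced directly from the instance counts. A small sketch follows; the table appears to truncate, rather than round, to two decimal places, so the helper below does the same.

```python
TOTAL = 357_952  # all instances in the DS2OS dataset
ATTACKS = {
    "Wrong setup": 122, "Data probing": 342, "Spying": 532,
    "Malicious operation": 805, "Malicious control": 889,
    "Scan": 1547, "DoS": 5780,
}

def truncate(x, digits=2):
    # Truncate (not round) to the given number of decimal places.
    f = 10 ** digits
    return int(x * f) / f

anomalous_total = sum(ATTACKS.values())  # all anomalous rows
aggregated = {k: truncate(100 * v / TOTAL) for k, v in ATTACKS.items()}
share = {k: truncate(100 * v / anomalous_total) for k, v in ATTACKS.items()}
```

For example, the 5780 DoS rows are 1.61% of the whole dataset but 57.70% of the 10,017 anomalous rows.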

4.1 Data Preprocessing

Data is said to be unclean if it comprises missing attribute values, noise, outliers, or duplicates, which degrade the quality of the results. Data preprocessing is critical for addressing and prioritizing characteristics, as they directly impact the algorithm's success rate, and it improves productivity by enabling more reliable and effective decisions. The attributes Accessed Node Type and Value hold missing values, recorded as NaN: 148 and 2050 missing values, respectively, which are pre-processed accordingly. To help the classifiers improve accuracy, the Accessed Node Type and Value characteristics are transformed into significant continuous values. Table 1 shows the percentage distribution of anomalies from the Normality attribute. The Accessed Node Type entries with NaN are replaced by /sensorService. The Value entries with NaN and other noisy data are also replaced to preserve the valuable information; the realistic noise patterns and the values used to replace them are shown in Table 2. Finally, during the preprocessing stage, the categorical data is converted into numeric data.
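A minimal sketch of these preprocessing steps follows. The column names come from the dataset description above, the /sensorService fill value is the one stated in the text, and the simple first-seen label encoding is an illustrative simplification of the actual Table 2 replacements.

```python
SENSOR_DEFAULT = "/sensorService"  # fill value for missing Accessed Node Type

def preprocess(rows):
    """Fill missing Accessed Node Type values, then label-encode every
    column by assigning integer codes in order of first appearance."""
    for row in rows:
        if row.get("Accessed Node Type") is None:
            row["Accessed Node Type"] = SENSOR_DEFAULT
    columns = list(rows[0])
    encoders = {c: {} for c in columns}   # per-column value -> integer code
    encoded = []
    for row in rows:
        out = {}
        for c in columns:
            mapping = encoders[c]
            v = row[c]
            if v not in mapping:
                mapping[v] = len(mapping)  # next unused code
            out[c] = mapping[v]
        encoded.append(out)
    return encoded
```

The encoding is deterministic for a fixed row order, so repeated categorical values always map to the same integer.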

5 Result Analysis

It is critical to test the procedure to ensure that the visualization and analysis of the dataset are adequate. The result analysis of the simulation work on the DS2OS dataset reveals the undetected anomaly cases. This section presents the result analysis of the proposed CatBoosting algorithm. The dataset consists of
multi-classification with 357,952 instances. 80% of the data is used for training (i.e.,
286,361), and 20% of the data (i.e., 71,591) is used for testing. The evaluation measures taken into consideration are the confusion matrix, accuracy, FPR, F1-score, recall/TPR, and precision, to validate the performance of the suggested technique.

Table 2 Pre-processing of categorical data

Normality            Value               Total   Filled with
Scan                 False               49      0
Scan                 NAN                 16      1
Malicious operation  NAN                 148     1
DoS                  NAN                 1780    0
Probing              Twenty              200     20
Probing              NAN                 100     20
Wrong setup          True                28      1
Wrong setup          False               164     0
Malicious control    False               47      0
Malicious control    True                829     1
Normal               None                106     1
Normal               Address parameters  11      1
Normal               False               25706   0
Normal               NAN                 6       1

The derived results for the testing instances are shown in Table 3. All the
classes are categorized precisely except the DoS attack: 734 instances of the DoS class are correctly identified, and the remaining 380 instances are classified as the normal class. Figures 1, 2 and 3 show the ROC curve, precision-recall curve, and confusion matrix of the proposed work. A comparative study of various researchers' works on the DS2OS dataset, with various evaluation metrics, is shown in Table 4.

Table 3 Evaluation measures using CatBoosting

Evaluation  DoS     Data     Malicious  Malicious  Scan    Spying  Wrong   Normal
measures            probing  control    operation                  setup
TP          734     74       169        143        322     107     26      69,636
TN          70,477  71,517   71,422     71,448     71,269  71,484  71,565  1575
FP          0       0        0          0          0       0       0       380
FN          380     0        0          0          0       0       0       0
Recall      0.66    1.0      1.0        1.0        1.0     1.0     1.0     1.0
FPR         0.0     0.0      0.0        0.0        0.0     0.0     0.0     0.19
F1-score    0.8     1.0      1.0        1.0        1.0     1.0     1.0     1.0
Precision   1.0     1.0      1.0        1.0        1.0     1.0     1.0     0.99
Accuracy    99.47
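The per-class figures in Table 3 follow from the standard one-vs-rest definitions. The short sketch below uses the DoS column and the 80/20 split stated above to reproduce them.

```python
def metrics(tp, tn, fp, fn):
    # Standard per-class measures from a one-vs-rest confusion matrix.
    recall = tp / (tp + fn)                           # TPR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn)                              # false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, fpr, f1

# 80/20 split of the 357,952 instances.
train_size = int(357_952 * 0.8)                       # training rows
test_size = 357_952 - train_size                      # test rows

# DoS column of Table 3: 734 detected, 380 missed (classified as normal).
recall, precision, fpr, f1 = metrics(tp=734, tn=70_477, fp=0, fn=380)

# Only the 380 misclassified DoS rows are wrong across the whole test set.
accuracy = 100 * (test_size - 380) / test_size
```

This recovers the reported split sizes (286,361 / 71,591), the DoS recall of 0.66 and F1-score of 0.8, and the overall accuracy of 99.47%.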

Fig. 1 ROC curve

Fig. 2 Precision-recall curve

6 Conclusion

Attacks and malicious activity pose a more severe threat to privacy and security than ever before due to the steep increase in IoT devices and applications. The promise of IoT makes people's lives more convenient, but privacy and security issues remain a key concern. Several existing anomaly detection methods and models on DS2OS are considered to analyze and improve the detection performance. As a

Fig. 3 Confusion matrix

Table 4 Comparative study

References  Various researchers' proposed work                                Proposed work
[9]         Accuracy = 99.4%                                                  Accuracy = 99.47%
[10]        Accuracy of normal and malicious control = 98.9%
[11]        AdaBoost = 93.4%; AdaBoost_SMOTE = 100%
[12]        Accuracy = 98.22%, precision = 97.67%, recall = 97.28%, and
            F1-score = 97.35%
[13]        Accuracy = 99.2%
[14]        Accuracy = 98.28%
[15]        Precision-recall area of DoS attack = 47%, and accuracy = 99%
[16]        Accuracy = 99.3%
[17]        ROC area of spying = 79%, and accuracy = 99.4%
[18]        Malicious control (precision and F1-score = 99%), spying
            (precision = 96% and F1-score = 98%), scan (recall = 97% and
            F1-score = 98%), normal (precision = 98%, recall = 66%, and
            F1-score = 79%), wrong setup (precision = 99%), and accuracy = 99.4%
[19]        DoS (precision = 98.1%, recall = 64.8%, and F1-score = 78%), scan
            (precision = 99.2% and F1-score = 99.6%), normal
            (precision = 99.4% and F1-score = 99.7%)
[20]        Accuracy = 99.9%

result, CatBoosting is proposed to address the problem of anomaly identification in the DS2OS dataset. According to the presented model and evaluated findings, the proposed model has the ability to increase the security of IoT environments with heterogeneous and vulnerable devices. In comparison with various researchers' approaches, the experimental investigation and result analysis of the CatBoosting algorithm demonstrate a promising approach for anomaly identification and categorization of attack detection in smart home devices.

References

1. D.K.K. Reddy, H.S. Behera, J. Nayak, B. Naik, U. Ghosh, P.K. Sharma, Exact greedy algorithm
based split finding approach for intrusion detection in fog-enabled IoT environment. J. Inf.
Secur. Appl. 60(June), 102866 (2021)
2. Z.-K. Zhang, M.C.Y. Cho, C.-W. Wang, C.-W. Hsu, C.-K. Chen, S. Shieh, IoT security: Ongoing
challenges and research opportunities, in 2014 IEEE 7th International Conference on Service-
Oriented Computing and Applications (2014), pp. 230–234
3. Q. Jing, A.V. Vasilakos, J. Wan, J. Lu, D. Qiu, Security of the Internet of Things: perspectives
and challenges. Wirel. Netw. 20(8), 2481–2501 (2014)
4. J. Nayak, P.S. Kumar, D.K.K. Reddy, B. Naik, D. Pelusi, Machine learning and big data in
cyber-physical system: Methods, applications and challenges, in Cognitive Engineering for
Next Generation Computing (Wiley, 2021), pp. 49–91
5. A.P. Johnson, H. Al-Aqrabi, R. Hill, Bio-inspired approaches to safety and security in IoT-
enabled cyber-physical systems. Sensors 05 Feb 2020. [Online]. Available: https://www.mdpi.
com/1424-8220/20/3/844
6. K.B. Prakash, J. Nayak, B.T.P. Madhav, S. Padmanaban, V.E. Balas, Big data analytics and
intelligent techniques for smart cities (CRC Press, Boca Raton, 2021)
7. W. Zhou, Y. Jia, A. Peng, Y. Zhang, P. Liu, The effect of IoT new features on security and
privacy: New threats, existing solutions, and challenges yet to be solved. IEEE Internet Things
J. 6(2), 1606–1616 (2019)
8. U. Ghosh, M. Alazab, A.K. Bashir, A.-S.K. Pathan, Deep Learning for Internet of Things
Infrastructure, vol. s8-IX, no. 234 (CRC Press, Boca Raton, 2021)
9. M. Hasan, M.M. Islam, M.I.I. Zarif, M.M.A. Hashem, Attack and anomaly detection in IoT
sensors in IoT sites using machine learning approaches. Internet of Things 7, 100059 (2019)
10. R. Vangipuram, R.K. Gunupudi, V.K. Puligadda, J. Vinjamuri, A machine learning approach
for imputation and anomaly detection in IoT environment. Expert Syst. 37(5),
647–661 (2020)
11. P.B. Dash, J. Nayak, B. Naik, E. Oram, S.H. Islam, Model based IoT security framework using
multiclass adaptive boosting with SMOTE. Secur. Priv. 3(5), 1–15 (2020)
12. Y. Cheng, Y. Xu, H. Zhong, Y. Liu, Leveraging semisupervised hierarchical stacking temporal
convolutional network for anomaly detection in IoT communication. IEEE Internet Things J.
8(1), 144–155 (2021)
13. S. Latif, Z. Zou, Z. Idrees, J. Ahmad, A novel attack detection scheme for the industrial Internet
of Things using a lightweight random neural network. IEEE Access 8, 89337–89350 (2020)
14. D.K. Reddy, H.S. Behera, J. Nayak, P. Vijayakumar, B. Naik, P.K. Singh, Deep neural network
based anomaly detection in Internet of Things network traffic tracking for the applications of
future smart cities. Trans. Emerg. Telecommun. Technol. 32(7), 1–26 (2021)
15. N.K. Sahu, I. Mukherjee, Machine learning based anomaly detection for IoT Network:
(Anomaly detection in IoT Network), in 2020 4th International Conference on Trends in
Electronics and Informatics (ICOEI)(48184) (2020), no. Icoei, pp. 787–794

16. N. Islam et al., Towards Machine Learning Based Intrusion Detection in IoT Networks. Comput.
Mater. Contin. 69(2), 1801–1821 (2021)
17. P. Kumar, G.P. Gupta, R. Tripathi, A distributed ensemble design based intrusion detection
system using fog computing to protect the internet of things networks. J. Ambient Intell.
Humaniz. Comput. (2020)
18. K. Singh, N. Singh, An ensemble hyper-tuned model for IoT sensors attacks and anomaly
detection. J. Inf. Optim. Sci. 41(7), 1715–1739 (2020)
19. R. Bokka, T. Sadasivam, Deep learning model for detection of attacks in the Internet of Things
based smart home environment. Expert. Syst. 37(5), 725–735 (2021)
20. D.K.K. Reddy, H.S. Behera, G.M.S. Pratyusha, R. Karri, Ensemble bagging approach for IoT
sensor based anomaly detection, in Information, vol. 11(5) (Springer, Singapore, 2021), pp
647–665
21. L. Prokhorenkova, G. Gusev, A. Vorobev, A.V. Dorogush, A. Gulin, CatBoost: unbiased
boosting with categorical features. Adv. Neural Inf. Process. Syst. 31, 6638–6648 (2018)
22. R. Pinto, M2M USING OPC UA (2020). [Online]. Available: https://ieee-dataport.org/open-
access/m2m-using-opc-ua. [Accessed: 18 Sep 2020]
